AI software validation makes most Quality teams uneasy.
It’s not unjustified. The new Annex 22 provides a framework for AI use, but the guidance remains in draft, and most Quality teams still have questions, especially around validation.
Why? Quality teams are trained to validate systems that behave predictably, but depending on the AI feature, predicting the exact output from these tools is tough.
The good news is that AI can be validated; you just need to adjust how you think about expected results, risk, and acceptance criteria.
Traditional software validation is built around deterministic results, which means the same input always produces the same output. It’s clear, consistent, and repeatable.
Here’s an example:
You want to validate that the user access controls in an eQMS work as intended. Specifically, you’re confirming that an unauthorized user cannot approve a document. You create a test case where:
Log in as a user with a “Viewer” role
Navigate to a draft document
Attempt to select “Approve”
The expected result is that the “Approve” option is disabled. Every time you run the test, the result should be identical: a passing test always shows the “Approve” button disabled.
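If that same deterministic check were scripted, it might look something like the minimal sketch below. The Document class and role names are illustrative stand-ins, not ZenQMS functionality; the point is that the expected result is exact and identical on every run.

```python
# Minimal sketch of a deterministic access-control check. The Document class
# and role names below are illustrative assumptions, not real eQMS code.

class Document:
    """A draft document whose Approve action is gated by the user's role."""
    APPROVER_ROLES = {"Quality Approver", "Admin"}

    def __init__(self, status: str = "Draft"):
        self.status = status

    def can_approve(self, role: str) -> bool:
        # Same input, same output: a Viewer is never offered the Approve action.
        return self.status == "Draft" and role in self.APPROVER_ROLES


def test_viewer_cannot_approve():
    document = Document(status="Draft")
    # Expected result is exact and repeatable: "Approve" is disabled for a Viewer.
    assert document.can_approve(role="Viewer") is False
    assert document.can_approve(role="Quality Approver") is True


if __name__ == "__main__":
    test_viewer_cannot_approve()
    print("PASS: a Viewer cannot approve a draft document")
```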
However, AI tools produce non-deterministic results, meaning the same input can often give you different outputs.
For example, imagine using an AI tool to analyze an SOP and generate five training questions. Even when given the exact same prompt, the AI might generate different questions each time. But “different” doesn’t mean “wrong.” Each of those alternate outputs could be correct depending on your initial criteria.
This is where Quality teams get stuck during the validation process. If you can’t predict the exact output, how do you define an expected result? And if there’s no exact expected result, how do you prove the test passed?
Just like any other validation project, AI validation starts with two foundational steps:
1. Define the intended use
2. Perform a risk assessment
These two steps determine how much validation is required and how detailed your testing needs to be.
When you’re categorizing risk for an AI tool, ask questions like:
Is the AI advising, assisting, providing decision-support, or making autonomous decisions?
What happens if the AI output is wrong?
Would a mistake be hard to detect?
Would it affect patient safety or compliance?
Can users override or reject AI output?
The lower the risk attached to the AI tool, the lighter the validation lift. The more risk the tool introduces, the more intense the process.
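To make that categorization repeatable, some teams roll the answers to those questions into a simple scoring rubric. The sketch below is only an illustration: the weights, thresholds, and tier names are assumptions your own risk management procedure would define, not anything prescribed by Annex 22.

```python
# Illustrative risk rubric only: the scores, thresholds, and tier names are
# assumptions for this sketch, not requirements from Annex 22 or any regulation.

from dataclasses import dataclass

@dataclass
class AIRiskAssessment:
    autonomy: str                  # "advisory", "assistive", "decision-support", "autonomous"
    wrong_output_impact: str       # "negligible", "moderate", "critical"
    errors_hard_to_detect: bool
    affects_safety_or_compliance: bool
    user_can_override: bool

    def validation_tier(self) -> str:
        score = {"advisory": 0, "assistive": 1, "decision-support": 2, "autonomous": 3}[self.autonomy]
        score += {"negligible": 0, "moderate": 1, "critical": 3}[self.wrong_output_impact]
        score += 2 if self.errors_hard_to_detect else 0
        score += 3 if self.affects_safety_or_compliance else 0
        score += 0 if self.user_can_override else 2
        if score <= 2:
            return "Low risk: a test scenario may be sufficient"
        if score <= 6:
            return "Medium risk: structured test cases"
        return "High risk: full test cases with defined acceptance criteria and SME review"


# Example: an assistive search helper whose output a human immediately sees and can reject.
search_helper = AIRiskAssessment(
    autonomy="assistive",
    wrong_output_impact="negligible",
    errors_hard_to_detect=False,
    affects_safety_or_compliance=False,
    user_can_override=True,
)
print(search_helper.validation_tier())  # -> Low risk: a test scenario may be sufficient
```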
For example, ZenQMS offers an AI Smart Search tool that lets the user describe the documents they’re looking for in plain language (e.g., “I need documents written by Jane Doe effective after January 1, 2025.”) and automatically generates the corresponding search filters.
The tool reduces the time it takes to find a document (always a plus for Quality teams) but doesn’t impact compliance, patient safety, or critical decision-making. It’s a low-risk AI tool with a smaller validation requirement.
Low-risk AI tools may only require a test scenario instead of full test cases.
A test scenario is a high-level check that a tool meets its intended use. A test case details the steps, inputs, expected results, and actual results.
A test scenario for the AI Smart Search tool might read “Verify the correct filters are applied according to the prompted requirements.” After a test run of the tool, the tester might document a comment like “AI filters functioned accurately, creating the correct filters based on the prompt” and then include a screenshot of the output as evidence.
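As a concrete illustration of what the tester is checking, the snippet below compares the filters the tool produced during the run against the intent of the plain-language request. The filter keys are assumed for this sketch, not actual ZenQMS field names.

```python
# Illustrative only: what "the correct filters" might look like for the example
# prompt above. The filter keys are assumptions, not actual ZenQMS field names.

prompt = "I need documents written by Jane Doe effective after January 1, 2025."

expected_filter_intent = {
    "author": "Jane Doe",
    "effective_after": "2025-01-01",
}

ai_generated_filters = {  # captured from the tool during the test run
    "author": "Jane Doe",
    "effective_after": "2025-01-01",
}

# Scenario-level check: the generated filters match the intent of the request.
assert ai_generated_filters == expected_filter_intent
print("Filters match the prompted requirements")
```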
For higher-risk AI tools, you’ll need structured test cases with clearly defined acceptance criteria, which brings us back to the main question surrounding AI validation: How do you set an expected outcome for an AI tool?
AI validation looks pretty similar to normal software validation. The difference mostly lies in how you define the expected outcome and how you assess a pass or fail.
For AI validation, the expected outcome focuses more on the intention of the tool rather than a literal result.
Let’s go back to the AI training question generator example. Your expected outcome won’t say “The system generates the following 5 training questions…”; instead, it could say “The tool generates 5 questions that test the user’s knowledge of the core components of the SOP.”
After the system generates a result, the SOP subject matter expert (SME) evaluates the response and determines whether it meets the requirements and the original intention. The prompt, the output, and the rationale for the pass or fail decision are all documented as objective evidence.
This human touch is the key.
AI outputs can vary, so a passing test takes more than a simple comparison against an exact expected outcome. You must incorporate an SME review into your validation process to judge whether each output is acceptable.
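Captured as a record, that evidence might look something like the sketch below. The field names and acceptance criteria are illustrative assumptions, not a prescribed format.

```python
# Sketch of documenting a non-deterministic AI test execution. Field names and
# acceptance criteria are illustrative assumptions, not a prescribed format.

from dataclasses import dataclass, field
from datetime import date

@dataclass
class AITestRecord:
    prompt: str
    ai_output: str                    # the exact output returned during this run
    acceptance_criteria: list[str]    # intention-based, not a literal expected string
    sme_reviewer: str
    review_date: date
    passed: bool
    rationale: str                    # why the SME judged the output acceptable (or not)
    evidence: list[str] = field(default_factory=list)  # e.g., screenshot file names

record = AITestRecord(
    prompt="Generate 5 training questions covering the core components of the SOP.",
    ai_output="(the five questions returned by the tool on this run)",
    acceptance_criteria=[
        "Exactly 5 questions are generated",
        "Each question tests a core component of the SOP",
        "No question contradicts the SOP content",
    ],
    sme_reviewer="SOP subject matter expert",
    review_date=date(2025, 1, 15),
    passed=True,
    rationale="Wording differs between runs, but every question maps to a core "
              "section of the SOP and stays within the intended use.",
    evidence=["training_questions_run1.png"],
)
```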
At the end of the day, Quality teams are responsible for validating the systems they use. However, AI introduces an added layer of dependency on the vendor.
AI tools rely on models, configurations, and controls that live largely behind the scenes. That means part of your risk assessment should include how your vendor manages and monitors their AI tools.
Before you adopt an AI tool, come prepared with questions about how the vendor manages, monitors, and controls its AI features.
Remember, validation falls on the Quality team (not the vendor), but life sciences teams need to trust that their vendor understands the regulatory environment they operate in and has designed their AI features with compliance in mind.
AI doesn’t change the fundamentals of validation. Once you shift the focus from exact results to acceptable outcomes, the path forward is clear.
At the end of the day, validating AI software comes down to:
Clearly defining intended use
Performing a thoughtful risk assessment
Establishing what “acceptable” looks like
Applying qualified human judgment
Documenting decisions with objective evidence
AI should – and can – support Quality teams the same way good software always has: by reducing manual effort, improving consistency, and freeing them to focus on what matters most.