Validating Medical AI Algorithms

March 2025, Sam Moreland

Depending on the type of approval process you are using (see clearance types), you have 2 main methods of validating your algorithms: controlled clinical trials and real world evidence. These 2 methods are not mutually exclusive, but they can be very divergent depending on your algorithm and use case. The main differences come down to the method of labelling and the cost/generalisability trade-off. The rest of the article treats this in a medical context, but it is still highly relevant for non-medical settings.

1 - Types of Trials

Controlled Clinical Trials (CCT)

The vast majority of algorithms are cleared by validating in a controlled setting. This means that participants (or patients) are placed in a tightly controlled environment and a single factor is changed in order to view different responses. With CCTs the labelling of the data is done inherently in the method of data capture, so they don't usually require post-hoc labelling.

Example: I’m creating a respiratory rate algorithm for my ECG patch. I get all participants to sit in an upright position, connected to the same EtCO2 monitor, and ask them to breathe at fixed rates (1 minute each) of 6, 15, 30, 40 and 60 breaths per minute.
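
For illustration, here is a minimal sketch of how the resulting CCT dataset might be summarised (the file and column names are assumptions, not a real dataset): group the algorithm's estimates by the paced reference rate and report bias and RMSE per rate, so you can see where performance starts to break down.

```python
# Sketch of a per-rate accuracy summary for a CCT like the one above.
# Assumes a hypothetical CSV with one row per 1-minute stage:
# participant_id, reference_brpm (the paced rate), estimated_brpm (algorithm output).
import numpy as np
import pandas as pd

df = pd.read_csv("cct_respiratory_rate.csv")  # hypothetical file name
df["error"] = df["estimated_brpm"] - df["reference_brpm"]

# Bias and RMSE for each paced rate (6, 15, 30, 40, 60 BrPM).
grouped = df.groupby("reference_brpm")["error"]
summary = pd.DataFrame({
    "bias": grouped.mean(),
    "rmse": grouped.apply(lambda e: np.sqrt(np.mean(np.square(e)))),
    "n": grouped.count(),
})
print(summary)
```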

Sometimes, depending on the sophistication of the field, synthetic data can be used in place of patient data.
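
As a toy illustration of the synthetic route (assuming a simple sinusoidal model of respiration, which is far cruder than what a real simulation framework would use), you can generate a waveform at a known rate and check that a basic spectral estimator recovers it:

```python
# Toy sketch: generate a synthetic respiration-like waveform at a known
# rate and check that a simple spectral estimator recovers it.
import numpy as np

fs = 25.0                 # sampling rate in Hz (assumed)
true_brpm = 15.0          # known ground-truth rate
t = np.arange(0, 120, 1 / fs)
signal = np.sin(2 * np.pi * (true_brpm / 60) * t) + 0.3 * np.random.randn(t.size)

# Estimate the dominant frequency via the FFT and convert to breaths per minute.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(t.size, d=1 / fs)
band = (freqs >= 0.05) & (freqs <= 1.2)   # plausible respiration band (3-72 BrPM)
estimated_brpm = 60 * freqs[band][np.argmax(spectrum[band])]
print(f"true={true_brpm} BrPM, estimated={estimated_brpm:.1f} BrPM")
```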

Real World Evidence (RWE)

This is a bit more self-explanatory. You gather your data in a non-controlled environment and get experts to label the data afterwards.

Example: Participants wear a watch with a PPG sensor which collects data. This data is then labelled by a nurse for PPG peaks to create a pulse rate algorithm.
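
A minimal sketch of how such labels might be scored (the peak times and tolerance are hypothetical) is to match detected peaks to the nurse's labels within a tolerance window and compute sensitivity and precision:

```python
# Compare algorithm-detected PPG peak times against expert (nurse) labels,
# counting a detection as correct if it lands within a tolerance window
# of a labelled peak. All values here are hypothetical.
import numpy as np

def match_peaks(detected_s, labelled_s, tol_s=0.15):
    """Greedy one-to-one matching of detected peak times to labelled peak times (seconds)."""
    detected = sorted(detected_s)
    labelled = list(sorted(labelled_s))
    true_pos = 0
    for t in detected:
        if labelled:
            # Closest unmatched label to this detection.
            j = int(np.argmin([abs(t - l) for l in labelled]))
            if abs(t - labelled[j]) <= tol_s:
                true_pos += 1
                labelled.pop(j)
    false_pos = len(detected) - true_pos
    false_neg = len(labelled)  # labels left unmatched
    return true_pos, false_pos, false_neg

tp, fp, fn = match_peaks(detected_s=[0.8, 1.6, 2.5], labelled_s=[0.82, 1.58, 2.4, 3.2])
print(f"sensitivity={tp / (tp + fn):.2f}, precision={tp / (tp + fp):.2f}")
```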

2 - What's the difference?

I’m going to go over the trade-offs and nuances of each method for your validation activities.

Data generalisability

-Patient Population

CCTs may not be able to include your desired patient population. I worked on a respiratory rate algorithm that measures people breathing from 6-60 breaths per minute (BrPM). Our target patient population was people with COPD. Even if we physically could get COPD patients to sustain breathing at 60 BrPM, it would be super unethical to do so. This is similar to SpO2, where you need to desaturate participants down to 70% or even 60%, which even for the fittest people is very hard.

RWE, however, can be captured on your desired patient population (assuming you've correctly identified the intended use case).

-Environment

CCTs may not be able to be collected in a real-world environment. Going along with the previous SpO2 example, you need a lot of specialist equipment in a lab to be able to perform the validation. This is not the same environment as a patient's bedroom at night.

Here RWE can capture data in the expected environment; however, it may not be able to capture the validation data that you need. There is no way you would be able to monitor the reference SpO2 level of someone who is sleeping naturally at home, as it requires a blood draw and gas analysis.

-Measurement Range

CCTs can be controlled in a way that you are able to gather all the data you need. For RWE you may never get the needed measurement range for your product. For heart rate you need to measure between 30-240 BPM; 240 is a huge outlier, and it is highly unlikely you would ever get over 200 BPM in your real-world dataset.
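
A quick way to see this in practice (a sketch with assumed file and column names) is to bin your real-world reference values against the claimed range and look for empty bins at the extremes:

```python
# Quick check of how well a real-world heart rate dataset covers the
# claimed measurement range (30-240 BPM, per the example above).
# File and column names are hypothetical.
import numpy as np
import pandas as pd

REQUIRED_BINS = [(30, 60), (60, 100), (100, 140), (140, 180), (180, 240)]

hr = pd.read_csv("rwe_heart_rate.csv")["reference_bpm"].to_numpy()

for lo, hi in REQUIRED_BINS:
    n = int(np.sum((hr >= lo) & (hr < hi)))
    print(f"{lo:3d}-{hi:3d} BPM: {n:6d} samples ({100 * n / len(hr):.1f}%)")
# Empty upper bins are a signal that you will need a CCT (or stress protocol)
# to cover the top of the range rather than waiting for RWE to provide it.
```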

Time

CCTs can be performed in a few months, whereas RWE can last a long time (years or decades) depending on your endpoints.

Cost

It may not come as a surprise that CCTs can be orders of magnitude cheaper than RWE studies, which require specialist doctors to manually label the data.

Typical costs of CCTs are $10Ks to $100Ks, whereas RWE studies can be $100Ks to $100Ms.

3 - Other General Considerations

These considerations are for both CCTs and RWE.

Users

Who will be using your product? Doctors and nurses may use your product differently; can elderly patients apply your ECG patch correctly? Different users can and will use your product in ways you cannot imagine.

Hardware Agnostic

If you're wanting to develop an algorithm that's hardware agnostic, you'll need to understand the hardware that you have been testing on. You'll likely have to determine the minimum performance characteristics needed for your needs and prove them out on multiple different units.
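
One way to make that concrete (a sketch, with an assumed acceptance threshold, file and column names) is to compute your performance metric per hardware unit rather than pooled, and flag any unit that misses the minimum characteristic:

```python
# Check that algorithm error is acceptable on every hardware unit you
# tested, rather than only pooled across units. Names are hypothetical.
import pandas as pd

MAX_ALLOWED_RMSE = 3.0  # acceptance threshold in the algorithm's output units (assumed)

df = pd.read_csv("multi_unit_validation.csv")  # columns: unit_id, reference, estimated
df["sq_error"] = (df["estimated"] - df["reference"]) ** 2

per_unit_rmse = df.groupby("unit_id")["sq_error"].mean().pow(0.5)
print(per_unit_rmse)

failing = per_unit_rmse[per_unit_rmse > MAX_ALLOWED_RMSE]
if not failing.empty:
    print("Units below the minimum performance characteristic:", list(failing.index))
```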

Algorithm Iteration

Predetermined Change Control Plans (PCCPs) are fairly new and have the ability to vastly improve patient care. They are an agreement with the FDA on the methods of algorithm improvement that you can use without having to submit a new filing. Usually this covers improving your current model in its environment rather than making a completely different model in a new environment.

New filings will be needed if you are moving from a less risky intended use to a more risky intended use (informational to diagnostic), involving new sources of data that you've not previously supplied, or using architecturally new algorithmic methods (going from a random forest to a neural network).

GenAI

GenAI is different to other neural network based approaches, and the regulators recognise this. While there are currently no guidelines on the specific use of GenAI, they have said that the performance evaluation methodologies needed for sound oversight:

“will be governed by the specific intended use and design of the GenAI-enabled device, some of which may necessitate formulation of new performance metrics for certain intended uses.”

You should prepare for things to change very fast.

4 - How should I validate?

I’ve just thrown a lot of information at you, and the answer is: it depends. Some uses are just impossible to validate using RWE, like SpO2. Some uses have little difference, such as in radiography. For me, there are 5 steps.

  1. Generate a CCT dataset that's used to give best-case performance when the product is used directly according to the intended use.

  2. Generate a subset CCT dataset of your target patient population who may not be able to do the full CCT but can do a partial one (COPD patients breathing at low frequencies).

  3. Generate as much unlabelled RWE as possible and use some unsupervised statistics to understand the real-world performance (see the sketch after this list). If your data tell you that 90% of your patients are breathing at 50-60 BrPM, your algorithm is definitely not generalising.

  4. As your product is used more, prioritise labelling your edge case RWE as you learn more.

  5. Set up a good feedback mechanism internally to find outlier data values and forward customer success reports. These two sources will tell you about abnormal situations and data. You can target these situations to find weaknesses in your product and improve them (this can also be integrated into your CAPA process).
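
As a sketch of the unsupervised check mentioned in step 3 (thresholds and column names are illustrative assumptions, not a validated rule), you can look at the distribution of the algorithm's own outputs on unlabelled real-world data and flag implausible concentrations:

```python
# Unsupervised sanity check on unlabelled real-world predictions:
# inspect the distribution of the algorithm's outputs and flag
# physiologically implausible concentrations.
import numpy as np
import pandas as pd

preds = pd.read_csv("rwe_unlabelled_predictions.csv")["estimated_brpm"].to_numpy()

# Fraction of estimates in a band that should be rare at rest (50-60 BrPM).
high_band = np.mean((preds >= 50) & (preds <= 60))
print(f"Estimates at 50-60 BrPM: {100 * high_band:.1f}%")

# Compare against a rough expectation for resting adults
# (most estimates between 8 and 25 BrPM) - an assumed heuristic.
expected_band = np.mean((preds >= 8) & (preds <= 25))
if high_band > 0.5 or expected_band < 0.5:
    print("Output distribution looks implausible - investigate generalisation.")
```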
