Regulation · 2026-03-29

From Training to Clearance: Validating AI Models in Healthcare

When working on AI for healthcare, validation is not just a final step. It is something that starts early and continues even after deployment. The objective is simple in theory: build systems that behave as expected, remain stable over time, and are safe in practice. In reality, each stage reduces uncertainty step by step.


Starting with the problem

Everything begins with a clear definition of what the model is supposed to do. This sounds obvious, but small ambiguities at this stage often create issues later.

It helps to think in concrete terms:

  • what goes in (inputs),
  • what comes out (outputs),
  • and who will use it.

For example, detecting brain hemorrhage on CT scans in an emergency setting implies time constraints, high sensitivity requirements, and direct interaction with radiologists. The context already shapes the technical choices.
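The inputs/outputs/users framing can be made concrete as a small structured spec. A minimal sketch for the hemorrhage example, where the class and field names are illustrative, not a standard:

```python
from dataclasses import dataclass

# Hypothetical intended-use specification for the hemorrhage-detection
# example. Field names and values are illustrative only.
@dataclass
class IntendedUse:
    task: str        # what the model is supposed to do
    inputs: list     # what goes in
    outputs: list    # what comes out
    users: list      # who will use it
    setting: str     # clinical context and its constraints

spec = IntendedUse(
    task="Detect intracranial hemorrhage on non-contrast head CT",
    inputs=["non-contrast head CT series (DICOM)"],
    outputs=["per-study probability of hemorrhage", "binary flag"],
    users=["emergency radiologists"],
    setting="emergency department; time constraints, high sensitivity required",
)
```

Writing the spec down this early makes the later ambiguities visible: every field here is a decision that shapes data collection and evaluation.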


About the data

Data is where many practical challenges appear. A model trained on clean, homogeneous data may not behave the same way in real settings.

In practice, the goal is to get data that reflects variability: different hospitals, devices, populations, and acquisition conditions.

Some recurring points to watch:

  • imbalance (rare pathologies),
  • differences between scanners,
  • annotation variability between experts.

Curation is not only about cleaning data; it is also about understanding what the data represents and what it might miss.
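A first concrete step in that direction is a simple audit of the dataset's metadata: class balance, site coverage, scanner mix. A toy sketch, with made-up study records:

```python
from collections import Counter

# Illustrative study-level metadata; sites, scanners, and labels are made up.
studies = [
    {"site": "hospital_A", "scanner": "vendor_X", "label": "hemorrhage"},
    {"site": "hospital_A", "scanner": "vendor_X", "label": "normal"},
    {"site": "hospital_B", "scanner": "vendor_Y", "label": "normal"},
    {"site": "hospital_B", "scanner": "vendor_Y", "label": "normal"},
]

by_label = Counter(s["label"] for s in studies)   # reveals imbalance
by_site = Counter(s["site"] for s in studies)     # reveals site coverage
prevalence = by_label["hemorrhage"] / len(studies)

print(by_label, by_site)
print(f"positive prevalence: {prevalence:.0%}")   # 25% in this toy set
```

Even this trivial count surfaces the recurring issues above: a 25% prevalence here would already be far higher than real-world hemorrhage rates, and two sites with one scanner vendor each is a narrow slice of acquisition conditions.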


Training the model

At this stage, the workflow is more standard: split the data, train, adjust, repeat.

One habit that becomes essential is protecting the test set. It should remain untouched until the end; otherwise performance estimates become optimistic without anyone noticing.
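In medical imaging, protecting the test set also means splitting at the patient level, so that scans from the same patient never end up on both sides. A minimal sketch (function name and data are illustrative; group-aware splitters in libraries like scikit-learn do the same job):

```python
import random

# Patient-level split: all scans from one patient stay on the same side,
# so the test set is not contaminated by near-duplicate images.
def split_by_patient(patient_ids, test_fraction=0.2, seed=42):
    patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_test = max(1, int(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx

# Toy list of one patient id per scan; note p1 and p3 have multiple scans.
scan_patients = ["p1", "p1", "p2", "p3", "p3", "p4", "p5"]
train_idx, test_idx = split_by_patient(scan_patients)
# No patient appears on both sides of the split.
assert not {scan_patients[i] for i in train_idx} & {scan_patients[i] for i in test_idx}
```

A naive scan-level shuffle would quietly leak patients across the split, which is one of the ways test estimates become optimistic without being noticed.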

For imaging tasks, such as CT classification, convolutional models are often used, but the architecture matters less than the discipline around evaluation.


First checks: internal validation

Internal validation gives a first idea of performance. Metrics like sensitivity and specificity are useful, especially in medical contexts where missing a case and raising a false alarm do not have the same implications.

For instance:

  • high sensitivity helps detect most critical cases,
  • specificity helps avoid unnecessary alerts.

These numbers are reassuring, but they reflect only the data used during development. They are not the full picture.
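Both metrics fall out of the same confusion-matrix counts. A self-contained sketch with toy labels, where 1 means hemorrhage present:

```python
# Sensitivity and specificity from binary labels (1 = hemorrhage present).
def sens_spec(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)   # fraction of real cases caught
    specificity = tn / (tn + fp)   # fraction of normals left alone
    return sensitivity, specificity

# Toy example: one missed hemorrhage (index 3), one false alarm (index 8).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
sens, spec = sens_spec(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.75 and 0.83
```

The asymmetry is visible even in this toy case: the single missed hemorrhage (a sensitivity cost) and the single false alarm (a specificity cost) would have very different consequences in an emergency workflow.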


What changes with new data

Things often shift when the model is tested on external data. This is where generalization is really assessed.

Using data from a new hospital or a different population tends to reveal gaps: performance may decrease, sometimes significantly.

This step is less about achieving high numbers and more about understanding how the model behaves outside its comfort zone.
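When external performance drops, it helps to know whether the drop is real or just small-sample noise. One common way to check is a bootstrap confidence interval; a sketch with hypothetical numbers (40 external positives, 30 detected):

```python
import random

# Bootstrap 95% confidence interval for sensitivity on an external set,
# to distinguish a real performance drop from sampling noise. Toy data.
def bootstrap_sensitivity(y_true, y_pred, n_boot=2000, seed=0):
    rng = random.Random(seed)
    positives = [(t, p) for t, p in zip(y_true, y_pred) if t == 1]
    estimates = []
    for _ in range(n_boot):
        sample = [rng.choice(positives) for _ in positives]
        estimates.append(sum(p for _, p in sample) / len(sample))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]

# Hypothetical external hospital: 40 hemorrhage cases, 30 detected (0.75).
y_true = [1] * 40
y_pred = [1] * 30 + [0] * 10
low, high = bootstrap_sensitivity(y_true, y_pred)
print(f"external sensitivity = 0.75, 95% CI ≈ [{low:.2f}, {high:.2f}]")
```

If the internal estimate (say, 0.94) sits outside that interval, the external drop is hard to dismiss as chance, and the gap itself becomes the finding to investigate.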


A useful distinction

It is helpful to separate two ideas:

  • verification: does the system meet its technical specifications?
  • validation: does it actually help users in practice?

A model can be correct from a software perspective and still not be useful in a clinical workflow. Keeping both perspectives avoids focusing only on metrics.


Testing beyond the “ideal case”

Real-world data is rarely clean. Images can be noisy, incomplete, or acquired under different conditions.

Testing the model under such variations helps reveal its limits:

  • adding noise,
  • degrading image quality,
  • changing acquisition parameters.

These tests are less about passing or failing, and more about identifying where the model starts to break.
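The noise test above can be scripted in a few lines. A deliberately simplified sketch: the "model" here is just a threshold on mean intensity, standing in for a real classifier, and the noise levels are arbitrary.

```python
import random

# Perturbation sketch: add Gaussian noise of increasing strength to a toy
# "image" and watch when the decision changes. `model` is a stand-in
# threshold classifier, purely illustrative.
def model(image):
    return 1 if sum(image) / len(image) > 0.5 else 0

def add_noise(image, sigma, rng):
    return [pixel + rng.gauss(0.0, sigma) for pixel in image]

rng = random.Random(7)
image = [0.6] * 64          # a clearly "positive" toy input
for sigma in [0.0, 0.1, 0.5, 1.0]:
    noisy = add_noise(image, sigma, rng)
    print(f"sigma={sigma:.1f} -> prediction {model(noisy)}")
```

For a real model the same loop, run over a whole test set, produces a curve of performance versus perturbation strength; the point where that curve collapses is the limit the paragraph above is after.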


Moving closer to practice

At some point, evaluation needs to include real users. This can take different forms, from retrospective analysis to reader studies.

A common setup is to compare clinicians with and without the model, and observe how decisions change. The question becomes more practical: does the tool actually support the task?
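Because the same cases are read with and without the model, the comparison is paired, and only the discordant cases carry information. A standard way to test such paired decisions is McNemar's exact test; a sketch with made-up counts:

```python
from math import comb

# Exact McNemar test on paired reader decisions. Only discordant pairs
# matter: b = cases correct only WITH the tool, c = correct only WITHOUT.
def mcnemar_exact(b, c):
    n = b + c
    k = min(b, c)
    # two-sided exact binomial p-value under H0: discordance is symmetric
    p_one_sided = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(1.0, 2 * p_one_sided)

# Hypothetical reader study: assistance fixed 9 errors, introduced 2 new ones.
p_value = mcnemar_exact(b=9, c=2)
print(f"p = {p_value:.3f}")
```

The asymmetry (9 vs 2) is what the test weighs; the total number of cases both readings got right or wrong drops out entirely, which is why reader studies need enough *discordant* cases, not just enough cases.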


Thinking in terms of risk

Before deployment, it is necessary to think about what could go wrong.

For example, missing a hemorrhage may delay treatment. This leads to mitigation strategies, such as alerts or keeping a human in the loop.

This type of reasoning shifts the focus from performance alone to consequences.


Writing things down

Documentation often feels secondary, but it plays a central role. It forces clarity about:

  • how the model was trained,
  • what data was used,
  • what the known limitations are.

It also becomes necessary for regulatory processes.
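The three bullets above map naturally onto something like a "model card". A minimal sketch, where the section names and all values are illustrative, not a regulatory template:

```python
# Minimal documentation sketch in the spirit of a model card.
# All fields and numbers are made-up examples.
model_card = {
    "intended_use": "hemorrhage detection on head CT, emergency setting",
    "training_data": "retrospective scans from two hospitals, two vendors",
    "evaluation": {"internal sensitivity": 0.94, "external sensitivity": 0.88},
    "known_limitations": [
        "not evaluated on pediatric patients",
        "performance untested on other scanner vendors",
    ],
}

for section, content in model_card.items():
    print(f"{section}: {content}")
```

Keeping this structure versioned alongside the model forces the clarity the paragraph above describes, and most of its fields reappear verbatim in a regulatory submission.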


About regulatory clearance

For deployment as a medical device in the United States, clearance from the U.S. Food and Drug Administration is required.

A common pathway is the 510(k), which involves demonstrating that the system is substantially equivalent to a legally marketed predicate device, along with evidence on performance and safety.

This step formalizes much of the work done earlier: intended use, validation results, and risk analysis.


After deployment

Deployment is not the end. Data evolves, devices change, and usage patterns shift.

Monitoring helps detect:

  • data drift,
  • performance changes over time,
  • unexpected user behavior.

Feedback from real use becomes a valuable source of information.
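Data drift can be watched with simple distributional statistics. One common choice is the Population Stability Index (PSI) over a histogram of some input statistic; values above roughly 0.2 are a frequent rule-of-thumb flag. A sketch with toy histograms:

```python
from math import log

# Population Stability Index (PSI) between a reference histogram (fixed at
# deployment) and a current one, over the same bins. Toy counts below.
def psi(reference_counts, current_counts, eps=1e-6):
    total_ref = sum(reference_counts)
    total_cur = sum(current_counts)
    value = 0.0
    for r, c in zip(reference_counts, current_counts):
        p = max(r / total_ref, eps)   # eps guards against empty bins
        q = max(c / total_cur, eps)
        value += (q - p) * log(q / p)
    return value

baseline = [100, 300, 400, 200]   # e.g. mean-intensity bins at deployment
today    = [300, 350, 250, 100]   # the same bins this week
print(f"PSI = {psi(baseline, today):.3f}")   # well above the ~0.2 flag
```

The appeal of PSI is that it needs no labels: it can run continuously on incoming data long before enough ground truth accumulates to measure performance directly.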


Updating the model

Over time, updates may be needed. These can include retraining with new data or small adjustments to the model.

To manage this safely, a predefined framework such as a Predetermined Change Control Plan (PCCP) can be used. The idea is to allow improvements while maintaining control over performance and risk.
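In code, the core of such a framework can be as simple as an acceptance gate: a retrained model is deployed only if it stays inside predefined performance bounds on a locked evaluation set. A sketch, with made-up thresholds:

```python
# Change-control gate in the spirit of a PCCP: predefined bounds, checked
# against a candidate model's metrics on a locked evaluation set.
# Threshold values are made-up examples.
ACCEPTANCE = {"sensitivity": 0.90, "specificity": 0.80}

def accept_update(candidate_metrics, bounds=ACCEPTANCE):
    failures = {k: v for k, v in candidate_metrics.items()
                if k in bounds and v < bounds[k]}
    return len(failures) == 0, failures

ok, failures = accept_update({"sensitivity": 0.93, "specificity": 0.78})
print(ok, failures)   # rejected: specificity is below its bound
```

The point is that the bounds are fixed *before* retraining, so an update cannot quietly trade away the property the original validation established.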


Final reflection

Validation is not a single checkpoint. It is a continuous process that starts with defining the problem and extends throughout the model’s lifecycle.

Each step adds a layer of understanding: first about the data, then about the model, and finally about its behavior in real conditions.

The goal is not to eliminate uncertainty completely, but to reduce it enough to make the system reliable in practice.