AAV-ML for Experimentalists #5: How to Assess ML Claims in AAV Without Being an ML Expert
TL;DR The scientist who knows when to trust ML results, and when to push back, is the most useful scientist in the room. Be that scientist.
There is an asymmetry built into every conversation between an ML practitioner and an experimental AAV scientist, and it works against you.
ML practitioners report results in a language that was not designed for experimental decision-making: AUC, precision-recall, held-out accuracy, latent space coverage. These are real concepts with real meaning, but they are also words that land with the weight of rigor even when the underlying work does not earn that weight. An accuracy of 90% sounds like a good thing. AAV scientists are trained, correctly, to trust quantitative results. You were taught that numbers mean something. That discipline is a virtue. In this context, it is also a vulnerability.
The result is a field where genuinely useful ML work gets adopted uncritically because the metrics looked good, and where other genuinely useful work gets dismissed because someone asked a sharp question the presenter could not answer. Both failure modes are expensive. The first wastes your experimental budget. The second wastes a real opportunity. What you need is not more skepticism — it is better judgment. And judgment, unlike expertise, can be built without a PhD in machine learning.
The Landscape of Claims You'll Encounter
Not all ML claims carry the same stakes, and they do not all have the same blind spots.
- A peer-reviewed paper has survived some form of scrutiny, but peer review in computational biology is uneven, and reviewers rarely have the domain knowledge to catch a mismatch between the training data and the application claim. Publication is not validation.
- A conference talk is almost always preliminary, and the presenter controls the narrative: they chose which results to show, which comparator to use, which failure to leave on the cutting room floor.
- A vendor pitch is optimized for conversion, not for your experimental success. The incentives are misaligned by design.
- A collaborator proposal is the most nuanced case; here the person presenting genuinely wants the work to succeed, but "wanting it to work" and "having evidence it will work on your biology" are different things.
Each context requires a different level of scrutiny and a different set of follow-up questions. But the three questions in the next section cut through all four.
The Three Diagnostic Questions That Cut Through Everything
Use these three questions to assess any ML claim the first time you hear or read it.
The first: compared to what?
Every performance claim is a comparison, whether or not the comparison is stated. When someone says a model achieves 90% accuracy, or predicts productive variants 3x better than random, the number only has meaning relative to a baseline.
The relevant question is whether that baseline is fair.
- Random selection is a low bar, often embarrassingly low, because the sequence space being explored is already constrained by prior biology.
- If the comparator is "random variants from our library," ask what that library looked like.
- If the comparator is a previous ML model, ask whether anyone has compared the result to what an experienced capsid engineer would have predicted from sequence conservation alone.
- The simpler the baseline you can beat, the less the performance number tells you. The sketch after this list puts all three baselines side by side.
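To make the comparison concrete, here is a minimal sketch of a fair baseline check. Everything in it is hypothetical: the variable names are invented and the data is random placeholder values, so all three numbers come out near chance. The shape of the analysis is the point: compute the same top-k hit rate for the model, for a simple conservation score, and for random selection, and look at the margin over the strongest non-ML baseline.

```python
import random

random.seed(0)

# Hypothetical toy data: each variant carries a model score, a simple
# conservation-based score, and a measured outcome (True = functional).
# In a real evaluation these come from your own screen.
variants = [
    {"model": random.random(),
     "conservation": random.random(),
     "functional": random.random() < 0.05}
    for _ in range(10_000)
]

def hit_rate_top_k(variants, key, k=100):
    """Fraction of functional variants among the top k ranked by `key`."""
    top = sorted(variants, key=lambda v: v[key], reverse=True)[:k]
    return sum(v["functional"] for v in top) / k

# Random selection baseline: pick 100 variants with no ranking at all.
random_pick = sum(v["functional"] for v in random.sample(variants, 100)) / 100

print(f"random selection    : {random_pick:.3f}")
print(f"conservation top-100: {hit_rate_top_k(variants, 'conservation'):.3f}")
print(f"model top-100       : {hit_rate_top_k(variants, 'model'):.3f}")
# The number that matters is the gap between the model and the strongest
# non-ML baseline, not the gap over random.
```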
The second: measured on what data?
This is the question most people skip because it feels technical. It is not. It is a biology question dressed in computational clothing.
A model learns from data; what it learns to do well is predict things that look like that data.
If the training data came from a pooled packaging screen in HEK293 cells using an AAV2 backbone, the model has learned something about AAV2 behavior in HEK293 cells. That is a real thing to know. It is not the same thing as knowing how your AAV9 construct will perform in mouse liver after systemic administration.
The gap between training context and application context is where most ML failures happen in this field.
Not because the model is broken, but because it is being asked a different question than the one it was trained to answer.
You cannot see this gap in the performance metrics.
The model will report confident predictions either way.
This is why the data question is always the right question; it does not require you to understand the model at all. You only need to understand the biology of where those numbers came from.
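One way to operationalize the data question without touching the model at all: measure how close your candidate sequences are to anything in the training set. Below is a minimal sketch under strong simplifying assumptions (aligned, equal-length sequences; entirely made-up toy peptides; an arbitrary 0.7 identity cutoff chosen only for illustration). Real capsid sequences would need a proper alignment first.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences."""
    assert len(a) == len(b), "sketch assumes aligned, equal-length sequences"
    return sum(x == y for x, y in zip(a, b)) / len(a)

def max_identity_to_training(query: str, training_seqs: list) -> float:
    """How similar is this query to the closest thing the model has seen?"""
    return max(identity(query, t) for t in training_seqs)

# Hypothetical toy sequences, equal length for simplicity.
training_seqs = ["LAVPFK", "LAVPYK", "IAVPFK"]
queries = {"near_training": "LAVPFR", "far_from_training": "GGGSGG"}

for name, q in queries.items():
    sim = max_identity_to_training(q, training_seqs)
    # The 0.7 cutoff is arbitrary; the point is to flag extrapolation.
    regime = "interpolation-like" if sim >= 0.7 else "extrapolation"
    print(f"{name}: max identity {sim:.2f} -> treat predictions as {regime}")
```

A query that sits far from everything in the training set is not disqualified, but its predictions deserve far less weight than the held-out metrics suggest.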
The third: does the metric match the biology?
Predicted fitness is not manufacturability.
Manufacturability is not transduction efficiency.
Transduction efficiency in a cell line is not in vivo tropism.
These are different biological quantities, and they are only weakly correlated.
An ML model optimized for predicted fitness (a composite score derived from a pooled screen) has been optimized for that number.
If your experimental goal is something else, you are not flying with a broken instrument; you are flying with the wrong instrument entirely.
Ask: what exactly was the model trained to predict, and is that the same thing I need to know?
If the answer is "close but not quite," that is important information.
It does not mean the model is useless; it means you need to think carefully about how to interpret its outputs for your specific application.
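One concrete way to do that interpretation: if you have even a small set of variants measured on both axes, check how well the trained-on quantity tracks the quantity you actually need. Here is a minimal rank-correlation sketch; the paired values are invented for illustration, and the tie handling is deliberately crude, so treat it as a shape, not a tool.

```python
def ranks(values):
    """Simple ranks (ties broken by input order; fine for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical paired measurements on the same eight variants:
packaging_fitness = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # what the model predicts
transduction      = [0.3, 0.7, 0.2, 0.8, 0.4, 0.6, 0.5, 0.1]  # what you actually need

print(f"rank correlation: {spearman(packaging_fitness, transduction):.2f}")
# A weak correlation here means the model's ranking is, at best, loosely
# related to the quantity your experiment depends on.
```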
Red Flags and Green Flags
These are patterns I have seen repeatedly, named here so you can recognize them.
Red flags:
- no experimental validation,
- or validation performed on the sequences used to train the model (this inflates performance in ways that are hard to detect without reading the methods carefully);
- metrics reported without context: an AUC of 0.90 means something very different in a balanced binary classification with a strong negative class versus an unbalanced screen with 0.1% functional variants (the simulation after this list makes that concrete);
- "we tested 10,000 variants" with no description of how those variants were generated or selected (a biased library gives biased results regardless of scale);
- and benchmarking exclusively against older ML models rather than against wet lab intuition or simpler computational tools. That last one is worth pausing on. If the model cannot outperform a trained capsid engineer's intuition about which residues to mutate, the model's contribution to your workflow is limited.
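To see why the AUC red flag matters, here is a small simulation. All numbers are invented for illustration: a model with a genuinely good AUC around 0.9, applied to a screen where only 0.1% of variants are functional, still hands you a top-100 list that is mostly failures.

```python
import random

random.seed(0)

# Simulated screen: 0.1% functional variants; model scores separated enough
# to give an AUC near 0.9 (positives shifted by ~1.8 standard deviations).
n_neg, n_pos = 99_900, 100
scored = [(random.gauss(0.0, 1.0), 0) for _ in range(n_neg)]
scored += [(random.gauss(1.8, 1.0), 1) for _ in range(n_pos)]

# AUC via the rank (Mann-Whitney) formulation: the probability that a
# random positive outranks a random negative.
scored.sort()  # ascending by score
rank_sum_pos = sum(r for r, (_, label) in enumerate(scored, start=1) if label)
auc = (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Precision among the 100 highest-scoring variants, i.e. the list you
# would actually synthesize and test.
top_100 = scored[-100:]
precision = sum(label for _, label in top_100) / 100

print(f"AUC: {auc:.2f}")                         # looks impressive
print(f"precision in top 100: {precision:.0%}")  # most top picks still fail
```

The same headline AUC that sounds decisive in a balanced benchmark translates, at screen-like prevalence, into an expensive pile of non-functional variants.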
Green flags:
- explicit description of the training data with source, size, serotype, assay type, and experimental conditions;
- a validation set held out before model development began, with results on that set reported separately;
- uncertainty estimates alongside predictions: a model that tells you "I am confident about this variant and uncertain about that one" is more useful than one that gives you a ranked list with no error bars;
- and a limitations section that reads like it was written by someone who actually looked for the limitations, not someone trying to minimize their significance. That last one is rare. When you see it, it means something.
What "Performance" Numbers Actually Tell You
You do not need to understand the math. You need to know what question the number answers.
- AUC tells you whether the model ranks positive examples above negative ones, on the data it was evaluated on. It does not tell you whether the positive examples in that evaluation look like the variants you care about, or whether the model's rankings will hold when you move to a different serotype or a different assay.
- Accuracy tells you how often the model is right, on the held-out test set, which was drawn from the same distribution as the training data. The critical issue is that your experiment is a different distribution. You are asking the model to extrapolate, and extrapolation performance is almost never what the paper reports.
- The distinction between interpolation and extrapolation matters more in capsid engineering than almost anywhere else in protein ML. Models are generally good at interpolating within the sequence space they have seen: predicting the behavior of variants that are similar to their training examples. They are much less reliable when extrapolating to novel regions of sequence space, which is often exactly where the most interesting engineering lies. When a paper reports strong performance, the implicit question is: are those test sequences similar to the training sequences, or genuinely novel? The answer changes everything about how much weight to give the result.
- Calibration (how well the model's confidence scores match its actual accuracy) matters more than raw accuracy for experimental prioritization. A poorly calibrated model that reports high confidence on predictions it gets wrong will send you down expensive experimental dead ends. When evaluating a model for use in your workflow, the question is not just "is it accurate" but "does it know when it is uncertain?" A minimal calibration check is sketched below.
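A calibration check does not require ML expertise, only the model's stated confidences and the eventual experimental outcomes. Here is a minimal sketch using simulated, deliberately overconfident predictions; everything in it is made up for illustration.

```python
import random

random.seed(1)

# Hypothetical model outputs, simulating an overconfident model: it reports
# ~0.9 confidence on calls that are right only ~60% of the time.
predictions = [
    {"confidence": min(0.99, max(0.0, random.gauss(0.9, 0.05))),
     "correct": random.random() < 0.6}
    for _ in range(5_000)
]

def calibration_table(preds, n_bins=5):
    """Compare stated confidence with observed accuracy, bin by bin."""
    bins = [[] for _ in range(n_bins)]
    for p in preds:
        idx = min(int(p["confidence"] * n_bins), n_bins - 1)
        bins[idx].append(p)
    for i, b in enumerate(bins):
        if not b:
            continue
        stated = sum(p["confidence"] for p in b) / len(b)
        observed = sum(p["correct"] for p in b) / len(b)
        print(f"bin {i}: n={len(b):4d}  stated {stated:.2f}  observed {observed:.2f}")

calibration_table(predictions)
# A well-calibrated model shows stated ~= observed in every bin; a large
# gap means its confidence cannot be used to prioritize experiments.
```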
Evaluating the Claim in Context
Before you evaluate what a study claims, ask who funded it and who benefits from the conclusion.
This is not an accusation. It is provenance. Academic work, industry work, and sponsored research each carry different structural pressures. A paper published by a gene therapy company about their own platform technology is not necessarily wrong, but the incentive to show positive results is real, and the cost of publication delay for a negative result is lower than the cost of a public negative result. A paper from an academic lab may be more transparent about limitations but may also lack the operational context that makes a method usable at scale.
In AAV biotech specifically, secrecy is structural. Key training datasets are proprietary. Methods sections mostly describe workflows in terms that are technically accurate but insufficient for reproduction, or they surface first as conference abstracts with no details, with the full paper appearing publicly only years later, once the method is no longer novel. Publication timelines are shaped by patent strategy, not by the readiness of the science. None of this is bad faith; it is the competitive reality of a field where sequence data and screening infrastructure are genuinely hard to build. But it does limit what you can conclude from any single source. A model you cannot reproduce is a model you cannot audit. You can still use it, but you should use it knowing that you are trusting someone else's evaluation of their own work.
The Five Evaluative Questions to Ask Before You Commit
Use these before you act on any ML result. They are not a checklist to complete; they are a habit to build. The scientist who asks them consistently, in every context, is the one who knows when the evidence is strong enough to move on and when it is not. That judgment does not expire.
1. What exactly was predicted, and what was measured to validate it?
This question separates models that predict a proxy from models that predict what you actually care about. A model optimized for packaging efficiency in a pooled screen has learned to predict packaging efficiency in a pooled screen, not transduction in your target tissue, not manufacturability at scale, not in vivo performance in your species.
The gap between what was predicted and what you need to know is often where the most expensive assumptions hide.
2. What is the baseline, and is it a fair comparison?
A model that beats random on a biased library is not the same as a model that beats your best experimental guess. If the comparator is a scrambled capsid library in HEK293 packaging, the bar is low enough that almost any model clears it. And the result tells you almost nothing about whether the model will help you find better variants than you would have found yourself.
The meaningful baseline is not random. It is what an experienced capsid engineer would have proposed from sequence conservation, structural knowledge, and prior screens.
3. Is the training data distribution relevant to my biology?
A model trained on AAV2 HEK293 data is not a general capsid oracle; ask before you apply it. This is not a flaw you can see in the performance metrics. The model will generate confident predictions regardless of whether your biology resembles its training context. A model trained on mouse CNS data may have nothing to say about NHP liver. A model trained on peptide insertions in the AAV2 28-mer may not transfer to your AAV9 VR8 region.
The training context is part of the result, and it is almost never stated prominently enough.
4. Has this been tested outside the lab or company that built it?
Independent replication is rare in this field, which means you should weight it heavily when it exists. A single group validating their own model is a starting point; it tells you the approach is not obviously broken. Independent replication in a different lab, on a different serotype, with a different assay, tells you the approach is generalizably useful. Those are different claims, and the field has far more of the former than the latter.
When you see independent replication, note it. When you do not, factor in that absence.
5. What would it take to falsify this claim?
If the presenter cannot answer this, the claim is not scientific, it is a narrative. Every genuine scientific claim has a condition under which it would be wrong. For an ML model, that might be: a prospective validation set where the model's top-ranked variants underperform a naive baseline, or a serotype transfer experiment where the predictions do not hold.
If the answer to this question is "nothing could falsify it" or if the question produces confusion rather than a clear answer, you are not evaluating a scientific result. You are evaluating a story.
What to Ask For
Identifying a problem and knowing what to do about it are different things. Here is the language that closes that gap.
- In a vendor demo or a conference conversation: "Can you share the held-out validation results, specifically the test set that was separated before model development began?" This distinguishes a model that was genuinely evaluated from one where the evaluation was shaped by the results.
- When a collaborator proposes using an existing model on your program: "What serotype and assay type was the training data from, and how different is that from our target biology?" You are not asking them to defend the model, you are asking them to think with you about the transfer gap.
- When a model output comes with a ranked list and no uncertainty: "What is the model's confidence on these top predictions, and how does that confidence behave at the edges of the training distribution?" A collaborator who cannot answer this does not yet have what you need.
- When a claim feels too clean: "What would a negative result look like for this model, and have you seen one?" The answer to this question tells you more about intellectual honesty than any performance metric.
Critical evaluation is not skepticism for its own sake. Skepticism applied indiscriminately is just friction. What you are building here is judgment: the capacity to know when the evidence is strong enough to act on and when it is not. That judgment does not require you to become an ML scientist. It requires you to ask the right questions with enough specificity that a non-answer is itself informative.
The scientist who walks into an ASGCT session and can distinguish a genuine result from a compelling narrative is not the most skeptical scientist in the room. They are the most useful one.
Next in this Series: Post #6 closes the series with the other side of this relationship: once you know how to evaluate ML work, how do you actually collaborate with the people doing it? What do ML teams need from you, your data, your biological priors, your experimental constraints, to do science that holds up? That is where this has been heading all along.
PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.