AAV-ML for Experimentalists #6: Working With ML Teams, What AAV Scientists Actually Need to Know

TL;DR: Do not set your collaboration up for failure from day one. Learn how to speak with ML teams.


The typical AAV-ML collaboration failure scenario goes like this: 

The collaboration looked good on paper. An ML team with a strong publication record, an experimental team with three years of capsid library data, a shared interest in CNS targeting. The first few meetings were energizing, the kind where everyone leaves feeling like something real is about to happen.

Six months later: a ranked list of 200 variants, a brief methods section explaining that a transformer-based model was trained on the pooled screen data and optimized for predicted transduction scores, and a looming decision about which 20 variants to actually synthesize. 

Nobody can clearly explain why variant 47 ranked above variant 12. Nobody wants to be the one to say that out loud. The experimental team is wondering whether to trust the list or quietly fall back on their own instincts. The ML team is waiting to hear which variants got made so they can close the loop.

This is not a story about bad science or bad people. The ML team did real work. The experimental team provided real data. 

What failed was the scaffold between them: the shared language and explicit agreements that would have allowed ML outputs to become experimental decisions. 

That scaffold almost never gets built at the beginning of a collaboration, because neither side knows quite what to ask for and both sides are optimistic enough to assume it will work itself out. It rarely does. 

Here is how to build it before you need it. 

This is the reference guide no one has written before. 


You Are Not Just the Data Provider

Most experimental scientists enter an ML collaboration thinking of themselves as the data provider. The ML team is the intelligence layer: they take your data and produce something you could not have produced alone. Your job is to generate good data and stay out of the way.

This framing is wrong, and it costs people real scientific value.

Every decision you make in experimental design is a constraint on what the model can learn and what it can generalize to. The cell line you used for your packaging screen is not just a detail; it is a boundary condition on model applicability. The range of variants you chose to synthesize shaped the sequence space the model has seen, which determines where it can interpolate versus where it is extrapolating blindly. The readout you used, whether that was relative packaging efficiency, a transduction ratio, or a cell-type-specific tropism score, is the thing the model optimized for. If that readout is a proxy for what you actually care about rather than the thing itself, the model optimized for a proxy.

Your assay design decisions are model architecture decisions. They just happen before the model is built.

Once you see the collaboration this way, what you bring changes. You are not just a data provider; you are a co-designer of the model's problem formulation. And that means the documentation you generate before the data transfer is as consequential as the data itself.

Before handing over any dataset, document three things explicitly; a minimal metadata sketch follows the list below.

  • First: assay conditions with enough specificity that someone outside your lab could reproduce the readout. This means passage number, cell line source and authentication date, transfection reagent and ratio, plate format, time of harvest, and how you normalized across plates. These details are almost never in a methods section and they are exactly the details that matter for whether a model trained on your data will generalize to your next experiment. 
  • Second: the variant selection rationale, meaning why those sequences and not others. If your library was designed around a particular structural hypothesis or was constrained by synthesis cost, that selection bias is baked into everything the model learns. The ML team needs to know it is there.
  • Third: what the readout does and does not capture. If your tropism score reflects HEK293 transduction but your actual target is primary hepatocytes, write that down. Explicitly. The model cannot know the distance between its training signal and your real goal unless you tell it.
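To make that concrete, here is a minimal sketch of what those three items can look like as structured metadata shipped alongside the dataset. Every field name and value below is an illustrative assumption, not a standard schema; the point is that the assay conditions, the selection rationale, and the readout caveats travel with the data instead of living in someone's memory.

```python
# Minimal, illustrative metadata record to ship alongside a screen dataset.
# Field names and example values are hypothetical, not a community standard.
import json

dataset_metadata = {
    "assay_conditions": {
        "cell_line": "HEK293T, authenticated 2023-04",
        "passage_number": "12-18",
        "transfection": "PEI, 1:3 DNA:PEI, triple transfection",
        "plate_format": "96-well",
        "harvest_time_h": 72,
        "normalization": "per-plate median of internal AAV9 control",
    },
    "variant_selection_rationale": (
        "Saturation mutagenesis of one surface loop only; variants outside "
        "the loop were excluded by synthesis cost, so the model never sees them."
    ),
    "readout_caveats": (
        "Score reflects HEK293T transduction (GFP+ fraction). The actual "
        "target is primary hepatocytes; treat the score as a proxy."
    ),
}

with open("screen_metadata.json", "w") as fh:
    json.dump(dataset_metadata, fh, indent=2)
```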


ML Scientists Underestimate Experimental Constraints  

ML scientists routinely underestimate experimental constraints, not out of arrogance but out of genuine unfamiliarity with what it costs to move from a ranked list to a synthesized, produced, and tested variant. If you have not spent time at the bench, the gap between "generate 200 predictions" and "test 200 variants in vivo" is easy to miscalibrate.

The way you communicate constraints matters as much as the fact that you communicate them. "We can only test 20 variants" sounds like a resource limitation, a ceiling imposed by budget or bandwidth that the ML team should work around as best they can. That framing puts you in the position of apologizing for a constraint rather than specifying a requirement.

Try this instead: "The model needs to be calibrated for top-k selection where k equals 20, under production timelines of six weeks per cohort, with each variant requiring independent AAV production from triple transfection." 

That is not an apology; it is a design specification. It tells the ML team that uncertainty quantification matters, that rank stability across small k is more important than average performance across the full list, and that a model that gives you 200 mediocre predictions is less useful than one that gives you 15 confident ones and flags the rest as uncertain. Those are different model requirements, and they are worth specifying before the model is built rather than after you have a ranked list you do not know how to use.
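For a concrete sense of what that specification changes, here is a hedged sketch with synthetic numbers: the same model can look respectable by an average metric and unhelpful by a selection metric, which is exactly why the top-k framing needs to be stated up front. Nothing below is real data; k is set to 20 only to mirror the example above.

```python
# Sketch: an average metric and a top-k selection metric can disagree.
# All numbers are synthetic; k = 20 mirrors the synthesis budget above.
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 20

true_score = rng.normal(size=n)                                # hypothetical assay outcome
pred_score = 0.6 * true_score + rng.normal(scale=0.8, size=n)  # hypothetical model prediction

# Average-performance view: correlation across the full ranked list.
overall_corr = np.corrcoef(true_score, pred_score)[0, 1]

# Selection view: of the k variants the model would send to synthesis,
# how many are actually among the true top k?
pred_top_k = set(np.argsort(pred_score)[-k:])
true_top_k = set(np.argsort(true_score)[-k:])
hit_rate = len(pred_top_k & true_top_k) / k

print(f"correlation over all {n} variants: {overall_corr:.2f}")
print(f"hit rate in the top {k} selection: {hit_rate:.2f}")
```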

The specific question to push on early: what happens at the boundary of the top-k? If the model ranks variant 20 and variant 21 essentially identically (confidence intervals overlapping, feature drivers similar, and so on), you need to know that before you make your synthesis decision, not after. Ask for it explicitly: "Can you show me the confidence distribution across the top 30 predictions, and flag anywhere the ranking is unstable?" 

That is a sentence you can say in a meeting. It will either produce a useful answer or reveal that the model cannot provide one, which is itself useful information.
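If the ML team's model reports per-variant uncertainty, that question has a direct, mechanical answer. A minimal sketch, assuming each variant comes with a predicted score and something like a 95% interval (both invented here; the real intervals would come from ensembles, dropout, conformal prediction, or whatever the ML team actually uses): flag every rank near the top-k boundary where neighboring intervals overlap.

```python
# Sketch: flag unstable ranks near the top-k boundary using per-variant
# uncertainty. Scores and interval widths are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
k, n_show = 20, 30

mean = np.sort(rng.normal(size=200))[::-1][:n_show]   # top-30 predicted scores, descending
half_width = rng.uniform(0.05, 0.4, size=n_show)      # stand-in for a 95% interval half-width
lo, hi = mean - half_width, mean + half_width

# Rank i is "unstable" if its interval overlaps the next variant's interval,
# i.e. the model cannot really distinguish positions i and i+1.
for i in range(n_show - 1):
    unstable = lo[i] < hi[i + 1]
    flags = (" <-- unstable" if unstable else "") + (" (top-k boundary)" if i + 1 == k else "")
    print(f"rank {i + 1:2d}: {mean[i]:+.2f} [{lo[i]:+.2f}, {hi[i]:+.2f}]{flags}")
```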


How to Push Back on ML Outputs Without Derailing the Relationship

You will look at a ranked list at some point and something will feel wrong. 

A variant ranked in the top 10 will conflict with what you know about that region of the capsid: maybe it changes a residue you have seen abolish infectivity in three separate experiments, or maybe it introduces a motif that carries an immunogenicity risk you do not want to take. Your instinct is to push back. The question is how to do it in a way that is scientifically productive rather than relationally corrosive.

The first move is always to ask for the model's reasoning before expressing your skepticism. Not because your skepticism is wrong (it may be exactly right), but because you need to know whether the model is weighting something you are discounting or discounting something you are weighting heavily. 

The sentence is: "Before I share my read on this, can you show me what the model is weighting most heavily for these top variants?" That question does two things. It gets you information you need to evaluate whether the model's reasoning is sound. And it signals to the ML team that your pushback is analytical rather than territorial.

When your biological intuition conflicts with the model's output, bring it in as data rather than opinion. The difference in practice: "I'm skeptical about variant 7, that loop region has been problematic in my hands" is an opinion. "Variants with substitutions in that loop region showed a 40% reduction in packaging efficiency in our 2022 screen; can we check whether that data was included in training and how it was weighted?" is a data point that opens a diagnostic conversation. The first framing puts you and the ML scientist on opposite sides. The second puts you on the same side, looking at the same problem together.
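One way that diagnostic conversation can start is with a coverage check: how many training variants actually touch the region you are worried about, and what did they score? The sketch below assumes the training data is a simple table of sequences and scores and that the suspect loop is a known span of positions; the reference sequence, positions, and column names are all made up for illustration.

```python
# Sketch: check how well a suspect capsid region is represented in training data.
# Reference sequence, loop positions, and column names are hypothetical.
import pandas as pd

REF = "DGSGQNQQTLKFSVAGPSNMAVQGRNYIPGPSYRQQRVSTT"  # made-up reference stretch
LOOP = range(10, 19)                               # made-up loop positions (0-based)

train = pd.DataFrame({
    "sequence": [REF, REF[:12] + "A" + REF[13:], REF[:30] + "A" + REF[31:]],
    "packaging_score": [1.00, 0.58, 0.97],
})

def touches_loop(seq: str) -> bool:
    """True if the variant differs from the reference inside the loop region."""
    return any(seq[i] != REF[i] for i in LOOP)

loop_variants = train[train["sequence"].apply(touches_loop)]
print(f"{len(loop_variants)} of {len(train)} training variants touch the loop")
print(loop_variants["packaging_score"].describe())
```

A result like "3 of 4,000 training variants touch this loop" does not settle the disagreement, but it tells both teams whether the model's confidence in that region is backed by data or by extrapolation.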

When the disagreement cannot be resolved by examining the model's reasoning, propose a joint diagnostic rather than a unilateral decision. "I want to understand why the model and my intuition are diverging here. Can we pick two variants where we agree and two where we disagree and test all four? That gives us data on whether the model is seeing something I'm missing or vice versa." This is the most useful sentence in an ML collaboration when things are not going well. It converts a disagreement about a ranked list into an experimental question, which is the domain where you have authority and where the answer will actually help both sides.


How to Know When a Collaboration Is Set Up to Fail

Not every ML collaboration is worth continuing. Some are structured in ways that cannot produce useful experimental guidance regardless of how well everyone communicates. Recognizing that early, before you have spent six months and a significant synthesis budget, is one of the most valuable judgments you can develop.

The first warning sign: the ML team cannot clearly state what biological question the model is answering. Not what it predicts; what question the prediction is meant to answer. If the answer is "it predicts your transduction score" and stops there, the model is answering a statistical question, not a biological one. A collaboration where the ML and experimental teams have not agreed on the biological question the model is meant to illuminate is a collaboration that will produce outputs nobody knows how to act on.

The second: there is no plan for experimental validation before scale-up. A model should earn the right to guide large experimental decisions through small validated ones first. If the collaboration structure goes directly from model output to a 200-variant synthesis run with no intermediate checkpoint, no pilot set of 10 or 15 variants chosen to test the model's predictive accuracy on your specific biology before committing to the full list, the collaboration is asking you to absorb all the risk of model failure in a single expensive bet.

The third: data ownership and publication expectations are undefined at the start. This conversation is uncomfortable to have early and catastrophic to avoid. Who owns the trained model? Who can publish what, and when? If your library data ends up embedded in a model that gets published or patented before your own work is out, that is a problem you cannot fix after the fact. The discomfort of the early conversation is much smaller than the cost of the late one.

The fourth: the ML team's incentives end at the model. If the collaboration structure rewards the ML team for building and publishing a model but not for whether the variants it selects actually work in vivo, the incentive gradient points toward a sophisticated model and away from a useful one. That is not bad faith; it is a structural misalignment that shapes behavior in ways neither side may fully recognize.

The fifth: when you ask a question the model cannot answer, the response is a reframing of the question rather than an honest acknowledgment of the limit. Every model has boundaries. ML scientists who are worth collaborating with know where their model's boundaries are and will tell you directly when you are asking something outside them. The ones who respond to every hard question by explaining what the model can do rather than what it cannot are telling you something important about whether this collaboration will be honest when it needs to be.

Before sharing any data with a new ML collaborator, get answers to five questions: 

  1. What biological question is this model designed to answer? 
  2. What experimental validation plan exists before we commit to full-scale synthesis? 
  3. What are the data ownership and publication terms, in writing? 
  4. What happens to the collaboration if the first validation set shows poor predictive accuracy? 
  5. What does the ML team need from me specifically, not generically, to make this model work for my biology?

If any of these questions produces evasion rather than an answer, you have information.
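Question 2 and question 4 are much easier to hold a collaboration to when the checkpoint is quantitative and agreed in advance. A hedged sketch of what that can look like, with invented pilot numbers and an arbitrary placeholder threshold rather than a recommendation: score the pilot set, compare it to the predictions, and apply the go/no-go rule both teams signed off on before the data existed.

```python
# Sketch: a pre-agreed go/no-go check on a pilot validation set before scale-up.
# Pilot numbers and the decision threshold are placeholders, not recommendations.
from scipy.stats import spearmanr

# Predicted vs. measured scores for a hypothetical 12-variant pilot set.
predicted = [0.91, 0.85, 0.80, 0.74, 0.70, 0.66, 0.60, 0.55, 0.50, 0.42, 0.35, 0.20]
measured  = [0.88, 0.40, 0.75, 0.70, 0.20, 0.65, 0.58, 0.30, 0.52, 0.45, 0.10, 0.25]

rho, pval = spearmanr(predicted, measured)
print(f"Spearman rho on pilot set: {rho:.2f} (p = {pval:.3f})")

# Decision rule agreed before the pilot was run (threshold is a placeholder).
PROCEED_THRESHOLD = 0.5
print("proceed to full synthesis list" if rho >= PROCEED_THRESHOLD else "revisit the model first")
```

The specific metric matters less than the fact that it was chosen before anyone saw the pilot results.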


Final Thought

The best ML collaborations in AAV are not the ones with the most sophisticated models. They are the ones where the experimental scientist understood what the model needed, communicated what the biology required, and built enough shared language with the computational team to make the iteration loop fast and honest. 

I have been on the ML side of that table, watching experimental collaborators hand over data without metadata that would have doubled the model's utility. I have been on the experimental side, receiving ranked lists I did not know how to interrogate. 

The gap between those two vantage points is not a technical problem. It is a communication problem, and communication problems are solvable.

That shared language starts with you. Because you are the one who understands both the biology and, now, enough of the ML to ask the questions that close the gap.


This is the final post in AAV-ML for Experimentalists. The full series (six posts, covering generative and predictive models, experimental workflows, applications, claim evaluation, and this) lives on the AIxAAV Interpreter. It was built for bench scientists and PIs who want to work with ML without outsourcing their scientific judgment to it. More is coming: deeper dives, specific tools, and the kinds of case-level analysis that a six-part series cannot fit. If this series gave you something useful, the best thing you can do is share it with one person in your lab who would benefit from it.

The field gets better when more people can ask better questions.


PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.
