AI Is Not Coming for AAV Scientist Jobs. We Are Worried About the Wrong Thing (~10 min read)
TL;DR: AI does not have enough data to replace you. You are the one making both the data and the AI.
Every few days, someone slides into my inbox with a version of the same question. Sometimes it comes from a postdoc who just watched a talk on generative AI. Sometimes it's a VP of Research quietly calculating headcount. Sometimes it's a senior scientist who has spent fifteen years building expertise in capsid biology and is now watching breathless conference presentations about models that can "design novel AAV variants at scale."
The question is always a variation of: "Should I be worried?"
The honest answer is: not about what you think.
The fear is understandable. The headlines are real — generative AI for protein design, autonomous agents in drug discovery, models that propose capsid sequences at a scale no directed evolution campaign could match. If you're an experimental AAV scientist watching this from the bench, the anxiety makes sense. Something is clearly changing.
But the anxiety is aimed at the wrong target. And to see why, you have to work through three realizations in order. Each one reframes the previous. By the end, the question you're asking should be completely different from the one you started with.
Realization One: AI Barely Works Here Yet — and Not for the Reason You'd Guess
Before the question of replacement even becomes relevant, there's a more immediate one: can AI actually function reliably in AAV engineering at all?
The answer is: partially, with significant caveats.
And the limiting factor isn't algorithmic sophistication. It isn't model architecture, compute, or the cleverness of the researchers building these systems.
It's data. Specifically, the near-total absence of the kind of labeled, functionally characterized, experimentally validated data that makes machine learning actually work.
The coverage gap is larger than most people realize
As of the latest UniProt release, there are approximately 246 million protein sequence records in the database. That sounds like a foundation for powerful models. It is not. Fewer than 600,000 of those entries have been manually reviewed and experimentally annotated by expert curators — roughly 0.25% of deposited sequences have anything approaching rigorous functional characterization. The remaining 99.75% are computationally predicted or automatically annotated based on sequence homology to other proteins.
Now zoom in further, to the level that actually matters for AAV work. Of that already tiny fraction with experimental annotation, only a subset have the kind of functional assay data that is relevant to capsid engineering — data that tells you not just whether a protein exists, but whether a specific variant, at a specific dose, in a specific tissue, in a specific species, produces a therapeutically relevant outcome without triggering immune clearance or off-target transduction.
Compare this to what powered the AI success stories you've been reading about.
- Large text corpora used for LLM pretraining span the bulk of publicly produced human language — essentially the full distribution of what humans write, freely available, self-labeling, with ground truth that is the data itself.
- Image models were trained on billions of labeled or self-supervised images.
- When a model gets something wrong in image recognition, you know immediately. You label it, retrain, and move on.
The feedback is instant. The coverage is extraordinary.
In AAV, you are working with a characterized fraction of sequence space so small it barely registers. The model is not learning the landscape. It is learning a handful of footpaths while the continent remains unmapped.
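To make the scale concrete, here is a back-of-envelope sketch in Python. The UniProt figures are the round numbers quoted above; the library example assumes a standard 7-mer peptide-insertion format at a single capsid site and a generous one-million-variant screen. Both are illustrative assumptions, not authoritative counts.

```python
# Back-of-envelope coverage arithmetic. All numbers are illustrative
# round figures from the text, not authoritative database counts.

uniprot_total = 246_000_000        # ~246M protein sequence records
uniprot_reviewed = 600_000         # <600k manually reviewed entries
print(f"Reviewed fraction of UniProt: {uniprot_reviewed / uniprot_total:.2%}")
# -> ~0.24%

# Even one tiny corner of capsid design space dwarfs any screen:
# a single 7-mer peptide insertion (a common AAV library format) spans
seven_mer_space = 20 ** 7          # 1,280,000,000 peptide variants
screened_per_campaign = 1_000_000  # generous estimate for one NGS-read screen
print(f"One large screen covers {screened_per_campaign / seven_mer_space:.3%} "
      f"of that single insertion site")
# -> ~0.078%, before considering the rest of the capsid
```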
The rate at which the gap closes is throttled by biology itself
Here is the part that doesn't get said enough, and it matters enormously for understanding why this field is different from every domain where AI has succeeded.
It isn't just that the characterized fraction of sequence space is tiny. It's that the speed at which you can close that gap is fundamentally limited — not by money, not by compute, not by the number of scientists you hire — but by the physics of wet lab science in a complex biological system.
Think about what it actually takes to generate one meaningful labeled data point in AAV capsid engineering. You design a variant or a library. You clone, produce, and quality-control your vectors — already weeks to months.
You run your primary screen, whether that's cell-based transduction, biodistribution in mice, or a more demanding NHP model. You wait for the biology. You collect tissue, run NGS, process and interpret the data. For anything approaching a clinically relevant readout, you are looking at months or even years per experimental round, with throughput constrained by animal model availability, manufacturing capacity, and the irreducible timelines of in vivo biology.
In fields where AI has succeeded, data deficits are solvable with scale. Need more labeled images? Spin up a crowdsourcing annotation pipeline and have millions of new labels by next week. Need more text? Crawl more of the internet. The bottleneck is resources, and resources respond to money. A well-funded team can close a data gap in weeks.
In AAV, even the best-resourced lab in the world — unlimited budget, unlimited personnel — is still waiting weeks or months per experimental round. You cannot parallelize your way out of a six-week in vivo readout. You cannot throw compute at an NHP study to make it run faster. The data accumulation rate is set by the biology, and the biology does not negotiate.
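A rough sketch makes the asymmetry quantitative. Every number below is an explicit assumption chosen for illustration: one regime scales with spending, the other is gated by round time.

```python
# Label-accumulation rates under two regimes. Every number below is an
# explicit assumption for illustration, not a measured figure.

# Crowdsourced image labeling: throughput scales with spending.
annotators = 1_000
labels_per_annotator_per_day = 500
image_labels_per_year = annotators * labels_per_annotator_per_day * 365
print(f"Image labels per year:          {image_labels_per_year:>12,}")

# In vivo AAV characterization: throughput is gated by round time.
# Doubling the budget does not halve a six-week in vivo readout.
rounds_per_year = 4                 # optimistic for an in vivo screening cycle
validated_variants_per_round = 50   # variants with clinically relevant readouts
aav_labels_per_year = rounds_per_year * validated_variants_per_round
print(f"Validated AAV data points/year: {aav_labels_per_year:>12,}")

print(f"Ratio: ~{image_labels_per_year // aav_labels_per_year:,}x")
```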
The consequence of this is significant and underappreciated: not only is the current data gap enormous, but the trajectory of closing it is slow by design. Every year, the field generates more characterized data than the year before — but the rate of accumulation in AAV will never approach the rate that powered the image and language revolutions. This isn't a temporary state. It is the permanent condition of a field where labels are biologically expensive and experimentally irreplaceable.
This means that the question of whether AI will replace experimental scientists is not just premature — it has the causality backwards. The models need the experimental scientists to function at all.
Realization Two: The Field Has a Knowledge Problem That Makes Everything Worse
Even if the raw characterized data existed at scale, AAV engineering has a second structural problem layered on top of the first — one that almost never gets discussed plainly, perhaps because it implicates everyone in the field including the people writing about it.
AAV is one of the least knowledge-transparent technical fields in modern biology. And it is getting less transparent over time, not more.
The dramatic success of AI in computer science applications stems from a radically different data-availability story.
In computer science, sharing is the currency. In AAV, it is the exception.
In computer science, sharing is not just encouraged — it is how you establish priority, reputation, and influence. ArXiv preprints mean that major results are public before peer review. Open-source code means that a new method is reproducible by anyone within days of publication. Public benchmarks mean that every lab is competing on the same playing field with results that can be directly compared. When a new architecture outperforms the previous state of the art, everyone in the field knows within weeks. The code is on GitHub. Labs around the world are replicating it before the paper is formally published. The knowledge compounds rapidly and publicly because the incentive structure rewards sharing.
AAV does not work this way. And the reasons are structural, not cultural.
Capsid engineering — even work conducted in academic settings — frequently operates under industry partnerships, sponsored research agreements, material transfer agreements, or in the immediate shadow of patent applications. The commercial value of a novel capsid with demonstrated tissue-specific tropism in NHP is significant enough that the default posture, across the field, has shifted toward confidentiality rather than openness. This applies not just to the final results but to the methodology: which library design strategy was used, what the screening funnel looked like, how hits were validated, what the failure rate was, what didn't work and why.
The result is a literature that is systematically incomplete in the ways that matter most for building on prior work. You attend an ASGCT talk describing a beautifully performing capsid variant with robust in vivo data, and the methods section is three sentences long, if it exists at all. The library diversity, the selection conditions, the assay parameters, the number of animals, the criteria for hit calling — absent. Not because the presenters are being careless, but because full disclosure is commercially inadvisable or contractually prohibited.
The consequence: everyone is solving the same problem in private
This knowledge opacity produces a specific and costly pattern that anyone who has spent time in this field will recognize: parallel reinvention at scale.

Every organization — large pharma, emerging biotech, academic lab with an industry partnership — is essentially running its own private experiment in capsid engineering ML. Each is making its own choices about library design, model architecture, training data curation, and validation strategy. With limited ability to triangulate against what the rest of the field has already tried. With no shared benchmark to tell them whether their approach is working well or poorly by any objective standard. With no mechanism for the field to collectively learn that a particular direction is a dead end, rather than having each organization discover that independently and expensively.
For an AAV researcher starting a machine learning project, this creates a problem that has no equivalent in computer science: you often cannot even establish what "good" looks like. The best-performing capsid engineering ML systems are proprietary. The datasets used to train them are proprietary. The validation experiments that established whether they worked were done under NDA and may never be published. Even the published academic results — which are at least nominally open — are reported with enough methodological omission that replication is functionally impossible. There are exceptions, of course, but they are few.
AI tools built on public data are therefore working from a picture of what actually works in AAV that is not just incomplete but systematically biased toward whatever the field has chosen to make visible: a small, non-representative, commercially curated fraction of what has actually been tried.
The knowledge moat in this field is not just technical. It is structural, it is reinforced by the economics of gene therapy, and it compounds the data scarcity problem in ways that are invisible if you are coming to AAV from a CS or general ML background.
Realization Three: Even With Perfect Data, the Hardest Decision Stays Human
So let's grant two things for a moment: that the data exists at scale, and that the field shares it openly. There is still a third problem — and unlike the first two, this one does not go away with more resources or better incentive structures. It is permanent.
The goal of AAV engineering is, by definition, to go where the model has never been
Every predictive model in biology is, at its core, a compressed representation of what has already been characterized. It learns patterns from the sequences and functional data it was trained on, identifies relationships, and extrapolates — carefully, when it is working well — into nearby unmeasured space.
But the explicit goal of AAV capsid engineering is not to find variants near known sequences. It is to design functional capsids that do not exist in nature, that have never been screened, that target receptors or tissue distributions or manufacturing profiles that may have no close analog in the training data. That is, by definition, out-of-distribution work.
When a model operates in-distribution — predicting properties of variants that are similar to what it has seen before — it can be genuinely useful for prioritizing which candidates to test, deprioritizing obvious dead ends, and flagging promising regions of sequence space. That value is real and should not be dismissed.
But the moment you push into genuinely novel functional space, the model's reliability degrades. And — this is the part that matters in a therapeutic context — it degrades silently. The model has no reliable internal signal that it is extrapolating beyond its training distribution. It does not return a lower confidence score because you asked it about a loop region it has never seen characterized. It returns a number. The number looks like the other numbers.
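A minimal sketch with synthetic data and scikit-learn shows the API behavior this paragraph describes. Nothing here is a capsid model; the point is simply that a standard regressor returns the same kind of number whether it is interpolating or extrapolating.

```python
# Silent out-of-distribution failure, in miniature: a regressor trained on
# one region of feature space returns ordinary-looking numbers when queried
# far outside it, with no warning of any kind.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Training data: features in [0, 1], a smooth synthetic "fitness" landscape.
X_train = rng.uniform(0.0, 1.0, size=(500, 8))
y_train = np.sin(X_train.sum(axis=1)) + rng.normal(0, 0.05, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# An in-distribution query vs. a point far outside anything the model has seen.
x_in = rng.uniform(0.0, 1.0, size=(1, 8))
x_out = rng.uniform(5.0, 6.0, size=(1, 8))   # nothing like the training data

print(f"In-distribution prediction:     {model.predict(x_in)[0]:.3f}")
print(f"Out-of-distribution prediction: {model.predict(x_out)[0]:.3f}")
# Both calls return a plain float. Nothing in the API distinguishes the
# second number as an extrapolation: the number looks like the other numbers.
```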
What "confidently wrong" looks like when the stakes are real
Imagine you are running an ML-guided campaign to identify a CNS-tropic capsid with reduced liver transduction. Your model scores a set of variants. The top candidates look promising — high predicted transduction in the target tissue, low predicted off-target signal. You manufacture, dose, and run your NHP study.

Your top candidate fails. Not modestly — it misses on the primary endpoint and shows unexpected hepatotoxicity at your intended dose. The program is delayed by six months while you regroup.
What happened? Perhaps the model's training data was enriched for variants selected under different manufacturing conditions, and the interaction between capsid sequence and your production process created a functional difference the model had no way to represent. Perhaps there was a subtle immunogenic epitope in the loop region the model scored highly — a feature that appears in natural sequences but was never present in the functionally characterized training data. Perhaps the model was extrapolating across a species gap that the training data couldn't bridge.
The model did not flag any of this. It could not — it had no way of knowing it was operating at the edge of its competence. The number it returned looked like the other numbers.
The person who catches this — ideally before the NHP study, in the candidate review meeting where someone asks the right question about training data provenance — is the experimental scientist. The one who knows that this loop region has never been characterized in the context you care about. The one who remembers that the last time the model was confident about a similar structural motif, the hit didn't transfer to your assay conditions. The one who understands not just what the model predicts, but what it was trained to predict, and where those two things diverge.
This is not a soft, qualitative observation about human intuition. It is the specific scientific function that prevents an ML-guided program from confidently accelerating in the wrong direction. And it cannot be automated away, because automation requires knowing in advance what to check for — which is precisely what you don't know when you're operating out of distribution.
What AI agents actually do — and what they leave untouched
It is worth being concrete about what current agentic AI systems in gene therapy actually do, because the gap between the marketing and the reality is instructive.

The most sophisticated deployments — and there are genuinely impressive ones emerging from well-resourced gene therapy organizations — are handling things like: autonomous parsing of experimental results and routing through quality control pipelines, natural language interfaces into internal databases and structure viewers, summarization of program history from internal documents, and orchestration of data flows between computational and experimental systems. These are real and meaningful productivity improvements. They reduce the cognitive overhead of administrative data management. They free scientists to spend more time on the work that requires biological judgment.
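To make "logistics of scientific delivery" concrete, here is a minimal sketch of the kind of rule-based routing such an agent layer performs. The field names and thresholds are hypothetical, not drawn from any real system; the point is that nothing in this layer requires biological judgment.

```python
# Minimal sketch of agent-layer logistics: parse a screening result and route
# it through fixed QC rules. Field names and thresholds are hypothetical.

from dataclasses import dataclass

@dataclass
class ScreenResult:
    variant_id: str
    total_reads: int
    enrichment: float     # post/pre selection read ratio
    replicate_cv: float   # coefficient of variation across replicates

def route(result: ScreenResult) -> str:
    """Apply fixed QC rules; escalate anything the rules cannot settle."""
    if result.total_reads < 1_000:
        return "reject: insufficient coverage"
    if result.replicate_cv > 0.5:
        return "flag: poor replicate agreement, rerun"
    if result.enrichment > 10.0:
        return "escalate: promising hit, needs scientist review"
    return "archive: within expected range"

print(route(ScreenResult("VAR-0042", total_reads=52_000,
                         enrichment=14.2, replicate_cv=0.12)))
# -> "escalate: promising hit, needs scientist review"
# Deciding *why* VAR-0042 is enriched, and whether it will transfer in vivo,
# is the judgment step this pipeline cannot make.
```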
But the organizations building these systems are consistently saying the same thing in their own documentation:
"expert knowledge still outperforms general AI for scientific decisions, trust requires ground truth that the agents themselves cannot generate, and the systems must be deeply informed by domain expertise rather than operating as a replacement for it."
The agents handle the logistics of scientific delivery. The decision about what hypothesis to test, what a surprising result means biologically, and what to do when the model was confidently wrong about your most promising candidate — those decisions remain with the experimental scientist.
The last mile of drug discovery is not a data pipeline problem. It is a biological judgment problem. And biological judgment, in a field as complex and data-sparse as AAV, is not a temporary placeholder waiting to be automated. It is the work.
The Question Worth Sitting With
The most valuable person in the room is the one who knows what to screen — and why.
Across three realizations, the same conclusion emerges from different directions.
- The data gap means the models need experimental scientists to generate the labels they learn from.
- The knowledge opacity means the field's collective intelligence depends on scientists who understand what is and isn't being captured in published results.
- And the out-of-distribution nature of novel capsid design means that the most consequential decisions in any ML-guided program are the ones that require biological judgment no model currently replicates.
The job is not at risk. But the role is evolving — and the evolution is meaningful.
The most valuable experimental scientist in an ML-augmented AAV lab is not the one who runs the most assays, or the one who can implement a neural network architecture, or the one who has adopted the most AI tools. It is the one who understands what a model is actually learning, what its training distribution covers and doesn't, and what it means when a confident prediction lands outside the space the model knows. It is the one who designs screening campaigns not just to find hits but to generate data that is worth learning from — that covers the real distribution of functional space rather than the convenient one. It is the one who asks, before committing to a candidate slate: what does this experiment teach the model that it cannot currently know?
That reframe turns the experimental scientist from a downstream consumer of ML outputs into something more important: the person who determines whether the system is capable of learning anything useful at all. The architect of the data flywheel, not just a user of its outputs.
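As a sketch of what designing experiments the model can learn from might look like in practice, consider a candidate slate that reserves part of its capacity for novelty rather than predicted fitness. The embeddings, scores, and slate size below are synthetic stand-ins; the selection logic is the point.

```python
# Slate selection that buys information, not just hits: split one plate
# between top-scoring candidates and candidates far from anything the
# model has already been trained on. All inputs are synthetic stand-ins.

import numpy as np

rng = np.random.default_rng(1)
emb = rng.normal(size=(2_000, 16))      # candidate embeddings (stand-in)
scores = rng.normal(size=2_000)         # model-predicted fitness (stand-in)
train_emb = rng.normal(size=(300, 16))  # already-characterized variants

def novelty(candidates: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Distance from each candidate to its nearest characterized variant."""
    d = np.linalg.norm(candidates[:, None, :] - reference[None, :, :], axis=-1)
    return d.min(axis=1)

slate_size = 96                         # one plate
n_exploit = 64                          # best predicted candidates
n_explore = slate_size - n_exploit      # most novel candidates

exploit = np.argsort(scores)[::-1][:n_exploit]
remaining = np.setdiff1d(np.arange(len(emb)), exploit)
explore = remaining[np.argsort(novelty(emb[remaining], train_emb))[::-1][:n_explore]]

slate = np.concatenate([exploit, explore])
print(f"Slate of {len(slate)}: {n_exploit} exploit + {n_explore} explore")
```

The exploit fraction chases hits; the explore fraction generates the labels the model currently lacks. Where to set that split is itself a biological judgment call.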
The labs and companies that will matter in AAV-ML over the next decade aren't necessarily the ones with the most sophisticated models. They're the ones who can:
- Design screening campaigns that cover the real-world distribution rather than convenience samples
- Run experiments at a pace that generates proprietary data faster than the field can catch up
- Build institutional memory about what the model got wrong — and why
An AI agent can parse your NGS output. It cannot design the experiment that will tell you why your top candidate failed in NHP when it worked in mouse.
The fear about AI replacing AAV scientists is understandable. It is also, structurally, backwards.
The question worth sitting with is not "will AI take my job?"
It is "am I generating the data that makes AI worth using — and do I understand the system well enough to know when it's working and when it isn't?"
Those are harder questions. They are also more interesting ones. And they point toward a version of this field where the most experienced experimental scientists are not made obsolete by better models — but become more essential precisely because of them.
PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.



Comments
Post a Comment