Why AlphaFold Won't Engineer Your Next AAV Capsid

TL;DR: Structure prediction tools answer a question about shape. AAV capsid engineering requires answers about function. These are not the same question, and confusing them is how engineering programs fail.



The conversation that prompted this article

A gene therapy program recently asked me to advise them on a potential computational collaboration. The pitch from the computational side was straightforward: they would use AlphaFold-like tools to identify functional AAV capsid variants. Not as a starting point for further screening. Not as a filter to eliminate obvious failures before experimental testing. As the primary discovery engine. Zero-shot. Predict the structure, identify the functional capsid, done.

I did not advise them to proceed.

Not because the computational team was incompetent. They were not. Not because structure prediction tools are bad. They are not. But because the entire proposal was built on a category error: the assumption that predicting what a capsid looks like tells you what it will do. In AAV engineering, that assumption is not just wrong. It is expensive. Programs that proceed on it burn time, burn budget, and produce capsids that look beautiful in a figure and fail in an animal.

This article is my attempt to explain precisely why, because this conversation is happening in boardrooms and partnership meetings across the gene therapy field right now, and the people on the biology side deserve a clear framework for evaluating what they are being sold.


First, credit where it is due

AlphaFold2 and its successors are genuine scientific achievements. Predicting protein structure from sequence with near-experimental accuracy was considered an unsolved grand challenge for decades. For most well-folded proteins, that problem is now largely solved. These tools changed what is possible in structural biology. They deserve that recognition.

This article is not an attack on AlphaFold. It is about a precise mismatch between what this category of tools was designed to do and what AAV capsid engineering actually requires. That mismatch has consequences, and the field needs to name it clearly.


What structure prediction tools do well

AlphaFold, ESMFold, RoseTTAFold and the broader family of structure prediction models share one design goal: given an amino acid sequence, predict the three-dimensional conformation that protein is most likely to adopt. They do this with remarkable accuracy for globular, well-folded proteins.

In AAV capsid work specifically, these tools have a legitimate role. You can use them to understand how an engineered variant relates structurally to the parent serotype. You can use them to identify obvious steric clashes in a proposed insertion. You can use them to compare loop configurations across serotypes and contextualize where in the capsid shell a modification sits. These are real contributions.

The problem starts when you ask them to do something they were never built for.


The category-level limitation

Structure prediction models are trained on static structures. They output a single conformation: the most probable energy minimum for a given sequence. That is what they optimize for. That is what their training data contains. That is what the loss function rewards.

This is not a flaw. It is a design choice that made the problem tractable. But it means this entire category of tools cannot see two things that matter enormously in biology.

The first is conformational ensembles. Proteins do not exist in a single state. They sample distributions of conformations, and functional behavior is a property of that distribution, not of any single point within it. The single predicted conformation may not even be the most functionally relevant one. It is simply the most probable one thermodynamically.

The second is the effect of perturbation. When you introduce an amino acid substitution or a peptide insertion, you are not just changing a static shape. You are shifting the energy landscape: changing which conformations are accessible, how the distribution is weighted, how the protein responds to its environment. Structure prediction gives you a new single conformation for the perturbed sequence. It does not tell you how the population of states has shifted.
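The population-shift point can be made concrete with a toy two-state model. The numbers below are illustrative, not measured: a hypothetical loop samples an "open" and a "closed" conformation, and a mutation that leaves the minimum-energy state unchanged can still redistribute the population between them.

```python
import math

# Approximate kT at physiological temperature (~310 K), in kcal/mol.
KT = 0.616

def populations(energies_kcal):
    """Boltzmann populations for a list of state energies (kcal/mol)."""
    weights = [math.exp(-e / KT) for e in energies_kcal]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical wild-type loop: 'closed' is the minimum,
# 'open' sits 1.0 kcal/mol above it.
wt = populations([0.0, 1.0])

# Hypothetical mutant: the minimum-energy state is unchanged,
# but 'open' is now only 0.2 kcal/mol above it.
mut = populations([0.0, 0.2])

print(f"wild-type open fraction: {wt[1]:.2f}")   # a small minority
print(f"mutant    open fraction: {mut[1]:.2f}")  # close to half
```

A static structure prediction returns the closed state in both cases, because the energy minimum has not moved. The functional difference lives entirely in the redistributed population, which is exactly what a single-conformation output cannot show.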

These are not edge cases. They are the central challenge of protein engineering.


Why AAV capsid engineering lives exactly in this gap

The AAV capsid is not a passive container. It is a machine that must execute a precise sequence of functional events: circulate in vivo, evade neutralizing antibodies, engage a cell-surface receptor, trigger endocytosis, survive endosomal acidification, escape to the cytoplasm, traffic to the nucleus, uncoat, and release its genome. Every step depends on dynamic capsid behavior, not static structure.

The literature is explicit on this. Available cryo-EM and crystallographic structures represent a snapshot of the capsid in a low-energy conformation. The capsid in solution must be dynamic to carry out the full sequence of functions required for infection. Conformational changes during endosomal trafficking, VP1 N-terminus externalization through the five-fold pore, and pH-triggered structural rearrangements are all functional requirements that a static structure cannot capture.

The regions that matter most for engineering are the variable regions: VR-I, VR-IV, VR-V, VR-VIII, and VR-IX. These surface-exposed loops are not rigid. Their flexibility is not incidental. It is functional. VR-I, for example, is inherently flexible in solution. Receptor engagement stabilizes its conformation. This flexibility appears to have been evolutionarily selected for: it enforces transduction specificity by requiring receptor contact before the loop adopts a stable, infection-permissive state. A static structure prediction gives you that loop at its energy minimum. It tells you nothing about how the loop behaves before, during, or after receptor contact.

When you engineer a variant (insert a targeting peptide, mutate residues to shift tropism, or optimize a library for CNS selectivity), you are intervening in an ensemble process. The question you are asking is not "what does this variant look like?" It is "how does this variant perform across the distribution of functional states it samples, in the tissue environment it will encounter, against the immune landscape of the patient population you are targeting?" Structure prediction cannot answer that question. Not because it is flawed, but because that question is outside the scope of what it was designed to solve.


What about embeddings? What about protein language models?

This is the objection a knowledgeable reader will raise, and it deserves a precise answer.

Using AF-derived embeddings as input features to a downstream ML model is more defensible than using AF structural outputs directly. Those embeddings encode real information: residue co-evolution, evolutionary constraints, local structural context. For predicting whether an insertion will catastrophically disrupt capsid assembly, or whether a point mutation is grossly destabilizing, AF embeddings can contribute signal. That is a legitimate use.

But those embeddings were learned from static structure data. They do not encode conformational dynamics, ensemble behavior, or functional state transitions. When you train a downstream model on AF embeddings to predict transduction efficiency or in vivo tropism, you are asking a shape-trained representation to generalize to a function-prediction problem. The representation does not contain what you need. You may get a model that correlates structural features with functional outcomes where those correlations happen to exist. But you will miss the functional variance driven by dynamic behavior the embedding never saw.

Protein language models (PLMs) are a more interesting case. ESM2, ProtTrans and related models are trained on sequence databases alone, with no structural supervision. What they learn is the statistical grammar of protein sequences: which substitutions are tolerated, which residue patterns co-occur, which positions are under strong evolutionary constraint. That is genuinely different from what AF learns. A protein language model embedding carries information about evolutionary fitness pressure, which is at least adjacent to functional constraint, without being anchored to a single static conformation. For predicting insertion tolerance or mutational stability, PLM embeddings may outperform AF embeddings precisely because they are not committed to one structural interpretation.  
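The way a PLM would be used for insertion tolerance can be sketched in a few lines. In a real pipeline the per-position distributions would come from a model like ESM2 (mask a position, read the softmax over amino acids); here `plm_position_logprobs` is a hypothetical stand-in returning a uniform distribution, so only the scoring logic is shown, not a real model call.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def plm_position_logprobs(sequence, position):
    """Stand-in for a PLM's per-residue distribution (hypothetical).
    A real implementation would mask `position` and read the model's
    softmax over amino acids; here we return a uniform distribution."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: math.log(p) for aa in AMINO_ACIDS}

def insertion_log_likelihood(parent, insert, site):
    """Score a peptide insertion by summing the model's log-probabilities
    for each inserted residue, evaluated in its new sequence context."""
    variant = parent[:site] + insert + parent[site:]
    return sum(
        plm_position_logprobs(variant, site + i)[aa]
        for i, aa in enumerate(insert)
    )

# Hypothetical insertion into a made-up parent capsid fragment.
score = insertion_log_likelihood("NGSGQNQ", "LALGETTRPA", site=3)
```

With a real PLM behind `plm_position_logprobs`, insertions whose residues fall in high-probability regions of sequence space score higher. That is a proxy for insertion tolerance under evolutionary constraint, which is the point of the paragraph above: useful, but still not a measurement of transduction.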

ESM3 goes further. It is a multimodal model trained jointly on sequence, structure, and function. That sounds like exactly what AAV engineering needs. But there is a critical detail: the function annotations in ESM3's training data are largely synthetic, predicted rather than experimentally measured. ESM3 learns the connection between sequence, structure, and predicted function at evolutionary scale. That is powerful for generative protein design and for reasoning about natural protein space. It is not the same as learning from measured transduction efficiency, neutralization resistance, or manufacturing yield in your specific assay system. The model has never seen those numbers. It cannot have learned from them.


The honest hierarchy is this:

  • AF structural predictions: the wrong output for functional prediction in AAV engineering.
  • AF embeddings as features: useful for structural-disruption tasks; shape-biased and limited for functional prediction.
  • PLM embeddings as features: more useful than AF embeddings for tolerability and evolutionary constraint, but still not trained on the signal you actually need.
  • Measured functional data: the only direct path to the prediction you are trying to make, even with sequence features as simple as one-hot encoding.

Everything else is a proxy. The distance between the proxy and the target is where engineering programs fail.


What actually predicts functional outcomes

Measured fitness data. Experimental observations of how variants perform in cells, in tissue, and in animal models capture the ensemble behavior that no representation trained on structure or sequence alone can provide. A variant's transduction efficiency in a relevant cell type is an integrated readout of every conformational and dynamic event in the infection pathway. Structure tells you what the capsid looks like at its energy minimum. Fitness data tells you what it does.

This is why ML models trained on functional measurements (transduction, packaging efficiency, neutralization resistance, manufacturability), even with sequence features as simple as one-hot encoding, can outperform structure-based approaches for capsid engineering. They are not learning the shape of the capsid. They are learning the relationship between sequence and functional outcome, with all the dynamic complexity of that relationship implicitly encoded in the measurements.
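A minimal sketch of this kind of model, using made-up variant sequences and made-up transduction scores purely for illustration: one-hot encode each peptide and fit a linear model directly against the measured outcome. No structural input appears anywhere.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(sequence):
    """Flatten a peptide into a (length x 20) one-hot feature vector."""
    vec = [0.0] * (len(sequence) * len(AMINO_ACIDS))
    for pos, aa in enumerate(sequence):
        vec[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return vec

def fit_linear(X, y, lr=0.1, epochs=2000):
    """Least-squares fit by stochastic gradient descent (stdlib only)."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = sum(wj * xj for wj, xj in zip(w, xi))
            err = pred - yi
            w = [wj - lr * err * xj / n for wj, xj in zip(w, xi)]
    return w

# Made-up 7-mer insertion variants with made-up transduction scores.
variants = ["LALGETT", "LALQETT", "AALGETT", "LALGETA"]
scores = [0.9, 0.4, 0.7, 0.8]

w = fit_linear([one_hot(v) for v in variants], scores)

def predict(sequence):
    return sum(wj * xj for wj, xj in zip(w, one_hot(sequence)))
```

The model never sees a structure, yet it learns which sequence positions drive the measured outcome, because the measurement itself integrates every dynamic event in the pathway. In practice the same pattern scales up to library-sized datasets and regularized or nonlinear models; the supervision signal, not the architecture, is what makes it work.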

This is also why data quality, assay design, and measurement noise are not peripheral concerns in AAV ML. They are central. The measurement is the signal. Everything else is the model's attempt to generalize from it.


Where the field needs to go

The next generation of protein AI needs to move beyond single-conformation prediction toward models that capture ensembles, relative state populations, and the effects of perturbations on those distributions. That is a harder problem. It will likely require integrating generative models with learned energetics, dynamic sampling, and large-scale functional measurement programs that go well beyond static structural databases.

For AAV capsid engineering specifically, this means building ML frameworks that treat functional outcomes as the primary supervision signal: not structural similarity, not sequence identity, not predicted folding confidence. The capsid that folds correctly is not necessarily the capsid that works, or even the capsid that packages correctly under experimental conditions. The capsid that works is the one that performs across the full functional gauntlet, in the right tissue, in the right patient, at a dose that is manufacturable and safe.

Structure prediction got us to the point where we understand what these machines look like. The field now needs tools that tell us how they behave. That is a different problem. It is the right problem.


A final note on the tools

AlphaFold, ESMFold, and the structure prediction ecosystem will remain part of the AAV engineering toolkit. Use them for structural context. Use them for hypothesis generation. Use them to flag obvious assembly-breaking mutations before you run an experiment. These are the questions they were designed to answer.

Do not ask them to substitute for functional measurement. Do not use structural confidence scores as proxies for in vivo performance. Do not let a beautiful predicted structure convince you a variant will work.

Knowing the difference between what a tool was built for and what you need it to do is not a criticism of the tool. It is good scientific practice.

The field is ready for the next question: what does it do?


PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.
