Why Protein Language Models (PLMs) won't let you explore distant AAV capsids + how to fix it


TL;DR: Your PLM is confusing "distant" AAVs with "broken" AAVs. Left: raw PLM scores unfairly penalize deep mutants, trapping high-fitness variants behind a "distance barrier." Right: depth-normalization removes the bias, letting heavily mutated sequences (like novel AAV capsids) shine just as bright as near-WT ones. Don't let your model hide the best proteins just because they look different.


If you've ever used a protein language model (like ESM) to filter AAV capsid libraries, you've probably noticed something frustrating:

Deep mutants always look bad.

That 12-mutation capsid you designed for immune evasion? ESM scores it poorly. The ancestral reconstruction with 15 changes? Even worse. Meanwhile, the 2-mutation tweak to AAV9 looks great, not because it's better, but because it's closer to what the model already knows.

This is the distance bias problem, and it quietly shapes which capsids make it through your pipeline.

What's happening:

PLMs like ESM learned from natural sequences. They've internalized an accurate belief: most random mutations break proteins. So when you hand them a sequence far from anything in their training set, they hedge their bets and score it low.

The result? Three capsids with equal true transduction efficiency, sitting 1, 5, and 10 mutations from wild type, get scored as Excellent → Good → Weak. Not because of biology, but because of distance.
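To make the bias concrete, here's a toy sketch. It is purely illustrative, not a real ESM call: the assumption baked in is that each mutation subtracts a roughly constant amount from the raw score, independent of the variant's true fitness, which is exactly the pattern described above.

```python
import random

random.seed(0)

# Toy bias model (an illustrative assumption, not a real PLM): each mutation
# subtracts a roughly constant log-likelihood penalty, regardless of the
# variant's true fitness.
PENALTY_PER_MUTATION = 1.5  # hypothetical average penalty per substitution

def toy_plm_score(n_mutations):
    """Raw pseudo-log-likelihood under the toy model (higher = better)."""
    return -PENALTY_PER_MUTATION * n_mutations + random.gauss(0, 0.3)

# Three capsids with identical true fitness, at 1, 5, and 10 mutations from WT:
for name, depth in [("A", 1), ("B", 5), ("C", 10)]:
    print(f"Capsid {name} ({depth} mutations): raw score {toy_plm_score(depth):.2f}")
```

Run it and the ranking falls out of distance alone; the "fitness" of all three capsids was identical by construction.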


Why does this matter in AAV engineering?

This bias bites hardest when you're trying to:

  • Escape neutralizing antibodies (requires getting far from natural serotypes)
  • Explore novel tropisms (often found in deep sequence space)
  • Make large insertions or domain swaps (automatically pushes you far from WT)


The fix is simple but powerful: 

Figure 1: Distance bias in PLM scoring and the calibration fix. (A) Raw PLM scores penalize mutational distance regardless of true fitness. (B) Percentile calibration within each depth removes the penalty, revealing that capsids A, B, and C are equally promising.


Instead of comparing raw PLM scores across your whole library:

  1. Group candidates by mutational depth
  2. Build a background distribution for each depth (uniform shell sampling, i.e. scoring random variants at exactly that depth, works well)
  3. Convert raw scores → percentiles within each depth
  4. Rank by depth-normalized percentiles

Now your 10-mutation immune escape variant competes fairly against other 10-mutation capsids, not against 2-mutation tweaks.
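The four steps above can be sketched in a few dozen lines. This is a minimal illustration, not the paper's implementation: the 60-residue stand-in WT, the `plm_score` toy scorer (which has a distance penalty built in so the bias is visible), and the background size of 500 are all assumptions; in practice you'd swap in a real PLM pseudo-log-likelihood and your actual capsid sequence.

```python
import random
import zlib
from bisect import bisect_left

random.seed(1)
AAS = "ACDEFGHIKLMNPQRSTVWY"
WT = "".join(random.choices(AAS, k=60))  # stand-in for a capsid region

def sample_at_depth(seq, depth):
    """Uniform shell sampling: a random variant at exactly `depth` substitutions."""
    s = list(seq)
    for p in random.sample(range(len(s)), depth):
        s[p] = random.choice([a for a in AAS if a != s[p]])
    return "".join(s)

def plm_score(seq):
    """Stand-in scorer with a built-in distance penalty; swap in a real PLM
    pseudo-log-likelihood here."""
    dist = sum(a != b for a, b in zip(seq, WT))
    noise = (zlib.crc32(seq.encode()) % 100) / 100.0  # per-sequence variation
    return -1.5 * dist + noise

def calibrate(candidates, n_background=500):
    """candidates: list of (name, seq, depth) -> depth-normalized percentiles."""
    backgrounds = {}
    percentiles = {}
    for name, seq, depth in candidates:
        if depth not in backgrounds:  # steps 1-2: per-depth background
            backgrounds[depth] = sorted(
                plm_score(sample_at_depth(WT, depth)) for _ in range(n_background)
            )
        bg = backgrounds[depth]       # step 3: raw score -> within-depth percentile
        percentiles[name] = 100.0 * bisect_left(bg, plm_score(seq)) / len(bg)
    return percentiles                # step 4: rank candidates by these values

candidates = [(f"capsid_{d}mut", sample_at_depth(WT, d), d) for d in (1, 5, 10)]
print(calibrate(candidates))
```

The key design choice: a candidate is only ever compared against random sequences at its own depth, so the constant per-mutation penalty cancels out of the ranking.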



Where this changes the game in AAV:

  • Immune evasion screens: Stop penalizing the very distance you're trying to achieve
  • VR insertion libraries: Long insertions naturally push sequences far from WT; calibration lets them compete
  • Ancestral/chimeric designs: These often sit at intermediate distances where raw PLM bias is strongest
  • Active learning loops: Without calibration, your model keeps pulling you back toward known serotypes

What this means for industry:

If you're at a gene therapy company running ML-guided capsid screens, this bias is silently shaping your pipeline, and probably not in your favor.

The capsids that survive your filters tend to be safe, incremental, close to known serotypes. The bold candidates, the ones that might actually solve your immunogenicity or tissue-targeting problem, get ranked down before anyone reviews them.

That's not a science problem. It's a prioritization problem. And it's fixable in an afternoon.

One calibration step. No new model. No new data. Just a statistical correction that lets your pipeline see what it's been missing.


The bigger picture:

This is the kind of "model-aware engineering" that separates teams who use ML tools from teams who get results. The tool isn't broken; it's behaving exactly as trained. You just have to know where it's biased and correct for it.

Credit: This technique was elegantly demonstrated by Ada Shaw and the Debora Marks Lab (Harvard Medical School). Paper: [Link]

PS: If you want to go deeper on translating ML predictions into actionable AAV biology, that's what The AIxAAV Interpreter is for. I've spent two decades bridging ML theory and application. 

Follow me on LinkedIn (#AIxAAV  #TheBioMLClinic #TheBioMLPlayBook) for more practical insights that accelerate bio-innovation.


