Skip the Data Bottleneck: Designing Capsids for Serotypes You Have No Data On

TL;DR Public AAV2 data → generative model → viable AAV9 capsids. No AAV9 data needed. ~50% hit rate at 9-10 mutations. The same transfer may work for other AAV serotypes; the data bottleneck just got a shortcut.


You want to engineer AAV9. Or AAV8. Or AAVrh10.

But every ML team tells you the same thing: "We need training data first."

So you plan a 12-month campaign. Generate a saturation mutagenesis library. Screen it. Sequence it. Clean the data. Then, finally, you can start building models.

By the time you're ready to design capsids, a year is gone and a few million dollars with it.

What if you didn't need to generate that data at all?


What Changed

A team from Carbonsilicon AI, OBIO, Zhejiang University, and Tsinghua University just demonstrated something that should change how you plan AAV engineering campaigns.
They trained a discrete diffusion model (D3PM) on publicly available AAV2 data: the ~140K multi-mutant sequences from the Dyno/Bryant dataset covering the hypervariable region around loop VIII (positions 561-588).
Then they did something interesting: they took the sequences generated by this AAV2-trained model and transferred them to AAV9.
No AAV9 training data. No AAV9 screens. Just structural homology and a bet that the "grammar" of viable capsids transfers across serotypes.
The result: ~50% viability at 9-10 mutations.
For context, random mutagenesis at that mutational depth gives you nearly zero viable capsids. The Ogden data showed hit rates collapsing past 5 mutations. This model climbed past that cliff, on a serotype it had never seen.
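For intuition, here is a minimal sketch of the D3PM idea the paper builds on: a forward process that progressively corrupts each residue toward uniform noise over the 20 amino acids, which the trained model then learns to reverse. The sequence, noise schedule, and dimensions below are illustrative placeholders, not AAVDiff's actual configuration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)  # 20 residue categories

def q_sample(x0, t, betas, rng):
    """D3PM forward process with a uniform transition kernel:
    by step t, each residue has stayed put with probability
    prod(1 - beta_s) for s <= t, otherwise it has been
    resampled uniformly over the K amino acids."""
    keep_prob = np.prod(1.0 - betas[: t + 1])
    keep = rng.random(x0.shape) < keep_prob
    noise = rng.integers(0, K, size=x0.shape)
    return np.where(keep, x0, noise)

rng = np.random.default_rng(0)
# Placeholder 28-residue segment (the loop-VIII window is 28 positions)
x0 = rng.integers(0, K, size=28)
betas = np.linspace(0.01, 0.2, 50)  # illustrative noise schedule

x_early = q_sample(x0, 2, betas, rng)   # mostly intact
x_late = q_sample(x0, 49, betas, rng)   # close to uniform noise
```

Training teaches a network to undo this corruption step by step; sampling then starts from noise and denoises into sequences that obey the viability "grammar" in the training data.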


Why This Works

AAV serotypes share high structural conservation, especially in the core regions that determine whether a capsid can assemble and package DNA. The hypervariable regions differ in sequence, but the underlying constraints on what makes a viable capsid are similar.

The diffusion model learned those constraints from AAV2 data. When the generated sequences were grafted onto AAV9 at the equivalent positions, enough of those constraints transferred to produce viable particles.
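The graft itself is mechanically simple: take the segment generated by the AAV2-trained model and substitute it into the AAV9 backbone at the structurally equivalent positions. A toy sketch, where the backbone, segment, and coordinates are placeholders rather than real VP1 numbering:

```python
def graft(backbone: str, designed: str, start: int, end: int) -> str:
    """Replace backbone[start:end] (0-based, half-open) with the
    designed segment; lengths must match for a substitution graft."""
    assert end - start == len(designed), "segment length mismatch"
    return backbone[:start] + designed + backbone[end:]

aav9_vp1 = "M" + "A" * 99          # placeholder 100-residue backbone
designed_loop = "GHIKLMNPQR"       # placeholder model-generated segment
variant = graft(aav9_vp1, designed_loop, 40, 50)
```

The hard part is not the string surgery but choosing `start` and `end`: the equivalent positions come from a structural alignment between serotypes, which is exactly where the homology bet enters.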

This isn't magic, and it's not guaranteed to work identically for every serotype or every region. But it's proof that cross-serotype transfer is more than theoretical: it's experimentally validated.


What This Means for Industry

  • You may not need a proprietary dataset to start ML-guided design. If your target serotype is structurally similar to AAV2 (and most clinical serotypes are), public data might be enough to bootstrap your generative models.
  • Speed advantage shifts. The teams that move fastest won't be the ones with the most data. They'll be the ones that leverage existing data most creatively.
  • Your 12-month data generation campaign might be optional. At minimum, you should test whether transfer from public AAV2 data works for your serotype before committing to a long and expensive data collection effort.


The Bigger Picture for AAV Engineering

  • Cross-serotype generalization is now experimentally validated. This was a theoretical possibility; now it's a published result.
  • Generative models are converging on similar capabilities. Diffusion models (AAVDiff), CNNs (Dyno), and protein language models (CAP-PLM) are different architectures, but all show that public data can drive viable capsid design. The bottleneck isn't the model type; it's knowing how to use the data.
  • The "no data" excuse for not using ML is expiring. If you're still doing pure random mutagenesis because "we don't have training data for our serotype," this paper is a direct challenge to that assumption.


Bonus: What About Other Regions?

AAVDiff worked because the Dyno deep-mutant dataset exists: 140K multi-mutant sequences in one region. But capsid engineering happens across multiple sites. What are your options elsewhere?

  • VR-VIII (loop VIII, positions 561-588): You're covered. This is where the Dyno deep-mutant data lives. Generative models like AAVDiff can design here directly.
  • Anywhere on the capsid (fitness/viability filtering): CAP-PLM (Sanofi) trained on single-mutant data spanning the entire AAV2 cap gene, not just one region. It predicts fitness with 0.82 correlation and generalizes to multi-mutants up to ~25 mutations. Use case: you have candidate sequences from any source (rational design, directed evolution, generative model), and you want to filter out dead variants before synthesis. Works across the whole capsid, not region-limited.

  • Loops with public data: A diffusion model trained here is plausible.
  • Loops with no public data: Your options:

    • Generate the data yourself (expensive, 12+ months)
    • Wait for someone else to publish it
    • Try transfer from models of existing loops (unproven, but cross-serotype transfer worked, so cross-region transfer might too)
    • Use CAP-PLM to filter candidates for viability, then screen for function
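The filter-then-screen workflow in that last option looks like the sketch below. The scoring function here is a toy position-specific scoring matrix standing in for a learned fitness model; CAP-PLM's actual interface isn't a public API, so treat every name and number as a placeholder:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def pssm_score(seq, pssm):
    """Sum per-position log-odds scores; a toy stand-in for a
    learned fitness predictor such as CAP-PLM."""
    return sum(pssm[i][AA.index(a)] for i, a in enumerate(seq))

def filter_candidates(cands, score_fn, threshold):
    """Keep only candidates whose predicted fitness clears the threshold."""
    return [s for s in cands if score_fn(s) >= threshold]

rng = np.random.default_rng(1)
L = 10                                   # placeholder segment length
pssm = rng.normal(size=(L, len(AA)))     # toy "model" parameters
cands = ["".join(rng.choice(list(AA), L)) for _ in range(100)]

scores = [pssm_score(s, pssm) for s in cands]
cutoff = float(np.median(scores))        # keep the top half, illustratively
survivors = filter_candidates(cands, lambda s: pssm_score(s, pssm), cutoff)
```

The point of the pattern: synthesis and screening dominate cost, so even a modest in-silico filter that removes mostly-dead variants before the bench step shifts the economics of the campaign.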


Final Thought

The data bottleneck for ML-guided AAV design is real, but it's not as insurmountable as it seemed.

Public datasets exist. Transfer learning works. You can start designing capsids for serotypes you have no data on, today, using freely available resources.

The question isn't whether to use ML for capsid design. It's whether you'll spend 12 months generating data that might already exist in a usable form, or test the shortcut first.


Credit: 

The AAVDiff work discussed here is from: "AAVDiff: Experimental Validation of Enhanced Viability and Diversity in Recombinant Adeno-Associated Virus (AAV) Capsids through Diffusion Generation" (2024).
Authors: Lijun Liu, Jiali Yang, Jianfei Song, Xinglin Yang, Lele Niu, Zeqi Cai, Hui Shi, Tingjun Hou, Chang-yu Hsieh, Weiran Shen, Yafeng Deng 
At: Carbonsilicon AI, OBIO Technology, Zhejiang University, and Tsinghua University.

PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.

