Skip the Data Bottleneck: Designing Capsids for Serotypes You Have No Data On
You want to engineer AAV9. Or AAV8. Or AAVrh10.
But every ML team tells you the same thing: "We need training data first."
So you plan a 12-month campaign. Generate a saturation mutagenesis library. Screen it. Sequence it. Clean the data. Then, finally, you can start building models.
By the time you're ready to design capsids, a year is gone and a few million dollars with it.
What if you didn't need to generate that data at all?
What Changed
Why This Works
The diffusion model learned those constraints from AAV2 data. When the generated sequences were grafted onto AAV9 at the equivalent positions, enough of those constraints transferred to produce viable particles.
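To make "grafted at the equivalent positions" concrete, here is a minimal sketch of the bookkeeping, not the published pipeline. It assumes you already have a gapped pairwise alignment between the AAV2 reference and your target serotype. The function names `map_region` and `graft_region`, and the short toy sequences, are hypothetical and chosen only for illustration.

```python
def map_region(aligned_ref: str, aligned_target: str,
               ref_start: int, ref_end: int) -> tuple[int, int]:
    """Map a 1-based inclusive reference region onto target coordinates.

    Both inputs are rows of one pairwise alignment, with '-' for gaps.
    """
    if len(aligned_ref) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    ref_pos = tgt_pos = 0
    tgt_start = tgt_end = None
    for r, t in zip(aligned_ref, aligned_target):
        if r != "-":
            ref_pos += 1
        if t != "-":
            tgt_pos += 1
        # Only columns where both sequences have a residue define equivalence.
        if r != "-" and t != "-" and ref_start <= ref_pos <= ref_end:
            if tgt_start is None:
                tgt_start = tgt_pos
            tgt_end = tgt_pos
    if tgt_start is None:
        raise ValueError("region has no aligned residues in the target")
    return tgt_start, tgt_end


def graft_region(target_seq: str, start: int, end: int, segment: str) -> str:
    """Replace target residues start..end (1-based, inclusive) with segment."""
    if not 1 <= start <= end <= len(target_seq):
        raise ValueError("region is outside the target sequence")
    return target_seq[:start - 1] + segment + target_seq[end:]


# Toy example (NOT real capsid sequences): reference region 4-5 maps to
# target positions 5-6 because the target has one extra residue upstream.
start, end = map_region("MKT-AYLQ", "MKTGAY-Q", 4, 5)
grafted = graft_region("MKTGAYQ", start, end, "WW")  # "MKTGWWQ"
```

In practice the alignment would come from a structure-aware or standard sequence aligner, and indels inside the variable region are exactly where this simple mapping deserves a manual check.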
This isn't magic. It's not guaranteed to work exactly the same for every serotype or every region. But it's proof that cross-serotype transfer is more than theoretical; it's experimentally validated.
What This Means for Industry
- You may not need a proprietary dataset to start ML-guided design. If your target serotype is structurally similar to AAV2 (and most clinical serotypes are), public data might be enough to bootstrap your generative models.
- Speed advantage shifts. The teams that move fastest won't be the ones with the most data. They'll be the ones that leverage existing data most creatively.
- Your 12-month data generation campaign might be optional. At minimum, you should test whether transfer from public AAV2 data works for your serotype before committing to a long and expensive data collection effort.
The Bigger Picture for AAV Engineering
- Cross-serotype generalization is now experimentally validated. What was a theoretical possibility is now a published result.
- Generative models are converging on similar capabilities. Diffusion models (AAVDiff), CNNs (Dyno), and LLMs (CAP-PLM) are different architectures, but all show that public data can drive viable capsid design. The bottleneck isn't the model type. It's knowing how to use the data.
- The "no data" excuse for not using ML is expiring. If you're still doing pure random mutagenesis because "we don't have training data for our serotype," this paper is a direct challenge to that assumption.
Bonus: What About Other Regions?
AAVDiff worked because the Dyno deep-mutant dataset exists: 140K multi-mutant sequences in one region. But capsid engineering happens across multiple sites. What are your options elsewhere?
- VR-VIII (loop VIII, 564-591): You're covered. This is where the Dyno deep-mutant data lives. Generative models like AAVDiff can design here directly.
- Anywhere on the capsid (fitness/viability filtering): CAP-PLM (Sanofi) trained on single-mutant data spanning the entire AAV2 cap gene, not just one region. It predicts fitness with 0.82 correlation and generalizes to multi-mutants up to ~25 mutations. Use case: you have candidate sequences from any source (rational design, directed evolution, generative model), and you want to filter out dead variants before synthesis. Works across the whole capsid, not region-limited.
- Loops with public data: A diffusion model trained here is plausible.
- Loops with no public data: Your options:
  - Generate the data yourself (expensive, 12+ months)
  - Wait for someone else to publish it
  - Try transfer from models of existing loops (unproven, but cross-serotype transfer worked, so cross-region transfer might too)
  - Use CAP-PLM to filter candidates for viability, then screen for function
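The viability-filtering idea in the list above (score candidates from any source, drop the predicted-dead ones before synthesis) can be sketched in a few lines. This is an illustration under stated assumptions, not CAP-PLM's actual interface: `score_fn` stands in for whatever fitness predictor you use, and `toy_identity_score` is a hypothetical placeholder scorer so the example runs on its own.

```python
from typing import Callable, Iterable


def filter_viable(candidates: Iterable[str],
                  score_fn: Callable[[str], float],
                  threshold: float) -> list[tuple[str, float]]:
    """Keep candidates scoring at or above threshold, best first."""
    scored = [(seq, score_fn(seq)) for seq in candidates]
    kept = [(seq, score) for seq, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


def toy_identity_score(seq: str, reference: str) -> float:
    """Placeholder scorer (hypothetical): fraction of positions matching
    the reference. A real workflow would call a trained fitness model."""
    if len(seq) != len(reference):
        raise ValueError("sequences must have equal length")
    return sum(a == b for a, b in zip(seq, reference)) / len(reference)


# Toy usage with made-up 4-residue strings, not real capsid regions.
reference = "ACDE"
shortlist = filter_viable(
    ["ACDE", "ACDF", "WWWW"],
    lambda s: toy_identity_score(s, reference),
    threshold=0.5,
)
# shortlist == [("ACDE", 1.0), ("ACDF", 0.75)]
```

The threshold is a design choice: a permissive cutoff keeps more diversity for the functional screen, while a strict one saves synthesis budget at the risk of discarding viable variants the model scores poorly.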
Final Thought
The data bottleneck for ML-guided AAV design is real, but it's not as insurmountable as it seemed.
Public datasets exist. Transfer learning works. You can start designing capsids for serotypes you have no data on, today, using freely available resources.
The question isn't whether to use ML for capsid design. It's whether you'll spend 12 months generating data that might already exist in a usable form, or test the shortcut first.
Credit:
PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.
