When You Only Have 1,000 Potential Hits
You screened an NNK library of capsid variants for receptor targeting.
1,000 passed.
You know generative models can help expand that set by learning the patterns of winners and generating more candidates.
But which generative model?
VAE? Diffusion? GAN? Autoregressive?
When your potential hit list is small, this choice matters more than you think.
The Problem with Most Generative Models at Small N
VAEs, GANs, and diffusion models were designed for data-rich regimes. Train them on roughly 1,000 sequences and they tend to overfit or memorize the training set rather than learn the underlying pattern.
Why Autoregressive Models Handle Small Data Better
Autoregressive models factor sequence probability position by position, predicting each residue from the residues before it. Every position of every training sequence becomes a supervised signal, and that conditional structure matches how proteins work: local context heavily constrains what's viable.
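As a toy illustration of position-by-position conditional learning (emphatically not the actual WaveNet architecture), here is a minimal first-order sketch: every adjacent residue pair in a training sequence contributes a conditional count, and sampling walks those conditionals to emit new sequences. The sequences below are hypothetical.

```python
import random
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def train_conditionals(seqs):
    """Count P(next residue | previous residue) from training sequences.
    A first-order stand-in for the deeper conditioning a WaveNet-style
    model learns; every sequence of length L yields L-1 training signals."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for prev, nxt in zip(s, s[1:]):
            counts[prev][nxt] += 1
    return counts

def sample(counts, first, length, rng):
    """Generate a new sequence position by position from the conditionals."""
    seq = [first]
    for _ in range(length - 1):
        dist = counts[seq[-1]]
        residues = list(dist)
        weights = [dist[r] for r in residues]
        seq.append(rng.choices(residues, weights=weights)[0])
    return "".join(seq)

train = ["DEEEIRTTNPVATEQYGSVS", "DEEEIRTTNPVATEQYGTVS"]  # toy 20-mers
model = train_conditionals(train)
print(sample(model, "D", 20, random.Random(0)))
```

A real model conditions on the full prefix through dilated convolutions or attention rather than a single previous residue, but the data-efficiency argument is the same: the supervision is per position, not per sequence.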
What We Found on AAV Engineering Data
- It worked with minimal data. We tested training sizes from 1,000 to 20,000 production-fit sequences. Performance held remarkably steady; even 1,000 training examples produced viable outputs.
- The outputs were novel, not memorized. Measured by minimum Hamming distance to the nearest training example, 16% of generated variants sat at 1 mutation, 38% at 2, 37% at 3, and 8% at 4+. The model learned the pattern and extrapolated.
- The critical result: ultra-high performers. When we manufactured and tested the generated variants, WaveNet's outputs included "very high performers" that matched or exceeded our best previously validated capsids. It found the rare outliers in the right tail; exactly what you need for AAV development.
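The novelty check above is straightforward to reproduce on your own data: compute each generated sequence's minimum Hamming distance to the training set, where 0 means memorized and anything higher means a novel variant. The 7-mer sequences below are hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def min_distance_to_train(generated, training):
    """For each generated sequence, the distance to its nearest training
    example: 0 = memorized, >= 1 = novel variant."""
    return {g: min(hamming(g, t) for t in training) for g in generated}

training = ["AQTNKLV", "AQTNRLV"]            # hypothetical hit sequences
generated = ["AQTNKLV", "AQSNRLV", "GQSNRIV"]
dists = min_distance_to_train(generated, training)
print(dists)   # {'AQTNKLV': 0, 'AQSNRLV': 1, 'GQSNRIV': 3}
```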
What This Means for Industry
- Your architecture choice matters more than your dataset size. If you have 1,000-2,000 potential hits from a first-round screen and you're reaching for a VAE or diffusion model, reconsider. Autoregressive models should be your first choice for small functional datasets.
- Fine-tuning works. You don't need to train from scratch. Pretrained autoregressive protein LLMs can be fine-tuned on your small hit set. The base model provides the general protein grammar; your data teaches it what your winners look like.
- This is how you find outliers. Most screens optimize for average performance. But you don't need 1,000 "pretty good" capsids; you need the one exceptional variant that becomes your lead. Autoregressive models sample the learned distribution, including the rare tail that classifiers and other generative approaches underweight.
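The tail-sampling point can be made concrete with a toy distribution. A classifier-style point estimate always returns the modal residue; sampling from the learned distribution visits rare residues in proportion to their probability mass. The per-position probabilities below are hypothetical.

```python
import random

# A hypothetical learned distribution over residues at one position:
# the "best average" residue dominates, but rare residues carry real mass.
probs = {"L": 0.70, "I": 0.20, "V": 0.08, "W": 0.02}

def greedy(probs):
    """A classifier-style point estimate: always the modal residue."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Stochastic sampling visits the tail in proportion to its mass."""
    residues, weights = zip(*probs.items())
    return rng.choices(residues, weights=weights)[0]

rng = random.Random(0)
draws = [sample(probs, rng) for _ in range(1000)]
print(greedy(probs))        # "L" every single time; the tail never appears
print(draws.count("W"))     # the rare residue still shows up in the sample
```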
The Bigger Picture for AAV Engineering
- "Not enough data for generative models" is architecture-dependent. VAEs and diffusion models do need larger datasets. Autoregressive models don't, if you frame the problem right. Teams have been dismissing generative approaches based on the wrong architecture's requirements.
- The dependency structure is the key. Position-by-position conditional learning is a better match for protein sequences, where local context heavily constrains what's viable. This isn't just about data efficiency; it's about matching the inductive bias to the problem.
- Small functional datasets are more valuable than you think. 1,000 potential hits from a hard selection isn't a limitation. With the right model, it's a foundation.
Bonus: When You Must Use Classification
- If your data is tiny: Gaussian Processes. GPs handle uncertainty gracefully and don't overfit the way neural networks do. They're especially powerful for active learning loops where you're iteratively selecting candidates for validation.
- Feed as many negatives as possible, but weight your samples. The model needs to learn what doesn't work. Include your full screen, but weight the training loss toward the positive class. Watch precision/recall tradeoffs, or better, the metric most meaningful to how you'll actually use the model, not just overall accuracy. Do not undersample! Do not undersample! Do not undersample!
- If you need to distinguish hits from near-misses: contrastive learning. When negatives are one mutation away from positives, standard classifiers struggle. Contrastive approaches are specifically designed to learn embeddings that separate similar-but-different sequences; exactly what you need when the decision boundary is tight.
- Use codon augmentation to expand your data. The same protein sequence can be encoded by many DNA sequences. Generate synonymous codon variants of your hits and use DNA or codon encoding. It's free data expansion; see Giessel et al., Bioinformatics Advances 2022.
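The Gaussian Process suggestion above can be sketched in a few lines of numpy, under the assumption that variants have already been embedded as numeric feature vectors (the 2-D points below are hypothetical): exact GP regression with an RBF kernel, plus an uncertainty-driven active-learning pick.

```python
import numpy as np

def rbf(A, B, lengthscale=1.0):
    """Squared-exponential kernel between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise=1e-2):
    """Exact GP regression: predictive mean and variance at X_test."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    Ks = rbf(X_train, X_test)
    Kss = rbf(X_test, X_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = np.diag(Kss) - (v * v).sum(0)
    return mean, var

# Hypothetical 2-D embeddings of screened variants and their scores.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y_train = np.array([0.1, 0.9, 0.4])
X_cand = np.array([[0.5, 0.0], [3.0, 3.0]])   # near vs far from data

mean, var = gp_posterior(X_train, y_train, X_cand)
# Active-learning pick: the candidate the model is least sure about.
print(int(np.argmax(var)))   # the far-away candidate has the highest variance
```

In practice you would reach for a maintained implementation such as scikit-learn's `GaussianProcessRegressor` rather than rolling your own, but the mechanics are exactly this.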
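The sample-weighting advice can be sketched as a positive-class-weighted cross-entropy. The hit/screen counts below are hypothetical, and the n_neg/n_pos ratio is just one common starting heuristic to tune from, not a rule.

```python
import numpy as np

def weighted_bce(y_true, p_pred, pos_weight):
    """Binary cross-entropy with the positive class up-weighted.
    Keeps every negative in the loss (no undersampling!) while keeping
    the rare positives from being drowned out."""
    eps = 1e-9
    loss = -(pos_weight * y_true * np.log(p_pred + eps)
             + (1 - y_true) * np.log(1 - p_pred + eps))
    return loss.mean()

# Hypothetical screen: 1,000 hits among 100,000 assayed variants.
n_pos, n_neg = 1_000, 99_000
pos_weight = n_neg / n_pos   # a common starting heuristic: ~99x

y = np.array([1, 0, 0, 0])            # one hit, three negatives
p = np.array([0.3, 0.1, 0.2, 0.05])   # model probabilities for "hit"
print(weighted_bce(y, p, pos_weight))
```

Most frameworks expose this directly (e.g. a `pos_weight`-style argument or per-sample weights), so you rarely need to hand-write the loss; the point is the weighting, not the code.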
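One concrete form of the contrastive idea above is the classic margin-based pairwise loss (Hadsell-style): pull same-label pairs together in embedding space, push different-label pairs at least a margin apart. The embeddings below are hypothetical stand-ins for encoder outputs.

```python
import numpy as np

def contrastive_loss(z1, z2, same_label, margin=1.0):
    """Margin contrastive loss: same-label pairs are penalized by their
    squared distance; different-label pairs are penalized only if they
    sit closer than `margin`. Useful when a hit and a near-miss differ
    by one mutation but must land on opposite sides of the boundary."""
    d = np.linalg.norm(z1 - z2)
    if same_label:
        return d**2
    return max(0.0, margin - d) ** 2

# Hypothetical embeddings: two hits and a one-mutation near-miss that
# lands close in embedding space despite its different label.
hit_a = np.array([0.9, 0.1])
hit_b = np.array([0.8, 0.2])
miss  = np.array([0.7, 0.3])

print(contrastive_loss(hit_a, hit_b, same_label=True))    # small: fine
print(contrastive_loss(hit_a, miss,  same_label=False))   # large: pushes apart
```

Training an encoder against this loss (or a modern variant like triplet or NT-Xent loss) is what produces embeddings where a tight decision boundary becomes learnable.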
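The codon-augmentation trick is easy to implement. The sketch below uses a deliberately truncated reverse codon table covering only the residues in the toy peptide; a real implementation would cover all 20 amino acids and, ideally, restrict to codons compatible with your expression system.

```python
from itertools import product

# Truncated reverse codon table (standard genetic code) for the toy
# peptide below; a real table covers all 20 amino acids.
CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
}

def synonymous_variants(protein, limit=None):
    """Enumerate DNA sequences that all encode `protein`. Each one is a
    distinct training example at the DNA/codon level but identical at
    the protein level: free data expansion."""
    choices = [CODONS[aa] for aa in protein]
    variants = ["".join(codons) for codons in product(*choices)]
    return variants[:limit] if limit else variants

variants = synonymous_variants("MKL")
print(len(variants))   # 1 * 2 * 6 = 12 encodings of the same tripeptide
```

For realistic hit lengths the full product explodes combinatorially, so in practice you sample a fixed number of random synonymous encodings per hit rather than enumerating them all.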
Final Thought
When you only have ~1,000 hits, your generative model architecture matters.
Autoregressive models (WaveNet, SeqDesign, fine-tuned protein LLMs, etc.) learn sequential dependencies explicitly. That inductive bias lets them work where VAEs and diffusion models struggle.
If you're sitting on a small, hard-won dataset of functional variants and wondering whether ML can help: it can. You just need the right tool.
Credit:
This work was done during my time leading the computational group at the Broad Institute's AAV Engineering program, led by Andrew J. Barry.
ASGCT 2023, Abstract #43: "Generative Networks Create Novel Receptor Targeted AAVs with Only 1,200 Training Examples." Andrew J. Barry, Fatma Elzahraa Eid, Ken Y. Chan, Qin Huang, Jencilin Johnston, Benjamin E. Deverman.
PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.
