When You Only Have 1,000 Potential Hits

TL;DR When you only have ~1,000 functional hits, your generative model architecture matters more than your dataset size. Autoregressive models learn dependencies position by position, a stronger inductive bias that lets them work where VAEs and diffusion models fail.

You screened an NNK library of capsid variants for receptor targeting.

1,000 passed.

You know generative models can help expand that set by learning the patterns of winners and generating more candidates.

But which generative model?

VAE? Diffusion? GAN? Autoregressive?

When your potential hit list is small, this choice matters more than you think.


The Problem with Most Generative Models at Small N

Most generative architectures (e.g., VAEs, diffusion models, GANs) aim to model the full joint distribution over the components of a sequence. That is, they attempt to capture P(x₁, x₂, x₃, ... xₙ) all at once, without an explicit factorization into simpler conditional pieces.

That's a lot to learn. The interactions between positions, the correlations, the constraints: all of it has to be inferred from your training examples.
With 100,000 sequences, that works. The model sees enough variation to figure out what matters.
With 1,000 sequences? The joint distribution is underspecified. The model doesn't have enough examples to disentangle real patterns from noise.

Why Autoregressive Models Handle Small Data Better

Autoregressive models decompose the problem differently.
Instead of learning P(x₁, x₂, x₃, ...) directly, they learn:

P(x₁) · P(x₂|x₁) · P(x₃|x₁,x₂) · P(x₄|x₁,x₂,x₃) ...

Each position is predicted given the positions before it. The dependencies are explicit. The model doesn't have to infer the correlation structure from scratch; it's built into the architecture.
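
To make the factorization concrete, here's a minimal sketch. It fits an order-1 (bigram) autoregressive model on toy hit sequences and scores a sequence by the chain rule. Real models like WaveNet and SeqDesign condition on the full prefix with a neural network rather than just the previous residue, but the decomposition is the same; all sequences and names here are illustrative.

```python
# Minimal sketch of the autoregressive factorization (order-1 for brevity).
# Real models (WaveNet, SeqDesign) condition on the FULL prefix with a neural
# net; the chain-rule decomposition below is the same idea.
import math
from collections import Counter, defaultdict

AA = "ACDEFGHIKLMNPQRSTVWY"

def fit_bigram(seqs, alpha=1.0):
    """Estimate P(x1) and P(x_i | x_{i-1}) from hit sequences, with smoothing."""
    start = Counter(s[0] for s in seqs)
    trans = defaultdict(Counter)
    for s in seqs:
        for prev, cur in zip(s, s[1:]):
            trans[prev][cur] += 1
    p_start = {a: (start[a] + alpha) / (len(seqs) + alpha * len(AA)) for a in AA}
    p_trans = {
        prev: {a: (trans[prev][a] + alpha) /
                  (sum(trans[prev].values()) + alpha * len(AA))
               for a in AA}
        for prev in AA
    }
    return p_start, p_trans

def log_likelihood(seq, p_start, p_trans):
    """log P(x) = log P(x1) + sum_i log P(x_i | x_{i-1})  (chain rule)."""
    ll = math.log(p_start[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(p_trans[prev][cur])
    return ll

hits = ["DGTLAVPFK", "DGQLAVPFK", "EGTLAVPYK"]  # toy stand-ins for real hits
p_start, p_trans = fit_bigram(hits)
print(log_likelihood("DGTLAVPFK", p_start, p_trans))
```

Each conditional is a small, well-specified learning problem. That is the data-efficiency argument in miniature: you never ask the model to infer the whole joint at once.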

This is a stronger inductive bias. And when data is scarce, inductive bias is what saves you.

WaveNet, SeqDesign, autoregressive protein LLMs: they all share this structure. They learn the grammar of your sequences position by position, which is easier than learning the entire language at once.

What We Found on AAV Engineering Data

A few years ago, when I was leading the computational group at the Broad Institute's AAV program, we tested this directly.

We had ~1,200 capsid variants that bound a specific human receptor on CNS vasculature. We needed to generate more candidates: diverse, novel, but functionally similar.
We trained a WaveNet-based autoregressive model (SeqDesign) on these 1,200 hits.

  • It worked with minimal data. We tested training sizes from 1,000 to 20,000 production-fit sequences. Performance held remarkably steady; even 1,000 training examples produced viable outputs.
  • The outputs were novel, not memorized. Measured by minimum Hamming distance from the training set, 16% of generated variants were 1 mutation away, 38% were 2 away, 37% were 3 away, and 8% were 4+ away. The model learned the pattern and extrapolated.
  • The critical result: ultra-high performers. When we manufactured and tested the generated variants, WaveNet's outputs included "very high performers" that matched or exceeded our best previously validated capsids. It found the rare outliers in the right tail, which is exactly what you need for AAV development.


What This Means for Industry

  • Your architecture choice matters more than your dataset size. If you have 1,000-2,000 potential hits from a first-round screen and you're reaching for a VAE or diffusion model, reconsider. Autoregressive models should be your first choice for small functional datasets.
  • Fine-tuning works. You don't need to train from scratch. Pretrained autoregressive protein LLMs can be fine-tuned on your small hit set; the base model provides the general protein grammar, and your data teaches it what your winners look like (see the sketch after this list).
  • This is how you find outliers. Most screens optimize for average performance. But you don't need 1,000 "pretty good" capsids; you need the one exceptional variant that becomes your lead. Autoregressive models sample the learned distribution, including the rare tail that classifiers and other generative approaches underweight.
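
Here's a minimal fine-tuning sketch under stated assumptions: it uses the Hugging Face transformers stack, with ProtGPT2 (nferruz/ProtGPT2) as one publicly available autoregressive protein LM; any causal protein LM checkpoint would slot in. The hit sequences and hyperparameters are placeholders, not tuned values.

```python
# Hedged sketch: fine-tune a pretrained autoregressive protein LM on a small
# hit set with Hugging Face. ProtGPT2 is one example checkpoint; sequences
# and hyperparameters below are placeholders.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL = "nferruz/ProtGPT2"                 # any causal protein LM works here
hits = ["MDGTLAVPFKQ", "MDGQLAVPFKE"]      # your ~1,000 functional variants

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(MODEL)

ds = Dataset.from_dict({"text": hits}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=64),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="aav_ft", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=1e-5),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Sample novel candidates from the fine-tuned distribution.
out = model.generate(do_sample=True, top_p=0.95, max_length=64,
                     num_return_sequences=10,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.batch_decode(out, skip_special_tokens=True))
```

The sampling knobs (temperature, top-p) are how you trade novelty against fidelity to the training distribution when generating candidates.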


The Bigger Picture for AAV Engineering

  • "Not enough data for generative models" is architecture-dependent. VAEs and diffusion models do need larger datasets. Autoregressive models don't, if you frame the problem right. Teams have been dismissing generative approaches based on the wrong architecture's requirements.
  • The dependency structure is the key. Position-by-position conditional learning is a better match for protein sequences, where local context heavily constrains what's viable. This isn't just about data efficiency; it's about matching the inductive bias to the problem.
  • Small functional datasets are more valuable than you think. 1,000 potential hits from a hard selection isn't a limitation. With the right model, it's a foundation.


Bonus: When You Must Use Classification

Sometimes generative models aren't the right tool; sometimes a classifier is the best use of the data you have, or is simply what answers the question you're asking. Here's how to maximize signal from small datasets in classification settings (hedged sketches for each point follow the list):

  • If your data is tiny: Gaussian Processes. GPs handle uncertainty gracefully and don't overfit like neural networks. They're especially powerful for active learning loops where you're iteratively selecting candidates for validation.
  • Feed as many negatives as possible, but weight your samples. The model needs to learn what doesn't work. Include your full screen, but weight the training loss toward the positive class. Watch precision/recall tradeoffs, or better, the metric most meaningful to how you'll use the model, not just overall accuracy. Do not undersample! Do not undersample! Do not undersample!
  • If you need to distinguish hits from near-misses: contrastive learning. When negatives are one mutation away from positives, standard classifiers struggle. Contrastive approaches are specifically designed to learn embeddings that separate similar-but-different sequences, which is exactly what you need when the decision boundary is tight.
  • Use codon augmentation to expand your data. The same protein sequence can be encoded by many DNA sequences. Generate synonymous codon variants of your hits and use DNA or codon encoding. It's free data expansion; see Giessel et al., Bioinformatics Advances 2022.
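
For the GP point, a minimal scikit-learn sketch: one-hot features, an RBF kernel, and predictive uncertainty used to pick the next variants to validate. The sequences, features, and kernel are illustrative choices, not recommendations.

```python
# Hedged sketch: a Gaussian Process classifier on one-hot encoded sequences,
# using predictive uncertainty to choose the next variants to test
# (a basic active-learning step).
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

AA = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    x = np.zeros((len(seq), len(AA)))
    for i, a in enumerate(seq):
        x[i, AA.index(a)] = 1.0
    return x.ravel()

train_seqs = ["DGTLAVP", "DGQLAVP", "EGTLAVP", "KGTLAVP"]  # toy examples
y = np.array([1, 1, 0, 0])                                 # hit / non-hit
X = np.stack([one_hot(s) for s in train_seqs])

gp = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)

pool = ["DGTLAVQ", "EGQLAVP", "KGQLAVP"]                   # untested variants
proba = gp.predict_proba(np.stack([one_hot(s) for s in pool]))[:, 1]
# Most uncertain candidates (probability closest to 0.5) are most informative.
order = np.argsort(np.abs(proba - 0.5))
print([pool[i] for i in order[:2]], proba[order[:2]])
```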
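
For the weighting point, a sketch of keeping every negative while re-weighting the loss, here via scikit-learn's class_weight; the features are random stand-in arrays, used only to make the snippet runnable.

```python
# Hedged sketch: keep ALL negatives, but up-weight the rare positive class
# instead of undersampling. Data below is a random stand-in for a real screen.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support

X = np.random.rand(21_000, 64)   # stand-in features for the full screen
y = np.zeros(21_000, dtype=int)
y[:1000] = 1                     # ~1,000 hits among ~21k screened variants

# class_weight="balanced" rescales the loss by inverse class frequency,
# so the 1,000 positives are not drowned out by 20,000 negatives.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

p, r, f1, _ = precision_recall_fscore_support(y, clf.predict(X),
                                              average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")  # track PR, not accuracy
```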
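
For the contrastive point, a minimal PyTorch sketch of a classic margin-based contrastive loss over sequence embeddings; the encoder, margin, and tensors are all illustrative.

```python
# Hedged sketch: a margin-based contrastive loss that pulls same-label pairs
# together and pushes near-miss negatives apart in embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, seq_len=7, n_aa=20, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(seq_len * n_aa, dim))
    def forward(self, x):                    # x: (batch, seq_len, n_aa) one-hot
        return F.normalize(self.net(x), dim=-1)

def contrastive_loss(z1, z2, same_class, margin=0.5):
    """Pull same-class pairs together, push different-class pairs apart."""
    d = 1.0 - (z1 * z2).sum(-1)              # cosine distance, in [0, 2]
    pos = same_class * d.pow(2)
    neg = (1 - same_class) * F.relu(margin - d).pow(2)
    return (pos + neg).mean()

enc = Encoder()
x1, x2 = torch.rand(16, 7, 20), torch.rand(16, 7, 20)  # stand-in pairs
same = torch.randint(0, 2, (16,)).float()              # 1 = same label
loss = contrastive_loss(enc(x1), enc(x2), same)
loss.backward()
```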
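
And for codon augmentation, a self-contained sketch that back-translates a protein into synonymous DNA encodings using the standard genetic code. The augmentation idea is from Giessel et al. (2022); this particular implementation is an illustration, not their code.

```python
# Hedged sketch of codon augmentation: each protein hit becomes several
# synonymous DNA sequences, expanding the training set for DNA/codon models.
import random

# Standard genetic code, grouped as amino acid -> synonymous codons.
SYN = {
    "A": ["GCT", "GCC", "GCA", "GCG"], "C": ["TGT", "TGC"],
    "D": ["GAT", "GAC"], "E": ["GAA", "GAG"], "F": ["TTT", "TTC"],
    "G": ["GGT", "GGC", "GGA", "GGG"], "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"], "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"], "M": ["ATG"],
    "N": ["AAT", "AAC"], "P": ["CCT", "CCC", "CCA", "CCG"],
    "Q": ["CAA", "CAG"], "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"], "V": ["GTT", "GTC", "GTA", "GTG"],
    "W": ["TGG"], "Y": ["TAT", "TAC"],
}

def codon_augment(protein, n_variants=10, seed=0):
    """Return n synonymous DNA encodings of one protein sequence."""
    rng = random.Random(seed)
    return ["".join(rng.choice(SYN[aa]) for aa in protein)
            for _ in range(n_variants)]

for dna in codon_augment("DGTLAVP", n_variants=3):
    print(dna)
```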


Final Thought

When you only have ~1,000 hits, your generative model architecture matters.

Autoregressive models (WaveNet, SeqDesign, fine-tuned protein LLMs, etc.) learn sequential dependencies explicitly. That inductive bias lets them work where VAEs and diffusion models struggle.

If you're sitting on a small, hard-won dataset of functional variants and wondering whether ML can help: it can. You just need the right tool.


Credit: 

This work was done during my time leading the computational group at the Broad Institute's AAV Engineering program, led by Andrew J. Barry.

ASGCT 2023, Abstract #43: "Generative Networks Create Novel Receptor Targeted AAVs with Only 1,200 Training Examples." Andrew J. Barry, Fatma Elzahraa Eid, Ken Y. Chan, Qin Huang, Jencilin Johnston, Benjamin E. Deverman.

SeqDesign method: Shin, Jung-Eun, et al. "Protein design and variant prediction using autoregressive generative models." Nature Communications 12.1 (2021): 1-11.

PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.

