AAV-ML for Experimentalists #2: ML That Scores Your Variants

 

TL;DR: Predictive AI is a prioritization tool. It helps you decide what's worth testing.

You have a library of 10,000 candidate variants. You can only test 100.

Which ones do you pick?

Random selection? You'll waste most of your effort on duds. Cherry-pick based on intuition? You might miss the best candidates.

This is the core problem predictive models solve. They score your variants so you can prioritize intelligently.


The Core Concept

In the last post, we covered generative models — ML that proposes new sequences for you to make.

Predictive models do the opposite. They don't propose anything new. They take sequences you already have and score them.

Input: a sequence.
Output: a score or a label for a property you care about.

That's it. It's a prioritization tool. It helps you decide what's worth testing.


Why Bother?

You can't screen everything. Budgets are finite. Timelines are tight. Screening capacity has limits.

Random selection is wasteful. If only 1% of your candidate library has the property you want, random sampling means 99% of your effort goes to failures.

A good predictive model flips those odds. It concentrates your screening on variants more likely to succeed.

Even modest enrichment matters. If a model gives you 2-3x enrichment, you're getting the same number of hits with a fraction of the effort. Or more hits with the same effort.

That's real value: faster timelines, lower costs, better candidates.
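The arithmetic behind that claim is worth making concrete. A back-of-envelope sketch, with hypothetical numbers (a 1% base hit rate and a budget of 100 variants):

```python
# Back-of-envelope: what a modest enrichment buys you.
# All numbers below are illustrative assumptions, not measured values.

def expected_hits(hit_rate: float, n_tested: int, enrichment: float = 1.0) -> float:
    """Expected hits when a model multiplies the base hit rate."""
    return hit_rate * enrichment * n_tested

base = expected_hits(0.01, 100)             # random selection: ~1 hit
with_model = expected_hits(0.01, 100, 3.0)  # 3x enrichment: ~3 hits

# Equivalently: the same ~1 hit now needs only ~33 tested variants.
tests_needed = 100 / 3.0
```

The same calculation works for any budget: enrichment either multiplies your hits or divides your screening effort.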


How It Works (No Math, Just Intuition)

The model learns from examples.

You give it sequences paired with experimental outcomes: this one packaged, that one didn't; this one transduced well, that one poorly.

It looks for patterns. Which amino acids at which positions associate with good outcomes? Which combinations tend to fail?

It doesn't memorize your data. It learns generalizable features — what "good" sequences tend to look like.

When you give it a new sequence, it asks: "Does this look more like the winners or the losers?"

The output is a score. Higher score means the model thinks it's more likely to have the property.
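The winners-versus-losers intuition can be made concrete with a deliberately tiny sketch. The sequences and labels below are made up, and the per-position counting is a toy stand-in for a real model:

```python
# Toy illustration of "learn from labeled examples, then score new sequences".
# Real models are far richer; this just compares per-position amino-acid
# counts between winners and losers. All sequences here are hypothetical.

from collections import Counter

winners = ["ARKG", "ARNG", "TRKG"]   # e.g. packaged (toy labels)
losers  = ["PWDC", "PWEC", "GWDC"]   # e.g. failed to package

def position_profile(seqs):
    """Per-position amino-acid counts across a set of sequences."""
    return [Counter(col) for col in zip(*seqs)]

good, bad = position_profile(winners), position_profile(losers)

def score(seq):
    """Higher when residues look more like the winners than the losers."""
    return sum(good[i][aa] - bad[i][aa] for i, aa in enumerate(seq))

# A new sequence resembling the winners scores high; one resembling
# the losers scores low.
```

Real models replace the counting with learned features, but the question they answer for a new sequence is the same one: winners or losers?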


Two Modes of Prediction

This is where it gets practical. Predictive models answer two related but different questions. Understanding which one you need shapes how you use them.

Mode 1: Classification, "Will it work?" (Filtering out failures)

This is a binary question. Will this variant package — yes or no? Does it transduce — yes or no? Does it bind — yes or no?

The model acts as a triage filter. Its job is to remove the obvious failures before you spend resources on them.

The output is either a yes/no call or a probability: "The sequence is 80% likely to package."

When to use it: Before expensive steps. Don't synthesize variants predicted to fail. Don't put likely duds into your screen.

Example: You've designed 100,000 variants computationally. Synthesis costs limit you to 10,000. A packaging fitness classification model predicts which ones are viable. You synthesize only the predicted packagers.

The value: You're not wasting synthesis budget on sequences that would never have worked anyway.

Mode 2: Regression, "How well will it work?" (Prioritizing candidates)

This is a ranking question. Among the variants that work, which ones are best?

The model doesn't just say yes or no. It gives you a continuous score that lets you rank: variant A > variant B > variant C.

When to use it: When you can only test a subset and want to maximize your chances of finding the best performers.

Example: You have 10,000 variants predicted to package. But you can only screen 500 for transduction. A transduction regression model ranks them. You test the top 500 instead of a random 500.

The value: You're concentrating your screening effort on the most promising candidates.
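A minimal ranking sketch, with made-up regression scores standing in for a real model's output:

```python
# Mode 2 in miniature: rank by a continuous score, test the top-k.
# The predicted-transduction numbers are hypothetical.

predicted = {"V1": 1.2, "V2": 8.7, "V3": 4.4, "V4": 0.3, "V5": 6.1}
budget = 3  # you can only screen this many

# Sort variants by predicted score, best first, and keep the top-k.
top_k = sorted(predicted, key=predicted.get, reverse=True)[:budget]
```

The absolute numbers matter less than the ordering: ranking only has to get the relative comparisons right to beat random selection.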

How They Work Together

Often, you use both modes sequentially.

First, filter: remove variants predicted to fail entirely.

Then, rank: prioritize among the survivors.

Think of it as a funnel:

  • 100,000 designed variants
  • Filter to 10,000 predicted to package
  • Rank by predicted transduction
  • Screen top 1,000
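The funnel can be sketched in code. The classifier and regressor here are random-number stand-ins, not real models; the point is the shape of the pipeline, filter then rank then screen:

```python
# Filter-then-rank funnel with stand-in models (random scores).
# Counts match the funnel above: 100,000 -> 10,000 -> top 1,000.

import random

random.seed(0)
designed = [f"variant_{i}" for i in range(100_000)]

def p_packages(v):          # stand-in classifier: P(packages)
    return random.random()

def pred_transduction(v):   # stand-in regressor: continuous score
    return random.random()

# Step 1 (filter): keep the 10,000 variants most likely to package.
scored = [(v, p_packages(v)) for v in designed]
viable = [v for v, p in sorted(scored, key=lambda x: x[1], reverse=True)[:10_000]]

# Step 2 (rank): order the survivors by predicted transduction.
ranked = sorted(viable, key=pred_transduction, reverse=True)

# Step 3 (screen): take the top 1,000 into the wetlab.
to_screen = ranked[:1_000]
```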

Sometimes a single model does both: it outputs a continuous score where zero means "won't work" and higher means "works better." But conceptually, you're still filtering and ranking.

Which mode matters most depends on your bottleneck. Limited synthesis capacity? Filtering is key. Limited screening bandwidth? Ranking is key. Both? Use both.


What Can Be Predicted?

Predictive models exist for various AAV properties:

  • Production/packaging fitness: The most common and most validated application. Will the capsid assemble and package DNA? This is well-suited for prediction because large training datasets exist.
  • Biodistribution and transduction efficiency: Harder than packaging. Transduction depends on cell type, tissue, delivery route, even species. A model trained on HEK293 transduction may not predict liver transduction in mice.
  • Receptor binding: Can the capsid bind a specific receptor? More targeted, requires specific training data.
  • Multiple properties at once: Emerging "multi-task" models predict several traits jointly. This helps when training data for any single property is sparse.

The key point: A model can only predict what it was trained on. A packaging model knows nothing about tropism. A mouse transduction model might not transfer to NHP. Always ask: what property, in what context, was this trained to predict?

Example: In work I led at the Broad Institute, we trained regression models across dozens of traits, spanning production fitness, cell binding and transduction, mouse biodistribution and transduction, and non-human primate (NHP) targeting. Read the full story here.

In related published work, we developed and applied production fitness classification models to evaluate candidate variants generated from first-round screens and generative models trained on them. These classification models were used to systematically filter out non-viable capsids prior to second-round screening. Read the full story here.


Two Ways to Get a Predictive Model

Option 1: Use an existing model

Models trained on public datasets or shared by collaborators may already cover your property of interest.

Good for: Packaging fitness, where large public datasets exist (like Dyno's published AAV2 data).

Limitation: The model was trained on someone else's library, serotype, and assay. It may not transfer perfectly to yours. Performance can degrade when your variants look very different from the training data.

Option 2: Train on your own data

Screen a subset of your library. Use those results to train a model. Then prioritize the rest.

The workflow:

  • Round 1: Screen 10,000 variants → get experimental labels
  • Train: Build a model on those results
  • Predict: Score the remaining 90,000
  • Round 2: Test the top predictions

Good for: When you need predictions tailored to your specific assay, serotype, or property.

Limitation: Requires upfront screening investment before you get the model's benefit.
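The four-step workflow can be sketched as a runnable skeleton. The screen, train, and predict functions below are toy stand-ins, not a real assay or model:

```python
# Round 1 -> train -> predict -> Round 2, in miniature.
# Everything here (sequences, labels, "model") is a hypothetical stand-in.

def screen(variants):
    """Stand-in wetlab assay: label a variant 'good' if it contains 'R'."""
    return {v: ("R" in v) for v in variants}

def train(labels):
    """Stand-in training: remember residues seen in good variants."""
    good = [v for v, ok in labels.items() if ok]
    return set("".join(good))

def predict(model, variant):
    """Stand-in scoring: fraction of residues seen in good variants."""
    return sum(aa in model for aa in variant) / len(variant)

library = ["ARK", "PWD", "TRG", "CCC", "RRK", "GGA"]

labels = screen(library[:3])                    # Round 1: screen a subset
model = train(labels)                           # Train on those results
rest = library[3:]
scores = {v: predict(model, v) for v in rest}   # Predict the remainder
round2 = sorted(rest, key=scores.get, reverse=True)[:2]  # Test top predictions
```

Swap in your real assay and model of choice; the loop structure stays the same.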

The hybrid approach

Start with an existing model for initial filtering. Generate your own data. Then train a custom model for later rounds.


What Predictive Models Need

  • Labeled training data: Sequences paired with experimental outcomes. No labels, no model.
  • Enough positives AND negatives: The model learns by contrast. If you only show it winners, it can't learn what failure looks like. If you only show it failures, it can't learn what success looks like. Balance matters. If you only have positives, consider generative models instead.
  • Reasonable signal in the data: If your assay is very noisy or your outcomes are random, no model can extract a pattern. Garbage in, garbage out.
  • For ranking, quantitative measurements: Binary labels (worked/didn't) enable filtering. But for ranking, you need continuous measurements — titers, fold-enrichment, binding scores. The more granular your data, the better the model can rank.


What Predictive Models Can and Cannot Do

They can:

  • Filter out likely failures before you spend resources
  • Rank variants by likelihood of having a property
  • Enrich your hit rate (2-10x is typical)
  • Speed up your campaign by focusing effort

They cannot:

  • Guarantee that top-scored variants will work — it's probability, not certainty
  • Predict properties they weren't trained on — packaging models don't know tropism
  • Extrapolate far beyond their training distribution — if your sequences look very different, predictions become unreliable
  • Replace experimental validation — predictions must always be tested


The Key Distinction: Interpolation vs. Extrapolation

This is critical for setting expectations.

Interpolation: The model scores sequences that are similar to what it saw during training. It's filling in gaps within known territory. Models are generally reliable here.

Extrapolation: The model scores sequences that are very different from training data. It's venturing into unknown territory. Models struggle here.

Most predictive models interpolate well. Few truly extrapolate.

Before trusting a prediction, ask: How similar are my candidates to the training set? If you're scoring variants with 20 mutations when the model only saw variants with 5, be cautious. The model is guessing outside its experience.
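One cheap proxy for that question is the mutation count (Hamming distance) between each candidate and its nearest training sequence. The sequences below and the cutoff of 5 are illustrative assumptions, not rules:

```python
# Sanity check before trusting predictions: how far is each candidate
# from its nearest training sequence? Toy sequences; the threshold is
# a hypothetical choice, not a universal rule.

def hamming(a, b):
    """Number of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def nearest_distance(candidate, training_set):
    return min(hamming(candidate, t) for t in training_set)

training = ["ARKGTQ", "ARNGTQ", "TRKGTQ"]

close = nearest_distance("ARKGTA", training)   # 1 mutation: interpolation
far = nearest_distance("PWDCYS", training)     # 6 mutations: extrapolation

MAX_TRUSTED = 5  # assumed cutoff; calibrate against your training data
trust_close = close <= MAX_TRUSTED
trust_far = far <= MAX_TRUSTED
```

Candidates far beyond the cutoff aren't necessarily bad; their scores just deserve less weight in your decisions.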


Where the Field Stands

Predictive AI in AAV has moved from academic curiosity to standard practice. Here's what's happening now (based on literature and ASGCT abstracts up to 2025):

  • Packaging fitness prediction is mature. Multiple groups have validated models. Public data enables benchmarking. This is the most reliable current application.
  • Transduction and tropism prediction is harder. More context-dependent — species, tissue, delivery route all matter. Models trained in one context often don't transfer to another.
  • Multi-property models are emerging. Predicting several traits jointly helps with data sparsity. If you have limited transduction data but abundant packaging data, a joint model can share information between tasks.
  • The trend: Predictive models are increasingly used as filters — either to triage designed libraries before synthesis or to score generative model outputs before validation. They're becoming a standard step in the workflow, not a novelty.


How It Fits Your Workflow

Before synthesis: Score designed variants. Filter out predicted failures. Save synthesis costs.

Before screening: Rank your library. Test the top tier first. If you find enough hits, you may not need to screen everything.

After generation: Generative models propose candidates. Predictive models score them. You only make the ones that pass both tests.

Iteratively: Screen → train → predict → screen top predictions → refine model → repeat. Each round improves your model and focuses your effort.


The Questions to Ask

When someone offers you a predictive model or presents predictions, ask:
  1. "What property was this trained to predict?" Packaging and transduction are different. Mouse and NHP are different. Make sure the model matches your question.
  2. "What was the training data?" Serotype? Library type? Assay conditions? Species? The more your situation differs from training, the less reliable the predictions.
  3. "Is this filtering or ranking?" Binary output (yes/no) or continuous score? Make sure the output matches your decision: triage vs. prioritization.
  4. "Interpolation or extrapolation?" How different are your variants from what the model saw? Similarity breeds reliability. Distance breeds uncertainty.
  5. "How was it validated?" Held-out test set? Prospective experimental confirmation? Computational metrics alone aren't proof. You want evidence the model works on sequences it wasn't trained on.
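Question 5 has a minimum computational bar that is easy to sketch: evaluate on sequences the model never saw during training. The data and "model" below are toy stand-ins:

```python
# Held-out evaluation in miniature: fit on one slice, measure on another.
# Sequences, labels, and the rule-based "classifier" are all hypothetical.

data = [("ARK", 1), ("TRG", 1), ("PWD", 0), ("CWD", 0),
        ("RRK", 1), ("GWD", 0)]

train_set, test_set = data[:4], data[4:]   # simple split; randomize in practice

# "Train": collect residues that appear in positive training examples.
good_residues = {aa for seq, y in train_set if y for aa in seq}

def predict(seq):
    """Toy classifier: 'works' if most residues were seen in positives."""
    return sum(aa in good_residues for aa in seq) / len(seq) >= 0.5

# Accuracy on the HELD-OUT set is the number that matters; training
# accuracy alone proves nothing about generalization.
held_out_acc = sum(predict(seq) == bool(y) for seq, y in test_set) / len(test_set)
```

Prospective experimental confirmation is still stronger evidence than any held-out metric, but a model with no held-out evaluation at all shouldn't be trusted.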


Final Thought

Predictive models score your variants so you can prioritize intelligently.

Use them to filter out failures before wasting resources. Use them to rank candidates when you can't test everything.

They don't guarantee success. They improve your odds.

The key is knowing what the model was trained on, how similar your sequences are, and what question it's actually answering.


Next in the series: How ML Fits Into Your Experiments (connecting the computational and wetlab sides of the workflow).

PS: This is what The AIxAAV Interpreter is for: translating ML methods into actionable AAV engineering strategies. Follow me on LinkedIn for more practical insights that accelerate bio-innovation.





