Trained on strangers, tested on siblings
If you've looked into polygenic embryo screening, you've probably come across accuracy numbers. A score "explains 20% of disease variance." An AUC of 0.75. Those numbers come from real published research, and they look solid.
But they're answering the wrong question.
Those validation studies take thousands of unrelated people from a biobank, score their genomes, and check who gets sick. That's perfectly good science for understanding disease at a population level. The problem is, when you're screening embryos, you're not comparing strangers. You're comparing embryos from the same two parents. Same gene pool. Same family background. Telling your embryos apart is a fundamentally different statistical problem from telling strangers apart.
We tested our scores on siblings, because that's the test that actually matters for embryo screening. For 16 of 17 disease traits, they held up.
If you'd like to learn more about our screening, you can reach out to us here.
The easy test and why it misleads
People from different parts of the world have different genetics. They also have different environments, diets, healthcare access, and cultural practices. When you build a polygenic score from a large biobank and test it on the same kind of population, the score picks up on all of these differences at once. Some of that predictive power is genuinely genetic. But some of it is just correlation. The score might partly be capturing the fact that people from a particular genetic background tend to live in places with particular environmental exposures.
This is a well-known problem in statistical genetics. The standard fix is adding "principal components" to the analysis. These are mathematical summaries of ancestry patterns in the genome. Adjust for ancestry, the theory goes, and you strip out the environmental stuff and keep the real genetic signal.
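For readers who want to see the mechanics, here's a toy simulation (all numbers hypothetical) of the idealized case principal-component adjustment is designed for: a score that predicts a trait only because both track a single ancestry axis. Regressing the score and the trait on that "principal component" before estimating the association wipes out the apparent signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical setup: one ancestry axis (the "principal component"),
# a score that partly tracks it, and a trait driven only by an
# environmental factor that also tracks ancestry.
pc1 = rng.normal(size=n)                 # ancestry summary
score = 0.5 * pc1 + rng.normal(size=n)   # score correlated with ancestry
env = 0.8 * pc1                          # environment tracks ancestry too
trait = env + rng.normal(size=n)         # no direct score -> trait effect

# Naive regression: the score looks predictive because it proxies ancestry.
naive_beta = np.polyfit(score, trait, 1)[0]

# Standard fix: residualize both variables on the PC, then regress.
def residualize(y, x):
    coefs = np.polyfit(x, y, 1)
    return y - np.polyval(coefs, x)

adj_beta = np.polyfit(residualize(score, pc1),
                      residualize(trait, pc1), 1)[0]

print(f"naive slope: {naive_beta:.2f}, PC-adjusted slope: {adj_beta:.2f}")
```

The adjusted slope collapses toward zero here because a single PC captures the confounder perfectly. The catch, as the studies discussed next show, is that real recent population structure isn't captured that cleanly.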
Reasonable idea. It doesn't work.
Gilbert et al. (2025) showed that adding more principal components doesn't consistently reduce within-family attenuation. And Zaidi & Mathieson (2020) explained why: recent population structure involves patterns that common-variant principal components simply can't capture. The statistical correction can't substitute for actual sibling testing. Family-based studies sidestep these confounders entirely.
And it gets worse. Research from the Song Lab at UC Berkeley showed that highly parameterized polygenic scores can actively overfit to population structure through random effects. More variants in the model can actually make things worse. A score can look great on a new sample from the same population while performing poorly on the comparison that actually matters: siblings.
Bottom line: if you want to know whether a polygenic score captures real genetic effects, the kind that differ between embryos from the same parents, you have to test it on siblings. There's no statistical shortcut.
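To make that concrete, here's a toy sibling-difference test (hypothetical numbers throughout). A family-level confounder inflates the score's apparent effect when you compare unrelated people, but subtracting one sibling from the other cancels anything the siblings share, recovering the true direct effect.

```python
import numpy as np

rng = np.random.default_rng(1)
n_fam = 2000

# Hypothetical model: a family-level confounder boosts both siblings'
# scores and both siblings' traits. Within a family it's constant.
family_conf = rng.normal(size=n_fam)

# Sibling scores: shared family component + independent segregation noise.
score = family_conf[:, None] + rng.normal(size=(n_fam, 2))

direct = 0.3  # true causal effect of the score on the trait
trait = direct * score + 1.0 * family_conf[:, None] + rng.normal(size=(n_fam, 2))

# Population-style estimate: treat everyone as unrelated (inflated).
pop_beta = np.polyfit(score.ravel(), trait.ravel(), 1)[0]

# Within-family estimate: sibling differences cancel the confounder.
d_score = score[:, 0] - score[:, 1]
d_trait = trait[:, 0] - trait[:, 1]
fam_beta = np.polyfit(d_score, d_trait, 1)[0]

print(f"population slope: {pop_beta:.2f}, within-family slope: {fam_beta:.2f}")
```

The population estimate lands well above the true effect of 0.3, while the sibling-difference estimate recovers it. That gap is exactly what a sibling validation is designed to expose.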
What our sibling validation showed
We built polygenic scores for 17 disease traits using SBayesRC, a method that integrates functional genomic annotations across roughly 7 million common genetic variants (Zheng et al. 2024, Nature Genetics). SBayesRC doesn't just count variants. It uses biological information about which regions of the genome are likely to be functional, weighting variants by their probability of actually doing something in the cell. This improves cross-ancestry prediction accuracy by up to 34% compared to previous methods and outperforms LDpred2, MegaPRS, and PRS-CSx.
Then we did the hard test. We validated on sibling pairs.
Sixteen of seventeen disease scores showed no decrease in predictive performance within families (Moore et al. 2025). That's not a marginal finding. It means the scores are picking up on real causal genetic variants, not artifacts of population structure. The within-family R-squared values were 28% to 193% higher than academic benchmarks and published competitor results.
For type 2 diabetes, the liability R-squared reached 0.21. For families in which both parents have type 2 diabetes and there are 10 embryos, the expected risk difference between the highest and lowest-risk embryos can be up to 23.5%. Those aren't hypothetical projections from population data. They're predictions validated the way that actually mirrors embryo screening: comparing siblings.
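For the curious, here's a simplified Monte Carlo sketch of where numbers like that come from. It assumes a standard liability-threshold model, a liability R-squared of 0.21 (from the text), an illustrative population prevalence of 10%, and the standard quantitative-genetics result that sibling scores share half their variance through the parents. It does not condition on parental diagnosis, so it won't reproduce the 23.5% figure; it just shows the mechanism.

```python
from statistics import NormalDist
import numpy as np

nd = NormalDist()
r2 = 0.21                    # liability-scale R^2 of the score (from the text)
K = 0.10                     # assumed population prevalence (illustrative)
thresh = nd.inv_cdf(1 - K)   # liability threshold that yields prevalence K

rng = np.random.default_rng(2)
n_fam, n_emb = 20000, 10

# Sibling embryos share half the score variance through the parents;
# the other half segregates independently among embryos.
shared = rng.normal(scale=np.sqrt(r2 / 2), size=(n_fam, 1))
seg = rng.normal(scale=np.sqrt(r2 / 2), size=(n_fam, n_emb))
score = shared + seg

phi = np.vectorize(nd.cdf)

def risk(s):
    # P(disease | score): residual liability is N(0, 1 - r2).
    return 1.0 - phi((thresh - s) / np.sqrt(1.0 - r2))

hi = risk(score.max(axis=1))   # highest-risk embryo per family of 10
lo = risk(score.min(axis=1))   # lowest-risk embryo per family of 10
gap = (hi - lo).mean()
print(f"mean risk gap between highest and lowest of 10 embryos: {gap:.1%}")
```

Even without conditioning on affected parents, the highest- and lowest-scoring of ten embryos differ by double-digit percentage points of absolute risk under these assumptions. Conditioning on two affected parents raises the baseline and widens the gap, which is how figures like 23.5% arise.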
The one that didn't pass, and why that matters
Osteoporosis. Its within-family performance dropped meaningfully compared to the population-level result.
That might sound like a failure. It's actually one of the most informative results in the entire study.
Of the seventeen traits we tested, osteoporosis is the one most influenced by shared household environment. Family members who grow up together share diet, exercise patterns, calcium intake, and sun exposure. A population-level score picks up on those environmental correlations. A within-family test strips them away, because siblings share those household factors. When the environmental signal disappears, the score performs worse.
That's exactly what should happen if the method is working correctly. The osteoporosis result is the positive control. It shows the sibling test can detect confounding when confounding exists, which makes it more convincing that confounding is minimal in the other sixteen diseases. If all seventeen had passed, you'd actually have reason to wonder whether the test was sensitive enough. One failure makes the sixteen successes more credible.
The "40 to 55% drop" you might have seen
You may have read that polygenic scores lose roughly 40 to 55% of their predictive power when tested within families. That finding is real.
But it comes from a specific context. Selzam et al. (2019) found that drop for educational attainment and behavioral traits. Educational attainment is heavily influenced by population stratification, socioeconomic confounders, and assortative mating. When you remove those signals by looking within families, a large chunk of the score's apparent predictive power evaporates.
That result doesn't generalize to all traits. Selzam et al. also found that anthropometric traits like height and BMI showed much smaller reductions. And in our data, 16 of 17 disease traits showed no meaningful attenuation at all.
This matters because we don't screen embryos for educational attainment. The disease traits that families actually care about behave differently from the traits that dominate the within-family attenuation literature.
Independent work backs this up. Lin et al. (2025) confirmed that height and BMI show minimal within-family attenuation, and that the large attenuation effect concentrates in cognitive and educational traits. Smith et al. (2025) developed a method called PGSUS to decompose polygenic score variance into direct genetic effects versus stratification and assortative mating effects, finding evidence of stratification contamination specifically in scores for height and educational attainment from large meta-analyses.
The pattern is consistent: confounding is trait-specific, not universal. And for the disease traits we screen, the scores are clean.
What to ask your screening provider
If you're evaluating embryo screening, there's a specific question worth asking: "Was your validation done on siblings or on unrelated people?"
If the answer is unrelated people, or if there's no clear answer, the accuracy numbers you're seeing may not reflect how the scores actually perform when comparing your embryos. For most disease traits, population-level and within-family performance don't diverge dramatically, so the scores might still be decent. But "might be decent" isn't the standard you should accept for a decision about your family.
We published our within-family validation data because we think families deserve to see how scores perform in the actual context of screening. Not just that they work on strangers.
If you'd like to talk through what within-family validated scores mean for your specific situation, our team can walk you through it.