The '2.5 IQ points' claim: why a 2019 height-and-IQ study doesn't apply to disease screening

If you're a prospective parent researching polygenic embryo screening, you've almost certainly hit this claim: a 2019 study in Cell found that screening embryos for height and IQ would gain "only about 2.5 IQ points." It shows up in Science magazine, in policy discussions, and in pretty much every article arguing that embryo screening has "limited utility." Most of that criticism is aimed at cognitive screening specifically.

There are two problems with how the study gets used. First, the IQ predictors it modeled are out of date: today's predictors explain nearly five times as much cognitive variance. Second, it never studied disease screening at all. The same research group later applied the same methodology to disease, and the answer there looks very different from what 2019 continuous-trait predictors produced.

Mainstream coverage hasn't caught up on either point.

What Karavani actually studied

Karavani et al. (2019) asked a very specific mathematical question: if you pick the top-scoring embryo from a set of ten, how much does the average move for continuous traits like height and IQ? They used the predictors available in 2019: the most accurate height predictors at the time, and educational-attainment-derived IQ predictors that captured roughly 4.3% of cognitive variance. For those traits, with those inputs, the answer really was modest. A couple of IQ points. A couple of centimeters.

But disease screening asks a completely different question. It doesn't ask "by how many points does the mean shift?" It asks "does this embryo cross a clinical threshold for developing a disease?"

Those are different calculations. And they produce very different answers.

Think about how we evaluate blood pressure medication. Nobody asks "how much does it shift the population mean blood pressure?" The useful question is "how many heart attacks does it prevent?"

That's the distinction the media coverage missed. Karavani answered the mean-shift question for height and IQ. Journalists applied the answer to the threshold question for conditions like type 2 diabetes and schizophrenia. Those are fundamentally different pieces of math.

Here's why this matters. IQ scores are normally distributed. Shift the mean by 2.5 points, and the person at the 50th percentile barely notices. But near the tails, the math changes dramatically. A small shift in the mean produces a disproportionately large change in the proportion of people who cross a fixed threshold. Lencz et al. (2021) calculated this directly: a 2.5 IQ point mean gain corresponds to a 33.5% reduction in the probability of IQ below 70 for individuals near that threshold.
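A quick way to see the tail effect for yourself is to compute it. The sketch below assumes IQ follows a normal distribution with mean 100 and standard deviation 15, and that a 2.5-point gain shifts the whole distribution without changing its spread. The function names are ours, and this is a simplified illustration, not necessarily the exact calculation in Lencz et al.

```python
from statistics import NormalDist


def share_below(threshold, mean, sd=15.0):
    """Fraction of a normal population scoring below `threshold`."""
    return NormalDist(mu=mean, sigma=sd).cdf(threshold)


def relative_reduction(threshold, mean, shift, sd=15.0):
    """Relative drop in the share below `threshold` after the mean moves by `shift`."""
    before = share_below(threshold, mean, sd)
    after = share_below(threshold, mean + shift, sd)
    return 1.0 - after / before


# Same 2.5-point shift, two different thresholds (assumed N(100, 15) scale).
tail_cut = relative_reduction(threshold=70, mean=100, shift=2.5)
median_cut = relative_reduction(threshold=100, mean=100, shift=2.5)

print(f"Relative reduction in share below 70:  {tail_cut:.1%}")
print(f"Relative reduction in share below 100: {median_cut:.1%}")
```

Under these assumptions, the same shift removes a much larger fraction of the population below 70 than below 100, which is the whole point of asking the threshold question separately.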

That same number, 2.5 points, looks trivial or consequential depending entirely on what question you're asking. And disease works the same way. Type 2 diabetes either develops or it doesn't. When the question is "does this person cross the disease threshold," you're no longer measuring how much the average moves. You're measuring how many people land on which side of a line. That's where screening produces real differences in outcomes.

The same method, applied to disease

Two years after Karavani, Shai Carmi (a co-author on the original paper) co-authored Lencz et al. (2021), which extended the same expected-gain framework from continuous traits to disease screening. They found that picking the lowest-risk embryo from a set of just five can cut the relative risk of disease by over 20%. Under favorable conditions, the reduction reaches 50 to 80%.

For parents going through IVF who have a family history of a serious condition, that's the difference between "barely worth doing" and "one of the most impactful decisions you can make for your child's health."

Karavani's paper introduced a sound methodology for estimating how much screening can move the needle. The 2021 paper plugged disease values into the same framework. The math hadn't changed. The inputs had.
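To make the threshold logic concrete, here is a minimal simulation sketch of a liability threshold model. It is our own simplified illustration, not the code or the parameters from Lencz et al. The values r2=0.10, prevalence=0.05, and five embryos are assumptions chosen for the example. The sketch also assumes that half of the score's variance lies between families and half between siblings, and that a person develops the disease when total liability crosses a fixed cutoff.

```python
import random
from statistics import NormalDist


def simulate_risk_reduction(r2, prevalence, n_embryos, n_families=100_000, seed=1):
    """Compare disease risk for a random embryo vs. the lowest-scoring embryo.

    Liability has total variance 1. The polygenic score explains `r2` of it,
    split evenly between a shared family mean and a within-family deviation.
    """
    rng = random.Random(seed)
    threshold = NormalDist().inv_cdf(1.0 - prevalence)  # liability cutoff for disease
    sd_score_part = (r2 / 2.0) ** 0.5   # SD of family mean and of sibling deviation
    sd_unscored = (1.0 - r2) ** 0.5     # SD of everything the score does not capture

    cases_random = 0
    cases_selected = 0
    for _ in range(n_families):
        family_mean = rng.gauss(0.0, sd_score_part)
        scores = [family_mean + rng.gauss(0.0, sd_score_part) for _ in range(n_embryos)]
        # Baseline: take the first embryo, which is equivalent to a random pick.
        if scores[0] + rng.gauss(0.0, sd_unscored) > threshold:
            cases_random += 1
        # Screening: take the embryo with the lowest score (fresh unscored draw).
        if min(scores) + rng.gauss(0.0, sd_unscored) > threshold:
            cases_selected += 1

    risk_random = cases_random / n_families
    risk_selected = cases_selected / n_families
    return risk_random, risk_selected, 1.0 - risk_selected / risk_random


# Illustrative inputs only; they are not taken from any published score.
risk_random, risk_selected, rrr = simulate_risk_reduction(r2=0.10, prevalence=0.05, n_embryos=5)
print(f"Risk, random embryo:       {risk_random:.3f}")
print(f"Risk, lowest-score embryo: {risk_selected:.3f}")
print(f"Relative risk reduction:   {rrr:.1%}")
```

Changing r2, prevalence, or n_embryos shows how sensitive the relative risk reduction is to the inputs, which is the sense in which the math stays fixed while the answer moves.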

Karavani's "limited utility" conclusion still gets cited as a general indictment of PGT-P in Science, MIT Technology Review, STAT News, and policy papers. Lencz 2021, despite being the natural follow-up from the same group, gets almost no coverage in those same outlets. The result is that "2.5 IQ points" has become a stand-in for "embryo screening doesn't work," a generalization the original paper never made.

If you've read about PGT-P and come away thinking "the science shows it barely works," chances are you read an article that took the 2.5-point figure as a permanent ceiling. The number reflected what 2019 input values could deliver. It was never meant as a verdict on what the technology can do.

The disease data is in

Everything above is a mathematical argument about what should happen when you plug disease values into the framework. The empirical question is two-part: how predictive are the disease scores in absolute terms, and does that performance survive the within-family setting that matters for embryo screening.

On the first question, Herasight's polygenic scores explain roughly 24% of the genetic susceptibility for the conditions we screen. That's a substantial fraction of the heritable component, and well above what 2019 IQ predictors could deliver on cognitive variance. On the second, Moore et al. (2025) tested 17 disease polygenic scores on sibling pairs, validating performance in exactly the context that matters for embryo screening: within-family prediction. Embryos from the same IVF cycle are siblings. If a polygenic score works between unrelated individuals but breaks down between siblings, it's not useful for screening. So you have to check.

Sixteen of the 17 scores showed no decrease in predictive performance compared to population-level estimates. The predictions held up within families. That's not a theoretical argument. It's an empirical result from tens of thousands of sibling comparisons. Where Herasight has head-to-head data, the within-family R² runs 122 to 193% higher than competitors' scores.

For type 2 diabetes specifically, the numbers are concrete. In families where both parents have T2D, the absolute risk difference between the highest- and lowest-risk embryos in a set of ten can exceed 40 percentage points.

Forty percentage points of risk difference between embryos from the same family.

That's not a rounding error.

And it's not just type 2 diabetes. The same pattern holds across conditions with strong polygenic components: coronary artery disease, schizophrenia, bipolar disorder. In each case, the within-family predictions tracked closely with population-level estimates. The scores work where they need to work.

Critics raised a legitimate concern about this: do polygenic scores actually predict differences between siblings, or only between unrelated people? Visscher et al. raised it in a 2021 NEJM paper, arguing that predictions validated on unrelated individuals might break down when you compare siblings. Moore et al. answered it empirically: 16 out of 17 disease scores held up within families. The concern was fair, and the empirical answer was clear.

This is what closes the loop with Karavani. The 2019 paper applied its expected-gain framework to continuous traits with weak predictors and got modest numbers. Plug today's disease scores into the same framework and the answer changes. Not because the math changed. Because the inputs did.

Even the IQ numbers are obsolete

Karavani's paper outlined a sound methodology for estimating expected gains from screening. It's essentially the same general approach we use today. The problem isn't the math. The problem is that 2019 input values have been treated as the permanent state of the field, when they were always going to improve.

Karavani used educational-attainment GWAS data from 2019, which explained roughly 4.3% of cognitive variance. That was the best available at the time. It isn't anymore. We released CogPGT 1.0 in 2025, which uses purpose-built scores validated within families on latent general cognitive ability. The within-family R² is 20.79% of latent variance. That's nearly a fivefold improvement over what Karavani had to work with.

Compare the practical outputs carefully. Karavani's 2.5-point figure is the expected gain from picking the top embryo of a set of ten using 2019 predictors. CogPGT measures an average 8.5-point spread between best and worst embryos in a set of three using 2025 predictors. The two numbers measure different things, but they're driven by the same underlying R² that's risen nearly fivefold. That isn't an incremental improvement. It's a qualitative change in what the technology can do.
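For readers who want to see how R² feeds into an expected gain, here is a rough sketch using a standard order-statistics approximation: the expected gain from picking the top of n embryos is the trait SD, times the square root of half the R², times the expected maximum of n standard normal draws. We are assuming this simplified form for illustration. Its absolute outputs are not expected to match published figures, which rest on additional modeling details, and the two R² values below are measured on slightly different scales (observed versus latent variance), so treat the comparison as approximate.

```python
from statistics import NormalDist


def expected_max_std_normal(n, lo=-8.0, hi=8.0, steps=16_000):
    """Expected maximum of n independent standard normal draws (trapezoid rule)."""
    std = NormalDist()
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * dx
        # Density of the maximum of n draws: n * pdf(x) * cdf(x)^(n-1)
        density = n * std.pdf(x) * std.cdf(x) ** (n - 1)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * x * density * dx
    return total


def expected_gain(r2, n_embryos, trait_sd=15.0):
    """Approximate gain over the family average when picking the top-scoring embryo."""
    within_family_sd = trait_sd * (r2 / 2.0) ** 0.5
    return within_family_sd * expected_max_std_normal(n_embryos)


gain_old = expected_gain(r2=0.043, n_embryos=10)    # 2019-era R² quoted above
gain_new = expected_gain(r2=0.2079, n_embryos=10)   # CogPGT R² quoted above
ratio = gain_new / gain_old

print(f"Ratio of expected gains (new / old): {ratio:.2f}")
```

Under this approximation the gain grows with the square root of R², so a nearly fivefold rise in R² translates into a bit more than double the expected gain for the same number of embryos.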

Using Karavani's 2019 numbers to evaluate today's cognitive screening is like quoting a 2019 GPU's benchmark scores to judge a 2025 GPU and concluding that compute hasn't improved.

The genomics field moves fast. Predictors improve as datasets grow and methods advance. Treating a 2019 benchmark as a permanent ceiling is like evaluating modern electric vehicles based on what the first Nissan Leaf could do.

That doesn't mean disease screening and cognitive screening are the same thing. They have different value propositions and different scientific foundations. But critics who cite the "2.5 IQ points" number as if it represents a current limit on any type of embryo screening are working from data that's six years out of date.

Where the evidence actually stands

The Karavani paper isn't wrong. It accurately modeled what it set out to model: mean shifts on continuous traits using 2019-era predictors. The problem is entirely in how the paper has been applied since. When someone cites "only 2.5 IQ points" to argue that embryo screening "barely works," they're treating 2019 input values as a permanent ceiling. Those values have improved nearly fivefold for cognitive screening since. And for disease screening they were never the right inputs to begin with.

The evidence has moved on. We think the conversation should too. The 2019 numbers were never meant as a permanent verdict on what embryo screening can do. Treating them that way means giving prospective parents an out-of-date picture of a technology that's changed substantially in six years.

If you're curious about what current PGT-P could mean for your family's specific situation, please reach out to us. We're happy to walk through the data together.