The winner’s curse in experiments
The winner's curse is the tendency for the measured effect of a 'winning' experiment to overstate the true effect, because selecting on statistical significance favours upward noise. It explains why shipped wins often underdeliver in production. Larger samples and replication shrink the inflation. This page explains the mechanism and how to set realistic expectations after a win.
Why selection inflates effects
Every measured effect is the true effect plus noise. When you keep only the experiments that crossed a significance bar, you preferentially keep the ones where noise pushed the estimate up. The surviving winners therefore have measured effects biased above their true values — selection itself causes the inflation.
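The mechanism can be illustrated with a small simulation. This is a minimal sketch under assumed values, not a model of any real test: every experiment has the same true uplift of 1.0 (arbitrary units, for example percentage points), normally distributed noise with a standard error of 1.0, and a one-sided bar of z > 1.96. The function name `simulate_winners` and all parameter values are hypothetical.

```python
import random
import statistics


def simulate_winners(true_effect, std_error, n_experiments=20_000, z_crit=1.96, seed=0):
    """Simulate experiments and return the measured effects of the 'winners'."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    winners = []
    for _ in range(n_experiments):
        # Measured effect = true effect + noise
        measured = rng.gauss(true_effect, std_error)
        # Keep only results that cross the (one-sided) significance bar
        if measured / std_error > z_crit:
            winners.append(measured)
    return winners


TRUE_EFFECT = 1.0  # assumed true uplift, identical for every simulated experiment
winners = simulate_winners(true_effect=TRUE_EFFECT, std_error=1.0)
mean_winner = statistics.mean(winners)

print(f"true effect:             {TRUE_EFFECT:.2f}")
print(f"winners kept:            {len(winners)}")
print(f"mean measured (winners): {mean_winner:.2f}")
```

Under these assumptions every surviving estimate exceeds 1.96, so the winners' average necessarily sits well above the true 1.0, even though the unselected estimates are unbiased.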
It hits marginal wins hardest
The smaller the sample and the closer the result sat to the significance threshold, the larger the inflation. A win that barely cleared the bar in an underpowered test is the most likely to disappoint in production. Well-powered tests with comfortable margins suffer far less inflation.
- Selecting significant results favours upward noise
- Underpowered, marginal wins are inflated most
- Larger samples and replication reduce the bias
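The dependence on sample size can be made concrete with the mean of a truncated normal distribution. The following is a minimal sketch, assuming normally distributed estimates, a one-sided z > 1.96 bar, and a standard error that shrinks with the square root of the sample size. `expected_winner_estimate`, `NOISE_SD`, and the sample sizes are illustrative assumptions, not values from any real test.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def expected_winner_estimate(true_effect, std_error, z_crit=1.96):
    """Mean measured effect among results with estimate / std_error > z_crit."""
    threshold = z_crit * std_error
    alpha = (threshold - true_effect) / std_error
    # Mean of a normal truncated from below:
    # mu + sigma * pdf(alpha) / (1 - cdf(alpha))
    hazard = STD_NORMAL.pdf(alpha) / (1.0 - STD_NORMAL.cdf(alpha))
    return true_effect + std_error * hazard


TRUE_EFFECT = 1.0  # assumed true uplift
NOISE_SD = 10.0    # assumed noise scale, so std_error = NOISE_SD / sqrt(n)

ratios = []
for n in (100, 400, 1600):
    std_error = NOISE_SD / sqrt(n)
    # Exaggeration ratio: expected winning estimate divided by the true effect
    ratio = expected_winner_estimate(TRUE_EFFECT, std_error) / TRUE_EFFECT
    ratios.append(ratio)
    print(f"n={n:5d}  std_error={std_error:.2f}  exaggeration ratio={ratio:.2f}")
```

With these assumed numbers the exaggeration ratio falls toward 1 as the sample grows: the better powered the test, the less a significant result has to rely on favourable noise to clear the bar.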
Setting honest expectations
Treat a winning test's point estimate as an optimistic reading rather than a forecast, especially for marginal wins. Replicate important results, prefer adequately powered tests, and monitor the effect after launch. Building a roadmap on uncorrected, barely-significant uplifts is how a backlog of 'wins' fails to add up to real growth.
How it appears in analytics and logs
A win that scraped past the significance threshold likely overstates its true effect. Production typically delivers less than the test's point estimate suggested.
Diagnostic use case
Discount the headline uplift of a barely-significant win when forecasting impact; expect production results below the test estimate, and replicate before betting big.
What WebmasterID can help detect
WebmasterID's first-party measurement lets you re-observe a shipped win over time, so you can compare the post-launch effect against the test estimate and detect inflation.
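One way to make that comparison explicit is sketched below. It assumes you already have the test's point estimate, a post-launch estimate (for example from a holdout group), and the post-launch standard error as plain numbers. It does not call any WebmasterID API, and `compare_to_test_estimate` and the example figures are hypothetical.

```python
from statistics import NormalDist


def compare_to_test_estimate(test_estimate, post_estimate, post_std_error, confidence=0.95):
    """Compare a post-launch effect with the original test estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    low = post_estimate - z * post_std_error
    high = post_estimate + z * post_std_error
    return {
        # Fraction of the test uplift that actually showed up after launch
        "realized_share": post_estimate / test_estimate,
        "post_interval": (low, high),
        # True when even the top of the post-launch interval sits below the test estimate
        "test_estimate_looks_inflated": high < test_estimate,
    }


# Hypothetical figures: the test reported +4.0%, post-launch shows +1.5% (std. error 0.8)
result = compare_to_test_estimate(test_estimate=4.0, post_estimate=1.5, post_std_error=0.8)
print(result)
```

A realized share well below 1, with the test estimate above the whole post-launch interval, is the pattern the winner's curse predicts for a marginal win. A single comparison is still noisy, so treat it as a prompt to replicate rather than as proof.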
Common mistakes
- Forecasting production impact from a marginal win's point estimate.
- Shipping underpowered wins without replication.
- Summing many barely-significant uplifts as if each were certain.
Privacy and accuracy notes
The winner's curse arises from how results are selected and is assessed using aggregate results only. No personal data is involved.
Related pages
- Regression to the mean in tests
Regression to the mean is the statistical tendency for an extreme measurement to be followed by one closer to the average. In experimentation it explains why a page picked because it converted unusually well often 'declines' afterward, and why early test readings overstate effects. Recognising it prevents crediting a change for a return to normal. This page explains the mechanism.
- Sample size in experiments
Sample size is the number of subjects per arm an experiment needs to detect a chosen effect with acceptable error rates. It is computed in advance from the baseline rate, the minimum effect worth detecting, and the false-positive and false-negative rates you accept. Too small and you miss real effects; running until 'it looks good' inflates false positives.
- The peeking problem in A/B tests
The peeking problem is checking an experiment over and over and stopping the moment it crosses significance. Because each look is another chance for noise to cross the threshold, repeated peeking inflates the false-positive rate well above the nominal level. The fixes are a pre-set sample size or a sequential method designed for continuous monitoring.
- P-value misconceptions
The p-value is one of the most misread numbers in experimentation. It is the probability of seeing data at least as extreme as observed if the null hypothesis were true — not the probability the null is true, not the probability of a fluke, and not a measure of effect size. The American Statistical Association issued a formal statement listing exactly these misconceptions.
Sources and verification notes
- Wikipedia: Winner’s curse. Selection-induced overestimation of effects.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.