Win rate and experiment portfolio
Across mature experimentation programs, a large share of tests show no improvement — flat and negative results are the norm, not failure. Win rate (the fraction of tests that win) is a portfolio property to interpret, not a target to maximise: chasing a high win rate encourages timid tests and peeking. What matters is cumulative validated impact, balanced against learning from null and negative results.
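To see why a low win rate is the expected outcome of honest testing, the sketch below simulates a portfolio. It models each test's result as a z-score, meaning the true effect in units of standard error plus unit normal noise, and counts a "win" when the score clears the two-sided 5% threshold in the positive direction. The parameters are illustrative assumptions, not measured figures: 80% of ideas have no effect (`share_null`), and the rest have a true effect of two standard errors (`true_effect_in_se`, about 52% power).

```python
import random
from dataclasses import dataclass

Z_CRIT = 1.96  # two-sided 5% significance threshold


@dataclass
class Outcome:
    wins: int
    losses: int
    flats: int

    @property
    def win_rate(self) -> float:
        total = self.wins + self.losses + self.flats
        return self.wins / total


def simulate_portfolio(n_tests=1000, share_null=0.8, true_effect_in_se=2.0, seed=7):
    """Simulate fixed-horizon tests; all parameters are illustrative assumptions."""
    rng = random.Random(seed)
    wins = losses = flats = 0
    for _ in range(n_tests):
        # Most ideas are assumed to do nothing; the rest help by a fixed amount.
        true_z = 0.0 if rng.random() < share_null else true_effect_in_se
        observed_z = true_z + rng.gauss(0.0, 1.0)
        if observed_z > Z_CRIT:
            wins += 1
        elif observed_z < -Z_CRIT:
            losses += 1
        else:
            flats += 1
    return Outcome(wins, losses, flats)


result = simulate_portfolio()
print(f"win rate {result.win_rate:.1%}, losses {result.losses}, flats {result.flats}")
```

Under these assumptions the expected win rate works out to roughly 12% (0.8 × 2.5% false wins plus 0.2 × about 52% power), and most tests come out flat, even though the program is doing nothing wrong.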
Nulls and negatives are normal
Practitioners at companies with large experimentation programs report that only a minority of tests produce a clear win; many are flat and some are negative. This is expected: ideas are hypotheses, and most hypotheses are wrong or neutral. A program that 'wins' almost every test is more likely to be peeking, testing only trivially safe changes, or measuring loosely than to be genuinely brilliant. Negative results are valuable: they stop you shipping harm.
- Most tests are flat or negative — that is healthy
- A near-perfect win rate is a warning, not a triumph
- Negative results prevent shipping harm
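Peeking is one common way a win rate gets inflated. The sketch below runs A/A tests (no true effect) and checks a running z-statistic at `n_looks` equally sized interim looks, modelled as a cumulative sum of standard normal increments. A "peeker" ships as soon as any look crosses the threshold; a disciplined tester only reads the final look. The look count and experiment count are assumptions chosen for illustration.

```python
import math
import random

Z_CRIT = 1.96  # nominal two-sided 5% threshold, applied to positive "wins"


def aa_win_rates(n_experiments=2000, n_looks=20, seed=11):
    """Return (peeking win rate, fixed-horizon win rate) for no-effect tests."""
    rng = random.Random(seed)
    peeking_wins = fixed_wins = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        peeked_win = False
        for look in range(1, n_looks + 1):
            # Each look adds an equal batch of data; under the null the
            # standardised running sum is a z-statistic.
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if z > Z_CRIT:
                peeked_win = True  # a peeker would stop and ship here
        if peeked_win:
            peeking_wins += 1
        if z > Z_CRIT:  # fixed horizon: only the final look counts
            fixed_wins += 1
    return peeking_wins / n_experiments, fixed_wins / n_experiments


peeking, fixed = aa_win_rates()
print(f"A/A 'wins' with peeking: {peeking:.1%}, fixed horizon: {fixed:.1%}")
```

The fixed-horizon rate stays near the nominal 2.5%, while repeated looks let noise cross the line several times more often, producing "wins" from changes that do nothing.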
Measure the portfolio, not the hit rate
Win rate alone is gameable and misleading: you can inflate it by only running safe tests or by peeking until something crosses significance. Better portfolio measures are cumulative validated impact (the summed, holdout-confirmed effect of shipped winners) and learning velocity (hypotheses resolved per period). Pair these with discipline — pre-registered metrics, fixed durations, guardrails — so the impact you count is real.
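A minimal sketch of the three portfolio measures side by side, assuming a hypothetical `Experiment` record with a pre-registered outcome label and an optional holdout-confirmed lift. Following the definition above, cumulative validated impact sums holdout-confirmed lifts of shipped winners; adding relative lifts is a first-order simplification, and unvalidated wins contribute nothing.

```python
from dataclasses import dataclass
from typing import Optional

RESOLVED = {"win", "loss", "flat"}


@dataclass
class Experiment:
    name: str
    outcome: str  # "win", "loss", "flat" or "running", decided by pre-registered rules
    holdout_lift: Optional[float] = None  # holdout-confirmed relative lift; None if unvalidated


def win_rate(experiments):
    """Fraction of resolved tests that won: reported, not maximised."""
    resolved = [e for e in experiments if e.outcome in RESOLVED]
    if not resolved:
        return 0.0
    return sum(e.outcome == "win" for e in resolved) / len(resolved)


def cumulative_validated_impact(experiments):
    """Sum of holdout-confirmed lifts of shipped wins (first-order approximation)."""
    return sum(e.holdout_lift for e in experiments
               if e.outcome == "win" and e.holdout_lift is not None)


def learning_velocity(experiments, period_weeks):
    """Hypotheses resolved per week; wins, losses and nulls all count."""
    return sum(e.outcome in RESOLVED for e in experiments) / period_weeks


# Hypothetical portfolio for illustration only.
portfolio = [
    Experiment("new-checkout", "win", holdout_lift=0.025),
    Experiment("search-ranking", "win"),   # not yet holdout-confirmed, so not counted
    Experiment("hero-copy", "loss"),       # stopped a harmful change
    Experiment("badge-colour", "flat"),
    Experiment("pricing-page", "running"),
]
print(win_rate(portfolio), cumulative_validated_impact(portfolio), learning_velocity(portfolio, 4))
```

Note that the loss and the null raise learning velocity without touching validated impact, and the unconfirmed win raises the win rate without adding to impact.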
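This view carries the winner's curse and prioritisation up to the program level: the measured effects of shipped winners tend to overstate their true effects, so their sum does too, and testing capacity is allocated across the whole backlog rather than test by test.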
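The program-level winner's curse can be sketched by drawing true lifts from a distribution centred on zero, adding measurement noise, keeping only significant positive results, and comparing the summed measured lift of those "winners" with their summed true lift. The spread of true lifts (`true_sd`) and the standard error (`se`) are assumed values for illustration only.

```python
import random

Z_CRIT = 1.96  # significance threshold used to select "winners"


def winners_curse(n_tests=5000, true_sd=0.005, se=0.01, seed=3):
    """Return (number of winners, summed measured lift, summed true lift)."""
    rng = random.Random(seed)
    measured_total = true_total = 0.0
    n_winners = 0
    for _ in range(n_tests):
        true_lift = rng.gauss(0.0, true_sd)          # assumed distribution of real effects
        measured_lift = true_lift + rng.gauss(0.0, se)  # what the A/B test reports
        if measured_lift / se > Z_CRIT:
            # Selecting on significance favours upward noise.
            n_winners += 1
            measured_total += measured_lift
            true_total += true_lift
    return n_winners, measured_total, true_total


n, measured, true = winners_curse()
print(f"{n} winners: measured total {measured:.3f}, true total {true:.3f}")
```

With these assumptions the summed measured lift is several times the summed true lift, which is why only holdout-confirmed effects should enter the cumulative impact figure.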
How it appears in analytics and logs
A suspiciously high win rate can signal peeking, weak guardrails, or testing only safe bets; many nulls are normal and still informative.
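One way to turn "suspiciously high" into a check is to put a Wilson score interval around the observed win rate and flag the program when even the lower bound exceeds a plausibility ceiling, alongside a check on how many tests were stopped before their planned duration. Both thresholds (`ceiling`, `max_early_share`) are hypothetical values to be set from your own program's history, not established benchmarks.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        raise ValueError("need at least one experiment")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def win_rate_flags(wins, n_tests, stopped_early, ceiling=0.5, max_early_share=0.1):
    """Return warning strings; thresholds are hypothetical defaults."""
    flags = []
    low, _ = wilson_interval(wins, n_tests)
    if low > ceiling:
        flags.append(f"win rate lower bound {low:.2f} exceeds ceiling {ceiling}")
    if stopped_early / n_tests > max_early_share:
        flags.append(f"{stopped_early}/{n_tests} tests stopped before planned duration")
    return flags


print(win_rate_flags(45, 50, 12))
print(win_rate_flags(6, 50, 2))
```

A flag is a prompt to audit stopping rules and test selection, not proof of misconduct.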
Diagnostic use case
Judge an experimentation program by cumulative validated impact and learning, not by maximising the share of tests that 'win'.
What WebmasterID can help detect
WebmasterID's first-party conversion data feeds each experiment, and those experiments' outcomes roll up into the portfolio view.
Common mistakes
- Treating a high win rate as the goal, encouraging timid tests and peeking.
- Discarding null and negative results as wasted effort.
- Counting unvalidated 'wins' without holdout confirmation.
Privacy and accuracy notes
Portfolio metrics are aggregates over experiments, not individuals; no personal data is involved.
Related pages
- The winner’s curse in experiments
The winner's curse is the tendency for the measured effect of a 'winning' experiment to overstate the true effect, because selecting on statistical significance favours upward noise. It explains why shipped wins often underdeliver in production. Larger samples and replication shrink the inflation. This page explains the mechanism and how to set realistic expectations after a win.
- Experiment roadmap and prioritization
An experiment roadmap is a prioritised backlog of test ideas, ordered so that limited testing capacity goes to the experiments most likely to teach or earn the most per unit of effort. Frameworks such as ICE (Impact, Confidence, Ease) and PIE (Potential, Importance, Ease) provide a structured score — useful for comparison, but built from subjective estimates that should not be mistaken for measured fact.
- Holdout groups
A holdout group is a randomly chosen set of users who are intentionally excluded from one or more shipped changes, so their behaviour serves as a long-run baseline. Where an A/B test measures one change briefly, a holdout measures the combined, sustained effect of everything launched, guarding against the slow accumulation of small regressions or overstated wins.
- Website observability
Aggregate validated impact across experiments.
Sources and verification notes
- Kohavi & Thomke: The Surprising Power of Online Experiments (Harvard Business Review). Documents that a minority of experiments produce clear wins.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.