Holdout groups
A holdout group is a randomly chosen set of users who are intentionally excluded from one or more shipped changes, so their behaviour serves as a long-run baseline. Where an A/B test measures one change briefly, a holdout measures the combined, sustained effect of everything launched, guarding against the slow accumulation of small regressions or overstated wins.
What this means
A holdout group is held back from receiving changes — sometimes a single feature, sometimes the whole stream of launches — for an extended period. Everyone else gets the new experiences. The holdout's metrics become the counterfactual: what the world would look like had you shipped nothing. Comparing the treated population to the holdout estimates the real, accumulated impact.
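One common way to keep the same random users held back for an extended period is to derive membership from a hash of the user ID instead of storing a list. The sketch below is a minimal illustration under that assumption; the function name `in_holdout`, the salt `holdout-v1`, and the 5% default are hypothetical and do not come from any particular tool.

```python
import hashlib


def in_holdout(user_id: str, salt: str = "holdout-v1", holdout_pct: float = 5.0) -> bool:
    """Return True if this user falls in the holdout bucket.

    The salt and the percentage are illustrative assumptions. Hashing
    salt + user ID gives a stable, effectively random bucket, so the
    same user lands in the same group on every request.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # integer bucket in 0..9999
    return bucket < holdout_pct * 100      # 5.0% -> buckets 0..499


# Usage: gate every new launch on the same check.
if __name__ == "__main__":
    for uid in ("user-1", "user-2", "user-3"):
        group = "holdout" if in_holdout(uid) else "treated"
        print(uid, group)
```

Changing the salt reshuffles the population, so a new holdout period can start with a fresh random group.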
Why short tests are not enough
Individual A/B tests are short and are read at launch, when novelty effects and optimistic interpretation can inflate the measured lift. Ship dozens of changes and the headline wins rarely sum to the expected total: some effects decay, some interact, and some were noise. A long-running holdout reveals the true aggregate by keeping a clean baseline untouched by the changes.
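To make the arithmetic concrete, the sketch below compares the lift you would naively expect from compounding launch-time wins with the lift actually measured against a holdout. The function names and all the numbers in the usage example are hypothetical, chosen only for illustration.

```python
from math import prod


def compounded_lift(launch_lifts):
    """Naive expectation: every launch-time relative lift persists and compounds."""
    return prod(1 + lift for lift in launch_lifts) - 1


def realized_share(launch_lifts, holdout_lift):
    """Fraction of the naively expected lift that the holdout comparison confirms."""
    expected = compounded_lift(launch_lifts)
    if expected == 0:
        raise ValueError("expected lift is zero; the share is undefined")
    return holdout_lift / expected


# Usage with made-up numbers: five launches that each reported +2%,
# while the holdout comparison shows +6% overall.
if __name__ == "__main__":
    lifts = [0.02] * 5
    print(f"expected: {compounded_lift(lifts):.3f}")
    print(f"realized share: {realized_share(lifts, 0.06):.2f}")
```

A realized share well below 1 is the pattern described above: some of the reported wins decayed, overlapped, or were noise.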
The cost is real — holdout users miss improvements, and the group must be large enough and maintained long enough to detect the cumulative effect — so teams size and time-box holdouts deliberately.
- Random users kept on the old experience as a baseline
- Measures cumulative, long-run impact of shipped changes
- Denies the group the benefit of improvements in the meantime
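The sizing trade-off above can be sketched with the standard normal-approximation formula for comparing two conversion rates with unequal group sizes. This is a rough illustration, not a sizing recommendation: the function name `holdout_size`, the 19:1 treated-to-holdout ratio (a 5% holdout), and the default significance and power values are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def holdout_size(baseline_rate, relative_lift, ratio=19.0, alpha=0.05, power=0.8):
    """Approximate number of holdout users needed to detect a cumulative lift.

    baseline_rate: conversion rate expected in the holdout (e.g. 0.10)
    relative_lift: cumulative relative lift to detect (e.g. 0.05 for +5%)
    ratio:         treated users per holdout user (assumed 19 for a 5% holdout)
    Uses a two-sided, two-proportion normal approximation.
    """
    p_h = baseline_rate
    p_t = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # The treated group is `ratio` times larger, so its variance term shrinks.
    variance = p_h * (1 - p_h) + p_t * (1 - p_t) / ratio
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_t - p_h) ** 2)


# Usage: smaller cumulative effects need a much larger holdout.
if __name__ == "__main__":
    for lift in (0.10, 0.05, 0.02):
        print(f"+{lift:.0%} lift -> holdout of about {holdout_size(0.10, lift)} users")
```

Because the required size grows with the inverse square of the effect, halving the lift you want to detect roughly quadruples the holdout, which is why the group's size and duration are worth fixing up front.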
How it appears in analytics and logs
A persistent gap between the holdout baseline and the treated population is the durable, cumulative effect of your shipped changes. A shrinking or absent gap warns that short-term test wins did not add up.
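A minimal sketch of reading that gap from aggregate counts follows: it computes the difference in conversion rate between the treated population and the holdout, with a normal-approximation confidence interval. The function name `holdout_gap` and the counts in the usage example are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def holdout_gap(treated_conv, treated_n, holdout_conv, holdout_n, confidence=0.95):
    """Return (difference, lower bound, upper bound) for treated minus holdout rate.

    Inputs are aggregate counts: conversions and users per group.
    The interval uses the unpooled normal (Wald) approximation.
    """
    p_t = treated_conv / treated_n
    p_h = holdout_conv / holdout_n
    diff = p_t - p_h
    se = sqrt(p_t * (1 - p_t) / treated_n + p_h * (1 - p_h) / holdout_n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff, diff - z * se, diff + z * se


# Usage with made-up counts for a 95% / 5% split.
if __name__ == "__main__":
    diff, low, high = holdout_gap(11_400, 95_000, 500, 5_000)
    print(f"gap: {diff:.4f} (interval {low:.4f} to {high:.4f})")
```

An interval that stays above zero across successive reporting windows is consistent with a durable effect; one that drifts toward or straddles zero is the warning sign described above.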
Diagnostic use case
Reserve a holdout when you want to verify that the sum of many shipped experiments actually moved the business over months, not just that each looked good in isolation at launch.
What WebmasterID can help detect
WebmasterID measures first-party conversion and engagement over long windows, which is what comparing a holdout baseline to the treated group requires.
Common mistakes
- Assuming launch-time test wins simply add up over time.
- Making the holdout too small to detect the cumulative effect.
- Contaminating the holdout by leaking changes into it.
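The contamination mistake in particular can be checked mechanically. The sketch below assumes you can export a list of holdout user IDs and a log of (user ID, feature name) exposure events for the features that were supposed to be withheld; both inputs and the function name `contamination_rate` are hypothetical.

```python
def contamination_rate(holdout_ids, exposure_events):
    """Share of holdout users who were exposed to at least one held-back feature.

    holdout_ids:     iterable of user IDs assigned to the holdout
    exposure_events: iterable of (user_id, feature_name) pairs for features
                     the holdout should never have received
    """
    holdout = set(holdout_ids)
    if not holdout:
        return 0.0
    leaked = {user for user, _feature in exposure_events if user in holdout}
    return len(leaked) / len(holdout)


# Usage with made-up data: one of four holdout users saw a withheld feature.
if __name__ == "__main__":
    events = [("a", "new-checkout"), ("a", "new-search"), ("z", "new-checkout")]
    print(contamination_rate(["a", "b", "c", "d"], events))
```

Any non-zero rate means the baseline is no longer clean, and the measured gap will understate the true cumulative effect.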
Privacy and accuracy notes
Holdouts are defined by random assignment and analysed in aggregate, not by profiling individuals. This page is educational, not statistical advice.
Related pages
- Control and variant in experiments
In an experiment the control is the existing version that acts as the baseline, and the variant is the version carrying the one change you are testing. Comparing the two only yields a clean answer when assignment is random and the variant differs from the control in exactly one way. Multiple variants are possible but each must be isolated.
- Feature flags and experiments
A feature flag is a runtime switch that turns functionality on or off for chosen users without a new deploy. Flags power gradual rollouts, kill switches, and — when the audience is split randomly and outcomes are measured — controlled experiments. Understanding the overlap keeps you from confusing a rollout (operational) with an experiment (measured comparison).
- Guardrail metrics in experiments
Guardrail metrics are the secondary measures you monitor during an experiment to make sure a change that improves the primary metric does not quietly damage something important — load time, retention, refunds, support load. They turn 'did the target go up' into the fuller question 'did the target go up without breaking anything'.
- North star metric
A north star metric is the one measure a team chooses to represent the core value it delivers, used to align decisions. Its value is focus: a single shared metric stops teams optimising in different directions. Its risk is tunnel vision — any single metric can be gamed, so it needs guardrail metrics around it and a clear link to real value.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.