The peeking problem in A/B tests
The peeking problem is checking an experiment over and over and stopping the moment it crosses significance. Because each look is another chance for noise to cross the threshold, repeated peeking inflates the false-positive rate well above the nominal level. The fixes are a pre-set sample size or a sequential method designed for continuous monitoring.
What this means
Classical significance tests assume you decide the sample size in advance and look once at the end. Peeking breaks that assumption: every time you check and reserve the right to stop, you give noise another opportunity to wander across the threshold. Do this many times and the probability of a false 'win' climbs far above the 5% you thought you were controlling. With ten equally spaced looks at a 5% threshold, for example, the chance of at least one false 'win' is roughly 19%.
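The inflation described above can be checked with a small simulation. The sketch below is illustrative only: it runs A/A tests (both arms share the same true conversion rate, so every 'win' is a false positive) and compares stopping at the first significant peek with looking once at the end. The function names, the 10% conversion rate, the batch size, and the number of looks are assumptions chosen for the example, and the pooled two-proportion z-test is one common fixed-horizon test, not necessarily the one your tooling uses.

```python
import math
import random


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions): no evidence either way
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_sims=1000, looks=10, batch=200, rate=0.10, alpha=0.05, seed=1):
    """Return (false-positive rate with peeking, false-positive rate with one final look).

    Both arms use the same true rate, so any significant result is a false positive.
    All parameter values are hypothetical defaults for illustration.
    """
    rng = random.Random(seed)
    peek_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        crossed = False
        p = 1.0
        for _ in range(looks):
            # Add one batch of visitors to each arm, then "peek" at the p-value.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = two_sided_p(conv_a, n, conv_b, n)
            if p < alpha:
                crossed = True  # a peeker would have stopped and declared a win here
        peek_hits += crossed
        final_hits += p < alpha  # the single look at the planned end
    return peek_hits / n_sims, final_hits / n_sims


if __name__ == "__main__":
    peeking, single_look = simulate()
    print(f"false-positive rate with peeking:    {peeking:.3f}")
    print(f"false-positive rate with one look:   {single_look:.3f}")
```

Under these assumptions the single-look rate should land near the nominal 5%, while the peeking rate should come out noticeably higher. Exact figures depend on the seed and the number of simulated experiments.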
How to avoid it
The simplest remedy is to set the sample size up front and draw a conclusion only when you reach it. If you genuinely need to monitor continuously, for example to stop a harmful change early, use a sequential testing method (such as alpha-spending or always-valid inference) that is mathematically designed to allow repeated looks without inflating the error rate. What you must not do is run an ordinary fixed-horizon test and stop at the first green result.
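Setting the sample size in advance can be sketched as follows. This is a minimal example using the standard normal-approximation formula for comparing two proportions; the function name, the 10% baseline, and the 2-point absolute minimum detectable effect are assumptions for illustration, and a dedicated power-analysis tool may give slightly different numbers.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test.

    baseline: control conversion rate (e.g. 0.10)
    mde_abs:  smallest absolute lift worth detecting (e.g. 0.02 for 10% -> 12%)
    alpha:    accepted false-positive rate
    power:    1 minus the accepted false-negative rate
    """
    p1 = baseline
    p2 = baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / mde_abs ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline, detect a lift to 12%.
    print(sample_size_per_arm(0.10, 0.02))
```

With these hypothetical inputs the answer is a few thousand visitors per arm. The point of computing it first is that the stopping rule is then fixed before any data arrives, so there is no decision left to be influenced by an early look.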
Peeking is one of the most common reasons A/B 'wins' fail to replicate.
- Each extra look adds a chance for noise to cross the threshold
- Fixed-horizon tests assume one look at the end
- Sequential methods are built for continuous monitoring
How it appears in analytics and logs
If a 'significant' result came from peeking, its significance is overstated: the real chance of a false positive is higher than the nominal significance level implies, so the win may not be real.
Diagnostic use case
Do not stop a fixed-horizon test early because one of several repeated checks happened to look significant; either run to the planned sample size or use a method built for sequential looks.
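As a rough illustration of what 'a method built for sequential looks' changes, the sketch below splits the overall error budget evenly across a pre-planned number of looks (a Bonferroni split). This is a deliberately simple and conservative stand-in, not alpha-spending or always-valid inference proper; those methods use less wasteful boundaries. The function names and the cumulative counts are hypothetical.

```python
import math


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))


def first_valid_stop(looks, total_alpha=0.05):
    """Return the index of the first look that justifies stopping, or None.

    looks: cumulative (conv_a, n_a, conv_b, n_b) totals, one tuple per
           pre-planned look. The number of looks must be fixed in advance.
    Each look is tested at total_alpha / len(looks), so the chance of any
    false positive across all looks stays at or below total_alpha.
    """
    threshold = total_alpha / len(looks)
    for i, (conv_a, n_a, conv_b, n_b) in enumerate(looks):
        if two_sided_p(conv_a, n_a, conv_b, n_b) < threshold:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical cumulative counts at three planned looks.
    looks = [(100, 1000, 130, 1000), (200, 2000, 235, 2000), (300, 3000, 330, 3000)]
    naive_first_p = two_sided_p(*looks[0])
    print(f"p at first look: {naive_first_p:.4f} (below 0.05: {naive_first_p < 0.05})")
    print(f"corrected stopping point: {first_valid_stop(looks)}")
```

In this made-up data the first look dips below 0.05, so a peeker would have stopped and declared a win, yet the difference fades by the final look. Under the split threshold no look justifies stopping.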
What WebmasterID can help detect
WebmasterID reports the first-party conversion counts you monitor; the discipline of deciding when to stop the test stays with you.
Common mistakes
- Stopping a fixed-horizon test at the first significant peek.
- Treating a peeked p-value as if it controlled error correctly.
- Monitoring continuously without a sequential method.
Privacy and accuracy notes
Peeking is a procedural pitfall over aggregate counts; it involves no personal data. WebmasterID supplies the first-party counts being monitored.
Related pages
- Statistical significance and p-values
A result is 'statistically significant' when it would be unlikely if there were really no effect. The p-value is the probability of seeing data at least as extreme as yours assuming the null hypothesis is true — it is not the probability the variant is better, and not a measure of how big the effect is. Significance and practical importance are different questions.
- Sample size in experiments
Sample size is the number of subjects per arm an experiment needs to detect a chosen effect with acceptable error rates. It is computed in advance from the baseline rate, the minimum effect worth detecting, and the false-positive and false-negative rates you accept. Too small and you miss real effects; running until 'it looks good' inflates false positives.
- A/B testing fundamentals
An A/B test randomly assigns visitors to a control (A) or a variant (B), shows each group one version, and compares a pre-chosen metric. Random assignment is what lets you attribute a difference to the change rather than to who happened to see it. The discipline is in deciding the metric and sample size before you start, not after you peek at the numbers.
- WebmasterID docs
Monitor counts without changing your stopping rule.
Sources and verification notes
- NIST/SEMATECH e-Handbook: Sequential and repeated testing concepts
Background on why repeated monitoring requires adjusted procedures.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.