Sequential testing for experiments
Sequential testing is a family of statistical methods designed for repeated looks at accumulating data. Naive peeking at a fixed-horizon test inflates the false-positive rate; sequential methods such as always-valid p-values and group sequential boundaries adjust for the multiple looks so you can monitor and stop early while keeping error control.
What this means
A standard fixed-horizon test fixes the sample size in advance and is only valid if you analyse once, at the end. Sequential testing instead provides procedures that remain valid under continuous or repeated monitoring. Approaches include group sequential designs with pre-set interim boundaries (alpha spending) and 'always-valid' inference that yields confidence sequences and p-values holding at every look.
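To make "alpha spending" concrete, here is a minimal sketch of two widely used spending functions in the Lan-DeMets style: an O'Brien-Fleming-type function, which spends very little of the error budget early, and a Pocock-type function, which spends it more evenly. The function names, the 5% default and the four equally spaced looks are assumptions chosen for illustration. The sketch only shows how much cumulative alpha is available at each look; turning that into actual stopping boundaries needs numerical integration that dedicated group sequential software handles.

```python
from math import e, log, sqrt
from statistics import NormalDist

_norm = NormalDist()  # standard normal distribution


def obf_spend(t, alpha=0.05):
    """O'Brien-Fleming-type spending: cumulative two-sided alpha at information fraction t."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    z = _norm.inv_cdf(1 - alpha / 2)
    return 2 * (1 - _norm.cdf(z / sqrt(t)))


def pocock_spend(t, alpha=0.05):
    """Pocock-type spending: cumulative alpha at information fraction t."""
    if not 0 < t <= 1:
        raise ValueError("information fraction t must be in (0, 1]")
    return alpha * log(1 + (e - 1) * t)


# Hypothetical design: four equally spaced looks (25%, 50%, 75%, 100% of the data).
for t in (0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}  OBF-type={obf_spend(t):.5f}  Pocock-type={pocock_spend(t):.5f}")
```

Both functions reach the full alpha at t = 1, so the overall error budget is the same; they differ only in how early it becomes available.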
Why it solves peeking
The peeking problem is that each extra look at a fixed-horizon test is another chance to cross the threshold by luck, so many looks drive the real false-positive rate well above the nominal level. Sequential methods build the multiplicity of looks into the math: the stopping boundaries are wider, or the p-values are adjusted, so that the overall error rate stays controlled across every planned look or, for always-valid methods, no matter how often you check.
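The trade-off is efficiency: an honest early stop usually happens only when the observed effect is clearly larger than the minimum you planned for, and if the true effect is small the test can need a somewhat larger maximum sample than a fixed-horizon test with the same power.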
- Valid under continuous or repeated monitoring
- Group sequential boundaries or always-valid p-values
- Controls the error rate that naive peeking destroys
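The inflation caused by naive peeking can be seen in a small simulation. The sketch below runs A/A experiments (no true effect) and applies an ordinary two-sided z-test at every look, stopping at the first "significant" result. It assumes normally distributed data with known unit variance, and the function name, look counts and seed are illustrative choices rather than anything prescribed by the text above.

```python
import random
from math import sqrt


def false_positive_rate(n_looks, n_per_look=100, n_sims=5000, z_crit=1.96, seed=0):
    """Share of A/A experiments declared significant at ANY of n_looks naive looks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total = 0.0  # running sum of unit-variance observations, true mean 0
        for k in range(1, n_looks + 1):
            # The sum of n_per_look N(0, 1) draws is N(0, n_per_look),
            # so one draw stands in for a whole batch of new observations.
            total += rng.gauss(0.0, sqrt(n_per_look))
            z = total / sqrt(k * n_per_look)  # z-statistic on all data so far
            if abs(z) > z_crit:  # the naive "stop when significant" rule
                hits += 1
                break
    return hits / n_sims


print("1 look  :", false_positive_rate(1))
print("5 looks :", false_positive_rate(5))
print("20 looks:", false_positive_rate(20))
```

With a single look the rate should land near the nominal 5%. With 20 looks it comes out several times higher, even though nothing about the data has changed.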
How it appears in analytics and logs
A sequential test crossing its boundary is a valid stop signal even though you looked many times. With a fixed-horizon test, the same repeated looking would have invalidated the error rate.
Diagnostic use case
Use a sequential method when you want to watch a test as it runs and stop as soon as there is enough evidence, without the false-positive inflation that ad-hoc peeking causes.
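As one illustration of a method built for this kind of monitoring, the sketch below computes always-valid p-values with a mixture sequential probability ratio test. It rests on several simplifying assumptions: observations are normal with known standard deviation `sigma`, the null hypothesis is a mean of zero, and the mixing prior over the effect is normal with standard deviation `tau`. The function name and parameters are hypothetical, and real experiments on conversion rates need a variant suited to that data.

```python
import random
from math import exp, log


def always_valid_p_values(xs, sigma=1.0, tau=1.0):
    """Running always-valid p-values for H0: mean = 0, via a normal-mixture SPRT."""
    p = 1.0
    running_sum = 0.0
    out = []
    for n, x in enumerate(xs, start=1):
        running_sum += x
        v = sigma**2 + n * tau**2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = 0.5 * log(sigma**2 / v) + (running_sum**2 * tau**2) / (2 * sigma**2 * v)
        # The p-value is the running minimum of 1 / likelihood ratio, capped at 1.
        p = min(p, exp(-log_lr))
        out.append(p)
    return out


# Usage: monitor after every observation and stop the first time p <= alpha.
rng = random.Random(42)
stream = [rng.gauss(0.3, 1.0) for _ in range(1000)]  # simulated data with a true effect
p_values = always_valid_p_values(stream)
stop_at = next((i + 1 for i, p in enumerate(p_values) if p <= 0.05), None)
print("stopped at observation:", stop_at)
```

Because the p-value is a running minimum, it never goes back up after a look, which is what makes "stop the first time it drops below alpha" a valid rule here and not in a fixed-horizon test.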
What WebmasterID can help detect
WebmasterID measures the first-party events that accumulate over an experiment; a sequential method is one valid way to decide when those accumulated counts justify stopping.
Common mistakes
- Applying a fixed-horizon significance threshold at every peek.
- Assuming early stops are free of any sample-size cost.
- Mixing methods, such as stopping on a sequential boundary and then reporting an unadjusted fixed-horizon p-value.
Privacy and accuracy notes
Sequential methods analyse aggregate streams of conversions and exposures, not individuals. This page is educational and not statistical consulting.
Related pages
- The peeking problem in A/B tests
The peeking problem is checking an experiment over and over and stopping the moment it crosses significance. Because each look is another chance for noise to cross the threshold, repeated peeking inflates the false-positive rate well above the nominal level. The fixes are a pre-set sample size or a sequential method designed for continuous monitoring.
- Statistical significance and p-values
A result is 'statistically significant' when it would be unlikely if there were really no effect. The p-value is the probability of seeing data at least as extreme as yours assuming the null hypothesis is true — it is not the probability the variant is better, and not a measure of how big the effect is. Significance and practical importance are different questions.
- Sample size in experiments
Sample size is the number of subjects per arm an experiment needs to detect a chosen effect with acceptable error rates. It is computed in advance from the baseline rate, the minimum effect worth detecting, and the false-positive and false-negative rates you accept. Too small and you miss real effects; running until 'it looks good' inflates false positives.
- Frequentist vs Bayesian experiment analysis
Frequentist and Bayesian are two coherent ways to analyse the same experiment data. Frequentist methods ask how likely the observed data is under a null hypothesis and report p-values and confidence intervals. Bayesian methods combine a prior with the data to report posterior probabilities and credible intervals. Each has assumptions and failure modes; neither is universally 'correct'.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.