How long to run an A/B test
An A/B test runs until it has collected the sample size its design requires — derived from the baseline rate, the minimum detectable effect, and the chosen power. Duration also has to span full business cycles (weekday/weekend) to avoid day-of-week bias. Stopping the moment a result looks significant inflates false positives. This page explains how duration is set honestly.
Duration follows from sample size
You do not pick a duration; you pick a sample size and let traffic determine how long collecting it takes. The sample size comes from the baseline conversion rate, the minimum detectable effect you care about, and your power and significance targets. Daily eligible traffic then converts that sample into a number of days.
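The calculation above can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: it assumes a two-sided two-proportion z-test with equal allocation and uses the common normal-approximation formula. The function names, the relative (rather than absolute) MDE, and the default alpha of 0.05 and power of 0.80 are illustrative assumptions, not values taken from this page.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm (normal approximation).

    baseline     -- baseline conversion rate, e.g. 0.05
    relative_mde -- smallest relative lift worth detecting, e.g. 0.10 for +10%
    alpha, power -- assumed defaults; choose your own targets
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def days_needed(n_per_arm, daily_eligible_visitors, arms=2):
    """Convert the total sample into days of traffic, rounded up."""
    return math.ceil(n_per_arm * arms / daily_eligible_visitors)


n = sample_size_per_arm(baseline=0.05, relative_mde=0.10)
print(n, days_needed(n, daily_eligible_visitors=5000))
```

With a 5% baseline and a 10% relative MDE, this comes out at roughly 31,000 visitors per arm, which at 5,000 eligible visitors per day is a little under two weeks of traffic. Halving the MDE roughly quadruples the required sample, which is why the MDE choice dominates duration.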
Cover full business cycles
Even after hitting the sample target, run across complete weekly cycles. Conversion behaviour differs by day of week and by pay cycle; a test that runs Tuesday to Thursday can be biased by who shows up midweek. Whole-week multiples reduce day-of-week confounding.
- Sample size sets the floor on data needed
- Run whole weeks to absorb day-of-week effects
- Do not stop the instant a peek looks significant
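Rounding a traffic-based estimate up to whole weeks is simple arithmetic. The helper below is a hypothetical illustration (the name `planned_duration_days` and the 7-day default cycle are assumptions); a business with a strong pay-cycle effect might pass a longer cycle instead.

```python
import math


def planned_duration_days(min_days, cycle_days=7):
    """Round a minimum duration up to whole cycles, with at least one cycle."""
    cycles = max(1, math.ceil(min_days / cycle_days))
    return cycles * cycle_days


print(planned_duration_days(13))  # 13 days of traffic needed -> plan 14 days
```

A test that needs 13 days of traffic is planned for 14, and a test that needs only 3 days still runs a full 7 so that every weekday is represented once.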
The early-stopping trap
Repeatedly checking significance and stopping at the first 'win' is the peeking problem: every additional look is another chance for noise to cross the threshold, so the false-positive rate rises well above the nominal level. If you need the option to stop early, use a method designed for it, such as sequential testing or a Bayesian approach, rather than reading a fixed-horizon test continuously.
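The effect of peeking can be seen in a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate so any 'significant' result is a false positive, and compares stopping at the first daily look with p below alpha against reading the test once at the end. All parameters (400 runs, 14 daily looks, 100 visitors per arm per day, a 10% rate, a pooled two-proportion z-test) are assumptions chosen for illustration.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(conv_a, n_a, conv_b, n_b):
    """p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation observed yet, nothing to test
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa(runs=400, days=14, per_arm_per_day=100, rate=0.10,
                alpha=0.05, seed=1):
    """Return (false-positive rate with daily peeking, rate at fixed horizon)."""
    rng = random.Random(seed)
    stopped_on_peek = 0
    significant_at_end = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        crossed = False
        for _day in range(days):
            conv_a += sum(rng.random() < rate for _ in range(per_arm_per_day))
            conv_b += sum(rng.random() < rate for _ in range(per_arm_per_day))
            n += per_arm_per_day
            p = two_sided_p(conv_a, n, conv_b, n)
            if p < alpha:
                crossed = True  # a peeker would have stopped here
        stopped_on_peek += crossed
        significant_at_end += p < alpha  # p from the final day only
    return stopped_on_peek / runs, significant_at_end / runs


peek_rate, fixed_rate = simulate_aa()
print(f"peeking daily: {peek_rate:.3f}  fixed horizon: {fixed_rate:.3f}")
```

The fixed-horizon rate should land near alpha, while the peeking rate is typically several times higher, even though no real difference exists between the arms.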
How it appears in analytics and logs
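A test reported as having 'reached significance' on day two was most likely stopped on an early peek, and its result carries a high risk of being a false positive. Tests that end after only a few days also miss weekly seasonality, because some weekdays are never sampled.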
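A simple check over finished tests can surface these patterns. The sketch below is hypothetical: the function name `audit_test` and its inputs (start date, end date, observed and planned visitors per arm) are assumed for illustration and do not correspond to any particular analytics export.

```python
from datetime import date


def audit_test(start, end, observed_per_arm, planned_per_arm):
    """Return a list of warnings for a finished fixed-horizon test."""
    warnings = []
    days_run = (end - start).days + 1  # count both the start and end date
    if observed_per_arm < planned_per_arm:
        warnings.append("stopped before planned sample size")
    if days_run < 7:
        warnings.append("ran less than one full week")
    elif days_run % 7 != 0:
        warnings.append("duration is not a whole number of weeks")
    return warnings


print(audit_test(date(2024, 3, 5), date(2024, 3, 6), 1200, 31231))
```

A test that ended after two days with a fraction of its planned sample triggers both the sample-size and the full-week warnings.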
Diagnostic use case
Compute the required sample size first, then divide by daily eligible traffic to estimate duration; run at least one full weekly cycle and resist stopping on an early peek.
What WebmasterID can help detect
WebmasterID's first-party traffic counts let you estimate eligible daily volume realistically, so the duration you plan reflects the audience you actually have.
Common mistakes
- Stopping as soon as p dips below the threshold.
- Running only a few days and missing weekly seasonality.
- Picking a duration before computing the needed sample size.
Privacy and accuracy notes
Duration is a function of aggregate counts and traffic. Planning it needs no personal data — only the baseline rate and traffic volume.
Related pages
- Sample size in experiments
Sample size is the number of subjects per arm an experiment needs to detect a chosen effect with acceptable error rates. It is computed in advance from the baseline rate, the minimum effect worth detecting, and the false-positive and false-negative rates you accept. Too small and you miss real effects; running until 'it looks good' inflates false positives.
- The peeking problem in A/B tests
The peeking problem is checking an experiment over and over and stopping the moment it crosses significance. Because each look is another chance for noise to cross the threshold, repeated peeking inflates the false-positive rate well above the nominal level. The fixes are a pre-set sample size or a sequential method designed for continuous monitoring.
- Sequential testing for experiments
Sequential testing is a family of statistical methods designed for repeated looks at accumulating data. Naive peeking at a fixed-horizon test inflates the false-positive rate; sequential methods such as always-valid p-values and group sequential boundaries adjust for the multiple looks so you can monitor and stop early while keeping error control.
- Minimum detectable effect (MDE)
The minimum detectable effect (MDE) is the smallest change in your metric that an experiment is set up to detect reliably. It is an input you choose, not an output: a smaller MDE demands more traffic. Setting the MDE to the smallest difference that would actually matter to the business keeps experiments honestly sized.
Sources and verification notes
- Wikipedia, "Sequential analysis": why fixed-horizon tests should not be peeked.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.