Sample size in experiments
Sample size is the number of subjects per arm an experiment needs to detect a chosen effect with acceptable error rates. It is computed in advance from the baseline rate, the minimum effect worth detecting, and the false-positive and false-negative rates you accept. Too small and you miss real effects; running until 'it looks good' inflates false positives.
What this means
Sample size is how many subjects per variant you need before the experiment can reliably tell signal from noise. It depends on four things: the baseline conversion rate, the smallest effect you care about (the minimum detectable effect), the significance level (false-positive rate), and the power (one minus the false-negative rate). Fix those and the required size follows.
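As a rough illustration of how the four inputs determine the size, the sketch below uses the standard normal-approximation formula for comparing two proportions with a two-sided test and equal arms. The function name `sample_size_per_arm`, its parameter names, and the defaults of 5% significance and 80% power are assumptions chosen for this example, not something the text above prescribes. Dedicated calculators may use slightly different formulas.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline, mde_abs, alpha=0.05, power=0.80):
    """Approximate subjects per arm for a two-sided two-proportion z-test.

    baseline -- baseline conversion rate, e.g. 0.05 (hypothetical parameter name)
    mde_abs  -- minimum detectable effect as an absolute difference, e.g. 0.01
    alpha    -- significance level (accepted false-positive rate)
    power    -- one minus the accepted false-negative rate
    """
    p1 = baseline
    p2 = baseline + mde_abs
    if mde_abs == 0 or not (0 < p1 < 1 and 0 < p2 < 1):
        raise ValueError("rates must lie strictly between 0 and 1, and the MDE must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile matching the desired power

    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)
```

Under these assumptions, `sample_size_per_arm(0.05, 0.01)` (a 5% baseline and a one-percentage-point absolute lift) comes out at roughly eight thousand subjects per arm.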
Why plan it first
If the sample is too small, the test is underpowered: a real effect can be present yet go undetected, and a null result means little. If instead you run with no target and stop when the numbers look good, you are peeking, which inflates false positives. Computing the size up front gives a clear stopping point and an honest interpretation either way.
Smaller effects need larger samples: the required size grows roughly with the inverse square of the effect, so halving the minimum detectable effect roughly quadruples the traffic needed.
- Depends on baseline, effect size, significance, power
- Underpowered tests miss real effects
- A pre-set size prevents result-peeking
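The effect of peeking can be checked with a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so any "significant" result is a false positive) and compares two stopping rules: testing once at a pre-set size, and declaring a win at the first interim look that crosses the threshold. The function names, the 10% rate, the look schedule, and the pooled z-test are all assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_stat(conv_a, conv_b, n):
    """Pooled two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return 0.0 if se == 0 else (conv_b / n - conv_a / n) / se


def simulate_aa(rate=0.10, n_per_arm=1000, look_every=100,
                runs=1000, alpha=0.05, seed=0):
    """Return (peeking false-positive rate, fixed-size false-positive rate).

    All parameter names and defaults are hypothetical choices for this sketch.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        crossed = False
        for i in range(1, n_per_arm + 1):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # Peeking rule: a win is declared if ANY interim look is significant.
            if i % look_every == 0 and abs(z_stat(conv_a, conv_b, i)) > z_crit:
                crossed = True
        peeking_hits += crossed
        # Fixed rule: test exactly once, at the pre-set sample size.
        fixed_hits += abs(z_stat(conv_a, conv_b, n_per_arm)) > z_crit
    return peeking_hits / runs, fixed_hits / runs
```

Calling `simulate_aa()` returns the two false-positive rates. The fixed-size rate should sit near the nominal 5%, while the peeking rate should come out noticeably higher, because each extra look is another chance for noise to cross the threshold.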
How it appears in analytics and logs
A sample-size calculation gives the traffic an experiment needs to reliably detect the effect you care about. If the experiment ends short of that figure, a 'no difference' result may reflect an underpowered test rather than a true absence of effect.
Diagnostic use case
Compute the required sample size before launching so you know how long to run and avoid both underpowered tests and result-peeking.
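Turning a required sample size into a run length is simple arithmetic. The sketch below assumes traffic is split evenly across arms and arrives at a steady daily rate. The function `days_to_run`, its parameters, and the figures in the usage note are hypothetical and only illustrate the calculation.

```python
from math import ceil


def days_to_run(n_per_arm, arms, daily_visitors, traffic_share=1.0):
    """Estimate whole days needed to reach the pre-set sample size.

    n_per_arm      -- required subjects per arm from the sample-size calculation
    arms           -- number of variants, including control
    daily_visitors -- eligible visitors per day (assumed steady)
    traffic_share  -- fraction of eligible traffic entered into the experiment
    """
    if daily_visitors <= 0 or not (0 < traffic_share <= 1):
        raise ValueError("daily_visitors must be positive and traffic_share in (0, 1]")
    total_needed = n_per_arm * arms
    return ceil(total_needed / (daily_visitors * traffic_share))
```

With an illustrative requirement of 8,200 subjects per arm, two arms, and 1,000 eligible visitors a day, `days_to_run(8200, 2, 1000)` gives 17 days. Real traffic varies by day of the week, so treat the result as a planning estimate rather than an exact stopping date.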
What WebmasterID can help detect
WebmasterID's first-party conversion counts give you the baseline rate a sample-size calculation starts from.
Common mistakes
- Launching without computing the required sample size.
- Stopping at an arbitrary point because the result looks good.
- Expecting to detect tiny effects with little traffic.
Privacy and accuracy notes
Sample-size planning uses aggregate rates, not personal data. WebmasterID supplies the first-party counts the calculation needs.
Related pages
- Minimum detectable effect (MDE)
The minimum detectable effect (MDE) is the smallest change in your metric that an experiment is set up to detect reliably. It is an input you choose, not an output: a smaller MDE demands more traffic. Setting the MDE to the smallest difference that would actually matter to the business keeps experiments honestly sized.
- Statistical significance and p-values
A result is 'statistically significant' when it would be unlikely if there were really no effect. The p-value is the probability of seeing data at least as extreme as yours assuming the null hypothesis is true. It is not the probability the variant is better, and not a measure of how big the effect is. Significance and practical importance are different questions.
- The peeking problem in A/B tests
The peeking problem is checking an experiment over and over and stopping the moment it crosses significance. Because each look is another chance for noise to cross the threshold, repeated peeking inflates the false-positive rate well above the nominal level. The fixes are a pre-set sample size or a sequential method designed for continuous monitoring.
- WebmasterID docs
How first-party counts feed your planning.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.