Statistical power
Power is the probability that a test correctly rejects the null hypothesis when a true effect of a stated size exists: power = 1 − β. It rises with sample size, with the size of the effect you want to catch, and with a looser significance threshold; it falls with higher metric variance. Underpowered tests waste traffic by failing to detect real wins, so power is planned before launch.
What power depends on
Power for a conversion test is driven by four inputs: the baseline rate and its variance, the minimum effect you want to detect, the significance level α, and the sample size per variant. Increase traffic or the target effect and power rises; tighten α or face a noisier metric and power falls. A common convention is to design for 0.8 power, meaning an 80% chance of detecting the stated effect if it is real.
- power = 1 − β (β is the type II error rate)
- Rises with sample size and effect size
- Falls with higher variance or a stricter α
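The dependencies listed above can be checked numerically. The following is a minimal sketch, assuming a two-sided two-proportion z-test under the normal approximation with an unpooled standard error. The function name power_two_proportions and the example rates are illustrative and not part of any WebmasterID tooling.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p_base, p_variant, n_per_variant, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumption: normal approximation with an unpooled standard error,
    which is reasonable for large samples but not exact.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two conversion rates.
    se = sqrt(
        p_base * (1 - p_base) / n_per_variant
        + p_variant * (1 - p_variant) / n_per_variant
    )
    # True effect expressed in standard-error units.
    shift = abs(p_variant - p_base) / se
    # Probability that the test statistic lands in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical scenario: 3.0% baseline, 3.3% variant, growing traffic.
for n in (5_000, 20_000, 50_000):
    print(n, round(power_two_proportions(0.030, 0.033, n), 3))
```

Holding the rates fixed and raising n_per_variant shows power climbing; passing a smaller alpha or a smaller gap between the two rates shows it falling.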
Why underpowered tests mislead
An underpowered test that comes back 'not significant' tells you very little: it may simply have lacked the data to see a genuine effect. Worse, among the underpowered tests that do reach significance, the estimated effect tends to be inflated (the winner's curse), because with little data only unusually large estimates clear the significance bar. Plan power up front so a null result is interpretable and a significant one is trustworthy.
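The inflation among significant results can be illustrated with a small simulation. This is a sketch under stated assumptions: the observed lift is drawn from a normal sampling distribution rather than from simulated visitors, and the function name, rates, sample size, and seed are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_significant_wins(p_base, p_variant, n_per_variant,
                              alpha=0.05, runs=20_000, seed=1):
    """Repeat the same underpowered test many times and keep the 'wins'.

    Assumption: the observed lift is approximately normal around the
    true lift, with the usual two-proportion standard error.
    """
    rng = random.Random(seed)
    true_lift = p_variant - p_base
    se = sqrt(
        p_base * (1 - p_base) / n_per_variant
        + p_variant * (1 - p_variant) / n_per_variant
    )
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    wins = []
    for _ in range(runs):
        observed = rng.gauss(true_lift, se)
        # Keep only runs where the variant is a significant winner.
        if observed / se > z_crit:
            wins.append(observed)

    share_significant = len(wins) / runs
    avg_win = mean(wins) if wins else float("nan")
    return true_lift, share_significant, avg_win


true_lift, share, avg_win = simulate_significant_wins(0.030, 0.033, 5_000)
print(f"true lift: {true_lift:.4f}")
print(f"share of runs that were significant wins: {share:.3f}")
print(f"average lift among significant wins: {avg_win:.4f}")
```

In this setup the significance cutoff sits above the true lift, so every run that counts as a win necessarily reports a lift larger than the real one.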
Power analysis is the same calculation as sample sizing, viewed from the other direction.
How it appears in analytics and logs
Low power means a flat result is uninformative — the test may simply have been too small to see the effect you cared about.
Diagnostic use case
Set a target power (a common convention is 0.8) and the smallest effect worth catching, then size the test so you can actually detect it.
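The sizing step can be sketched as follows, assuming a two-sided two-proportion z-test and the standard normal-approximation formula. The function name sample_size_per_variant and the example inputs are illustrative assumptions, not a prescribed method.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(p_base, min_lift_abs, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect an absolute lift.

    Assumption: normal approximation, equal traffic per variant,
    two-sided test at significance level alpha.
    """
    z = NormalDist()
    p_variant = p_base + min_lift_abs
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile for the target power
    variance_sum = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / min_lift_abs ** 2)


# Hypothetical plan: 3.0% baseline, smallest lift worth catching is +0.3 points.
print(sample_size_per_variant(0.030, 0.003))
```

For these hypothetical inputs the result is roughly 53,000 visitors per variant. Halving the minimum lift roughly quadruples the requirement, because the lift enters the formula squared.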
What WebmasterID can help detect
WebmasterID's first-party conversion volumes tell you the traffic available, which bounds the power a test can reach in a given window.
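One way to read that bound is to turn the available traffic into the smallest lift a test could detect in the window. The sketch below assumes plain numbers read off your own reports; it does not call any WebmasterID API, and the function name, traffic figures, and the simplification of using the baseline variance for both variants are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_lift(p_base, visitors_per_day, days, variants=2,
                        alpha=0.05, power=0.8):
    """Approximate smallest absolute lift detectable with the given traffic.

    Assumptions: traffic is split evenly across variants, the test is
    two-sided, and both variants share the baseline variance.
    """
    n_per_variant = (visitors_per_day * days) // variants
    if n_per_variant <= 0:
        raise ValueError("no traffic available in the chosen window")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 * p_base * (1 - p_base) / n_per_variant)


# Hypothetical site: 2,000 eligible visitors per day, 3.0% baseline rate.
for days in (14, 28, 56):
    print(days, round(min_detectable_lift(0.030, 2_000, days), 4))
```

If the detectable lift for a realistic window is larger than the effect you care about, the test cannot reach the target power without more traffic or a longer run.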
Common mistakes
- Reading 'not significant' from an underpowered test as 'no effect'.
- Designing for a tiny effect without the traffic to detect it.
- Ignoring metric variance when estimating power.
Privacy and accuracy notes
Power is a function of aggregate counts and variance, not individuals. No personal data is required to compute it.
Related pages
- Type I and type II errors
Every test can be wrong two ways. A type I error (false positive) declares a difference when none exists; its rate is the significance level α you choose. A type II error (false negative) misses a real difference; its rate is β, and 1−β is statistical power. Lowering one rate, holding sample size fixed, usually raises the other — the trade-off you manage when designing a test.
- Sample size in experiments
Sample size is the number of subjects per arm an experiment needs to detect a chosen effect with acceptable error rates. It is computed in advance from the baseline rate, the minimum effect worth detecting, and the false-positive and false-negative rates you accept. Too small and you miss real effects; running until 'it looks good' inflates false positives.
- Effect size
Effect size is the magnitude of a difference — for conversion, the absolute lift (e.g. 3.0% to 3.3% is +0.3 points) or the relative lift (+10%). It is distinct from significance: a p-value says whether an effect is plausibly non-zero, effect size says whether it is big enough to matter. The smaller the effect you want to catch, the more traffic you need, so effect size anchors test planning.
- Event Explorer
See conversion volumes that bound achievable power.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.