Type I and type II errors
Every test can be wrong two ways. A type I error (false positive) declares a difference when none exists; its rate is the significance level α you choose. A type II error (false negative) misses a real difference; its rate is β, and 1−β is statistical power. Lowering one rate, holding sample size fixed, usually raises the other — the trade-off you manage when designing a test.
The two ways a test is wrong
Hypothesis testing frames a decision against a null hypothesis of 'no effect'. A type I error rejects that null when it is actually true — you see a winner that is really noise. A type II error fails to reject the null when it is actually false — a real effect slips through as 'not significant'. The names come from Neyman and Pearson's decision framework.
- Type I: false positive, rate = α (e.g. 0.05)
- Type II: false negative, rate = β
- Power = 1 − β, the chance of catching a real effect
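The three definitions above can be checked with a small simulation. This is a minimal sketch under stated assumptions, not a prescribed method: it uses a two-sided one-sample z-test with known standard deviation 1, and the function names, effect size (0.4), sample size (50) and trial count are illustrative choices that do not come from this page.

```python
import random
from statistics import NormalDist, fmean


def rejects_null(sample, alpha, sigma=1.0):
    """Two-sided one-sample z-test of H0: mean = 0, with known sigma (assumed)."""
    n = len(sample)
    z = fmean(sample) / (sigma / n ** 0.5)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > z_crit


def rejection_rate(true_mean, n, alpha, trials, rng):
    """Fraction of simulated experiments in which the null is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        if rejects_null(sample, alpha):
            rejections += 1
    return rejections / trials


def simulate_error_rates(effect=0.4, n=50, alpha=0.05, trials=4000, seed=1):
    """Return (type I rate, type II rate, power) estimated by simulation."""
    rng = random.Random(seed)
    # Null is true (mean 0): every rejection is a type I error.
    type_i = rejection_rate(0.0, n, alpha, trials, rng)
    # Null is false (mean = effect): every non-rejection is a type II error.
    type_ii = 1 - rejection_rate(effect, n, alpha, trials, rng)
    return type_i, type_ii, 1 - type_ii


if __name__ == "__main__":
    type_i, type_ii, power = simulate_error_rates()
    print(f"type I rate  ~ {type_i:.3f}")
    print(f"type II rate ~ {type_ii:.3f}")
    print(f"power        ~ {power:.3f}")
```

The simulated type I rate should land close to the chosen α, while the type II rate depends on the assumed effect size and sample size.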
The trade-off you control
With sample size fixed, tightening α (fewer false positives) widens the region where you fail to reject, which raises β (more false negatives); loosening α does the reverse. To lower both at once you need more information per test: collect more data, reduce metric variance, or target a larger effect. This is why power and sample-size planning happen before launch, not after.
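The trade-off can be made concrete by computing β directly for several values of α. The sketch below assumes a two-sided, two-sample z-test with equal arms and a known standard deviation; `type_ii_rate` and the example numbers (effect 0.2, sigma 1.0, 200 or 800 users per arm) are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def type_ii_rate(alpha, effect, sigma, n_per_arm):
    """Beta for a two-sided, two-sample z-test with equal arms and known sigma."""
    standard_error = sigma * (2 / n_per_arm) ** 0.5
    shift = effect / standard_error  # true effect in standard-error units
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Probability that the z statistic falls inside the non-rejection
    # region [-z_crit, z_crit] even though the true effect is non-zero.
    return _STD_NORMAL.cdf(z_crit - shift) - _STD_NORMAL.cdf(-z_crit - shift)


if __name__ == "__main__":
    for n in (200, 800):
        for alpha in (0.10, 0.05, 0.01):
            beta = type_ii_rate(alpha, effect=0.2, sigma=1.0, n_per_arm=n)
            print(f"n={n:4d} alpha={alpha:.2f} beta={beta:.3f} power={1 - beta:.3f}")
```

At a fixed sample size, each stricter α yields a larger β; moving to the larger sample size lowers β at every α.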
Choose α and β from the cost of each mistake: shipping a useless change versus missing a good one.
How it appears in analytics and logs
A significant result might be a type I error; a flat result might be a type II error from too little data. Neither outcome is proof on its own.
Diagnostic use case
Decide your tolerance for false positives (α) and false negatives (β) before running, then size the test so both error rates are acceptable.
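One way to size a test from a chosen α and β is the standard normal-approximation formula for comparing two conversion rates. The sketch below is an assumption-laden illustration rather than a recommended procedure: `sample_size_per_arm` is a hypothetical helper, the baseline and target rates are made-up inputs, and the formula assumes equal arms and a two-sided test.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(p_control, p_variant, alpha=0.05, beta=0.20):
    """Approximate users needed per arm to detect p_control vs p_variant.

    Uses n = (z_{1-alpha/2} + z_{1-beta})^2 * (var_c + var_v) / delta^2,
    the usual normal approximation for a two-sided two-proportion test.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # controls type I error
    z_beta = std_normal.inv_cdf(1 - beta)        # controls type II error
    variance_sum = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    delta = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, hoping to detect 12%.
    print(sample_size_per_arm(0.10, 0.12, alpha=0.05, beta=0.20))
    print(sample_size_per_arm(0.10, 0.12, alpha=0.01, beta=0.20))
    print(sample_size_per_arm(0.10, 0.12, alpha=0.05, beta=0.10))
```

Tightening either α or β increases the required sample, which is the cost of making both error rates acceptable at once.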
What WebmasterID can help detect
WebmasterID supplies the first-party conversion counts that set your sample size, and with it the power your test can achieve; the α and β trade-off stays your decision.
Common mistakes
- Treating a non-significant result as proof of no effect (ignoring type II error).
- Setting α very low without checking what it does to power.
- Forgetting that running many tests inflates the chance of at least one type I error.
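The last mistake is easy to quantify. Assuming the tests are independent and every null is true (a simplification; real metrics and segments are often correlated), the chance of at least one false positive across m tests is 1 − (1 − α)^m. The helper names below are illustrative only.

```python
def family_wise_error_rate(alpha, num_tests):
    """Chance of at least one type I error across independent true-null tests."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(alpha, num_tests):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        uncorrected = family_wise_error_rate(0.05, m)
        corrected = family_wise_error_rate(bonferroni_alpha(0.05, m), m)
        print(f"{m:2d} tests: uncorrected={uncorrected:.3f} bonferroni={corrected:.3f}")
```

With ten independent tests at α = 0.05, the chance of at least one false positive is about 0.40; dividing α across the tests brings the family-wise rate back below 0.05.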
Privacy and accuracy notes
Error rates are properties of aggregate test statistics, not individuals. No personal data is needed to reason about α and β.
Related pages
- Statistical power
Power is the probability that a test correctly rejects the null when a true effect of a stated size exists: power = 1 − β. It rises with sample size, with the size of the effect you want to catch, and with a looser significance threshold; it falls with higher metric variance. Underpowered tests waste traffic by failing to detect real wins, so power is planned before launch.
- Statistical significance and p-values
A result is 'statistically significant' when it would be unlikely if there were really no effect. The p-value is the probability of seeing data at least as extreme as yours assuming the null hypothesis is true — it is not the probability the variant is better, and not a measure of how big the effect is. Significance and practical importance are different questions.
- Multiple comparisons correction
When you run many tests at once — multiple variants, multiple metrics, many segments — the chance that at least one shows a false positive grows with the number of comparisons. Multiple-comparisons corrections counter this: the Bonferroni method controls the family-wise error rate by dividing α across tests, while the Benjamini-Hochberg procedure controls the false discovery rate, trading some power for fewer false 'wins'.
- WebmasterID docs
How conversion events feed your own analysis.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.