Multiple comparisons correction
When you run many tests at once (multiple variants, multiple metrics, many segments), the chance that at least one shows a false positive grows with the number of comparisons. Multiple-comparisons corrections counter this. The Bonferroni method controls the family-wise error rate by dividing α across tests; the Benjamini-Hochberg procedure controls the false discovery rate instead. Both trade some power for fewer false 'wins', Bonferroni more so.
Why many tests inflate false positives
With one test at α = 0.05 there is a 5% chance of a false positive when no real effect exists. Run 20 independent comparisons at the same level and the chance that at least one is a false positive rises to 1 − 0.95^20 ≈ 64%. This is the multiple-comparisons problem: the more questions you ask of the same data, the more likely pure noise produces a 'winner' somewhere.
- Family-wise error grows with the number of comparisons
- Comes from multiple arms, metrics, or segment slices
- An uncorrected p-value overstates confidence in any single finding
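The arithmetic above can be reproduced with a short sketch. It assumes the comparisons are independent and that no real effect exists in any of them; the function name `family_wise_error_rate` is illustrative, not part of any library.

```python
def family_wise_error_rate(alpha: float, m: int) -> float:
    """Chance of at least one false positive across m independent
    tests, each run at significance level alpha (assumes no true effects)."""
    return 1.0 - (1.0 - alpha) ** m


# How quickly the family-wise rate grows as comparisons are added.
for m in (1, 5, 20, 100):
    print(f"m={m:>3}: FWER = {family_wise_error_rate(0.05, m):.1%}")
```

With correlated comparisons (overlapping segments, related metrics) the true rate is typically lower than this formula gives, so treat the result as an upper-end guide rather than an exact figure.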
Two correction families
Bonferroni controls the family-wise error rate by testing each comparison at α/m for m comparisons. It is simple but conservative, and it can sacrifice considerable power when m is large. Benjamini-Hochberg instead controls the false discovery rate (the expected fraction of declared 'discoveries' that are false). It is less strict and keeps more power, which suits analyses where some false positives are tolerable. Pre-register how many comparisons you will make so the correction is honest.
Deciding up front which primary and guardrail metrics count keeps m small.
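To make the difference between the two families concrete, here is a minimal sketch of both procedures in plain Python. The function names and the example p-values are made up for illustration. The Benjamini-Hochberg version is the standard step-up rule, whose false discovery rate guarantee holds for independent (or positively dependent) tests.

```python
from typing import List


def bonferroni(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Reject a hypothesis only if its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Step-up rule: find the largest rank k whose sorted p-value is at
    most (k / m) * alpha, then reject the k smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank  # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from eight comparisons in one experiment.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print("Bonferroni rejections:", sum(bonferroni(p_values)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p_values)))
```

On this made-up input Bonferroni keeps one result and Benjamini-Hochberg keeps two, which illustrates the power difference. In practice an established statistics library is a safer choice than hand-rolled code.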
How it appears in analytics and logs
An uncorrected 'significant' result picked from many comparisons is much more likely to be a false positive than its nominal p-value suggests.
Diagnostic use case
Apply a correction whenever a single experiment evaluates several arms, metrics, or segments, so the overall false-positive rate stays at the level you intended.
What WebmasterID can help detect
WebmasterID supplies the per-arm and per-segment conversion counts; applying a correction across them is your analysis choice.
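As a rough sketch of that analysis step, the snippet below compares each variant with a control using a pooled two-proportion z-test and then applies a Bonferroni threshold. Everything here is an assumption for illustration: the counts are invented, the `(conversions, visitors)` tuples are not WebmasterID's export format, and the z-test is just one common choice of test.

```python
import math


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Hypothetical per-arm counts: (conversions, visitors).
control = (200, 4000)
variants = {"B": (230, 4000), "C": (260, 4000), "D": (205, 4000)}

alpha = 0.05
m = len(variants)  # one comparison per variant against control
results = {}
for name, (conv, n) in variants.items():
    p = two_proportion_p_value(control[0], control[1], conv, n)
    results[name] = (p, p <= alpha / m)  # Bonferroni: test at alpha / m
    print(f"{name}: p = {p:.4f}, significant after correction: {results[name][1]}")
```

If the same experiment also tracks several metrics or segments, m should count those comparisons too, not only the number of arms.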
Common mistakes
- Slicing into many segments and reporting the one that 'won' without correction.
- Adding metrics mid-test without accounting for the extra comparisons.
- Using Bonferroni on hundreds of tests, which pushes the per-test threshold so low that almost no real effect can pass.
Privacy and accuracy notes
Corrections operate on aggregate test statistics, not individuals. No personal data is required.
Related pages
- Type I and type II errors
Every test can be wrong two ways. A type I error (false positive) declares a difference when none exists; its rate is the significance level α you choose. A type II error (false negative) misses a real difference; its rate is β, and 1−β is statistical power. Lowering one rate, holding sample size fixed, usually raises the other — the trade-off you manage when designing a test.
- Pitfalls of segmenting test results
Segmenting experiment results by device, country, or source is useful, but slicing a non-significant test until some segment 'wins' is a recipe for false positives. Each extra segment is another comparison; with enough slices a spurious hit becomes almost certain. Legitimate segment analysis is pre-planned or corrected for multiplicity. This page separates honest segmentation from data dredging.
- Primary vs secondary metrics in tests
Every experiment should name a single primary metric that determines the decision, and a small set of secondary metrics that add context. The distinction matters statistically: testing many metrics inflates the chance one moves by luck, so the decision must rest on the pre-chosen primary. This page explains the roles and the multiple-comparisons risk.
- WebmasterID docs
How conversion events feed your own analysis.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.