Pitfalls of segmenting test results
Segmenting experiment results by device, country, or source is useful, but slicing a non-significant test until some segment 'wins' is a recipe for false positives. Each extra segment is another comparison, and with enough slices a spurious hit becomes close to certain. Legitimate segment analysis is pre-planned or corrected for multiplicity. This page separates honest segmentation from data dredging.
Every slice is a new test
When an overall result is flat, splitting it into segments multiplies the number of comparisons. With enough segments, ordinary noise will very likely push at least one across the significance line, even when the change has no effect anywhere. Reporting that one as 'the test worked for mobile users' is data dredging, not a finding.
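To put a number on this, the sketch below computes the chance of at least one false positive across several segments. It assumes the segments are independent, the change has no real effect in any of them, and each is tested at the same threshold; the function name `familywise_error_rate` is illustrative, not part of any tool.

```python
def familywise_error_rate(num_segments: int, alpha: float = 0.05) -> float:
    """Chance that at least one of `num_segments` independent comparisons
    crosses the significance threshold when no real effect exists."""
    if num_segments < 0:
        raise ValueError("num_segments must be non-negative")
    # Probability that every comparison stays below the line is (1 - alpha)^k;
    # the complement is the chance of one or more spurious hits.
    return 1.0 - (1.0 - alpha) ** num_segments


for k in (1, 5, 10, 20):
    print(f"{k:>2} segments: {familywise_error_rate(k):.1%} chance of a spurious win")
```

Under these assumptions, twenty segments at a 0.05 threshold give roughly a 64% chance that something 'wins' by noise alone. Real segments often overlap, so the exact figure will differ, but the direction is the same.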
Honest segmentation
Segment analysis is legitimate when the segments are specified before the data is seen, or when you correct for the number of comparisons made. A pre-registered hypothesis that a change helps mobile specifically is testable; a hunt through twenty segments for any winner tests nothing, because some winner is expected by chance. The difference is whether the segment was a prediction or a discovery.
- Pre-register the segments you will examine
- Correct for multiplicity when slicing many ways
- Retest post-hoc segment wins before believing them
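One common way to correct for multiplicity is the Holm-Bonferroni procedure, sketched below. This is one option among several, not a method this page prescribes, and the segment names and p-values are hypothetical values chosen for illustration.

```python
def holm_bonferroni(p_values: dict[str, float], alpha: float = 0.05) -> dict[str, bool]:
    """Return, per segment, whether its p-value survives Holm-Bonferroni correction."""
    ordered = sorted(p_values.items(), key=lambda item: item[1])
    total = len(ordered)
    significant = {name: False for name in p_values}
    for rank, (name, p) in enumerate(ordered):
        # The smallest p-value faces alpha / m, the next alpha / (m - 1), and so on.
        if p <= alpha / (total - rank):
            significant[name] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return significant


# Hypothetical p-values from slicing a flat test five ways.
segment_p_values = {"mobile": 0.03, "desktop": 0.40, "tablet": 0.62,
                    "organic": 0.21, "paid": 0.55}
print(holm_bonferroni(segment_p_values))
```

In this example the mobile segment looks significant on its own (0.03 < 0.05), but with five comparisons the smallest p-value has to clear 0.05 / 5 = 0.01, so no segment survives the correction.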
Spurious segments waste roadmap
Acting on a false segment win is costly twice: you ship something that doesn't help, and you may build a personalisation strategy on a mirage. Treating surprising segment results as hypotheses for a fresh, powered test is the discipline that keeps segmentation trustworthy.
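For the fresh, powered retest, a rough sample size can be estimated before launch. The sketch below uses the standard normal-approximation formula for comparing two conversion rates; the function name `users_per_arm` and the example rates are assumptions for illustration, and a dedicated power calculator may give slightly different numbers.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate: float, expected_rate: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed in each arm to detect the given difference
    with a two-sided test on two proportions (normal approximation)."""
    if baseline_rate == expected_rate:
        raise ValueError("rates must differ to define an effect size")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical retest: the segment converted at 10% and the post-hoc
# result suggested 12%. Count users within that segment only.
print(users_per_arm(0.10, 0.12))
```

The count applies to users inside the segment being retested, which is why confirming a win in a small segment can take much longer than the original test.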
How it appears in analytics and logs
A 'win' that appears only in one unplanned segment of an otherwise flat test is most likely a multiple-comparisons artefact, not a real subgroup effect.
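A small simulation makes the artefact concrete. The sketch below runs many flat tests in which both arms share the same true conversion rate, splits each into segments, and counts how often at least one segment looks significant. All parameters (segment count, traffic, rate, seed) are illustrative assumptions, and the two-proportion z-test is one common choice rather than the only valid one.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no conversions (or all conversions) in both arms
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def share_of_flat_tests_with_a_segment_win(num_segments: int = 20,
                                           users_per_segment_arm: int = 300,
                                           rate: float = 0.10,
                                           num_tests: int = 200,
                                           alpha: float = 0.05,
                                           seed: int = 1) -> float:
    """Fraction of no-effect tests in which at least one segment has p < alpha."""
    rng = random.Random(seed)

    def conversions() -> int:
        # Both arms draw from the same rate, so any difference is pure noise.
        return sum(rng.random() < rate for _ in range(users_per_segment_arm))

    tests_with_win = 0
    for _ in range(num_tests):
        p_values = [
            two_proportion_p_value(conversions(), users_per_segment_arm,
                                   conversions(), users_per_segment_arm)
            for _ in range(num_segments)
        ]
        if min(p_values) < alpha:
            tests_with_win += 1
    return tests_with_win / num_tests


share = share_of_flat_tests_with_a_segment_win()
print(f"Flat tests with at least one 'winning' segment: {share:.0%}")
```

With twenty segments, a majority of these no-effect tests should show at least one segment 'win', which is why a lone unplanned segment result deserves suspicion before anything else.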
Diagnostic use case
Treat post-hoc segment wins as hypotheses to retest, not conclusions; pre-register the segments you care about or apply a multiple-comparisons correction.
What WebmasterID can help detect
WebmasterID's first-party segments let you analyse subgroups you defined in advance, keeping segment analysis a planned step rather than an after-the-fact fishing expedition.
Common mistakes
- Slicing a flat test until some segment crosses significance.
- Reporting an unplanned segment win as a real effect.
- Building personalisation on an unreplicated subgroup result.
Privacy and accuracy notes
Segmenting uses coarse first-party dimensions over aggregate counts. It needs no individual-level identifiers to be useful.
Related pages
- Simpson’s paradox in experiments
Simpson's paradox is when an effect that holds within every subgroup reverses or vanishes once the subgroups are pooled. In experiments it appears when the mix of traffic differs between arms — so the aggregate is driven by composition, not the change. It is a vivid reason to check segments and to ensure arms are comparable. This page explains how it arises and how to avoid being fooled.
- Primary vs secondary metrics in tests
Every experiment should name a single primary metric that determines the decision, and a small set of secondary metrics that add context. The distinction matters statistically: testing many metrics inflates the chance one moves by luck, so the decision must rest on the pre-chosen primary. This page explains the roles and the multiple-comparisons risk.
- P-value misconceptions
The p-value is one of the most misread numbers in experimentation. It is the probability of seeing data at least as extreme as observed if the null hypothesis were true — not the probability the null is true, not the probability of a fluke, and not a measure of effect size. The American Statistical Association issued a formal statement listing exactly these misconceptions.
- Segmentation for conversion analysis
Segmentation divides visitors into groups — by source, device, geography, or behaviour — so you can compare conversion within comparable cohorts. A single blended conversion rate can hide that one segment converts well and another barely at all. The discipline is choosing segments that answer a question without slicing so finely that each group becomes noise.
Sources and verification notes
- Wikipedia: Data dredging. Post-hoc subgroup hunting inflates false positives.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.