Simpson’s paradox in experiments
Simpson's paradox is when an effect that holds within every subgroup reverses or vanishes once the subgroups are pooled. In experiments it appears when the mix of traffic differs between arms — so the aggregate is driven by composition, not the change. It is a vivid reason to check segments and to ensure arms are comparable. This page explains how it arises and how to avoid being fooled.
How the reversal happens
Simpson's paradox arises from a lurking variable that influences both group membership and the outcome. If one arm happens to draw more high-converting traffic (say more returning visitors), it can look better overall even if the change itself helped no one — the aggregate reflects who was in each arm, not what the change did.
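A small worked example can make this concrete. The sketch below uses invented counts (the segment names, visitor numbers, and conversion numbers are all assumptions chosen for illustration): both arms convert at exactly the same rate within each segment, yet the variant looks far better when pooled because it received more returning visitors.

```python
# Hypothetical counts per arm and segment: (visitors, conversions).
# Both arms convert at 2% for new and 10% for returning visitors.
counts = {
    "control": {"new": (8000, 160), "returning": (2000, 200)},
    "variant": {"new": (5000, 100), "returning": (5000, 500)},
}


def segment_rates(arm):
    """Conversion rate within each segment of one arm."""
    return {seg: c / v for seg, (v, c) in counts[arm].items()}


def pooled_rate(arm):
    """Conversion rate of one arm with all segments lumped together."""
    visitors = sum(v for v, _ in counts[arm].values())
    conversions = sum(c for _, c in counts[arm].values())
    return conversions / visitors


for arm in counts:
    print(arm, segment_rates(arm), round(pooled_rate(arm), 4))
```

With these numbers the per-segment rates are identical across arms, while the pooled rates differ (3.6% against 6.0%). The whole gap comes from the traffic mix.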
Why randomisation usually prevents it
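Proper random assignment makes the traffic mix the same in each arm in expectation, so with a reasonable sample size composition cannot systematically drive the result. The paradox tends to surface when assignment or measurement is broken, for example by biased redirects or bot filtering that hits arms unevenly, both of which often also show up as sample ratio mismatch. Post-hoc segment slicing is a related risk: segments defined after the fact, especially on behaviour the change itself can influence, may no longer be comparable between arms.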
- Caused by an uneven mix between arms
- Randomisation balances the mix in expectation and usually prevents it
- Broken assignment, often visible as SRM, is a common cause
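One way to check for broken assignment is a chi-square goodness-of-fit test on the arm sizes. The sketch below assumes a two-arm test and uses only the standard library. The function name `srm_p_value` and the 0.001 alert threshold mentioned afterwards are illustrative assumptions, not taken from any particular tool.

```python
import math


def srm_p_value(control_n, variant_n, expected_control_share=0.5):
    """P-value of a chi-square goodness-of-fit test for a two-arm split.

    With two arms the statistic has one degree of freedom, so the upper
    tail probability can be written as erfc(sqrt(chi2 / 2)).
    """
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (variant_n - expected_variant) ** 2 / expected_variant
    )
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical arm sizes for a planned 50/50 split.
print(srm_p_value(5000, 5050))  # small gap, consistent with chance
print(srm_p_value(5000, 5400))  # large gap, worth investigating
```

A very small p-value (many teams use a strict cut-off such as 0.001, which is an assumption here) suggests the split is off by more than chance allows, so the assignment pipeline should be checked before any result is read.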
Reading segments without being fooled
The cure is not to pick whichever number you like. Check whether the arms are balanced; if they are, trust the overall result. If they are not, the imbalance itself is the bug to fix before drawing any conclusion — neither the pooled nor the sliced number is trustworthy until arms are comparable.
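The balance check can be sketched as a comparison of each segment's share of traffic between arms. The example below is a minimal illustration with invented visitor counts: it computes a two-proportion z-score for one segment's share, and the helper names are hypothetical. It is one reasonable approach among several, not a prescribed method.

```python
import math


def segment_shares(visitors_by_segment):
    """Fraction of an arm's visitors that falls in each segment."""
    total = sum(visitors_by_segment.values())
    return {seg: n / total for seg, n in visitors_by_segment.items()}


def share_gap_z(control, variant, segment):
    """Two-proportion z-score for one segment's share in each arm."""
    n_c, n_v = sum(control.values()), sum(variant.values())
    p_c, p_v = control[segment] / n_c, variant[segment] / n_v
    pooled = (control[segment] + variant[segment]) / (n_c + n_v)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_v))
    if se == 0:
        return 0.0  # segment is absent or universal in both arms
    return (p_v - p_c) / se


# Hypothetical visitor counts per segment.
control = {"new": 8000, "returning": 2000}
variant = {"new": 5000, "returning": 5000}

print(segment_shares(control), segment_shares(variant))
print(share_gap_z(control, variant, "returning"))
```

A z-score far from zero means the arms received noticeably different mixes, which points to an assignment or filtering problem rather than to an effect of the change.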
How it appears in analytics and logs
A variant that wins in every segment yet loses overall signals that the arms have different audience mixes — the aggregate is a weighting artefact, not the true effect.
Diagnostic use case
When an overall result conflicts with consistent per-segment results, check whether the traffic mix differs between arms before trusting either number.
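This diagnostic can be automated with a simple sign comparison. The sketch below flags a test when every segment moves in one direction but the pooled result moves in the other. The function name, data layout, and counts are assumptions for illustration, and the check ignores statistical noise, so it is a prompt to inspect the traffic mix rather than a verdict.

```python
def lift_signs_conflict(control, variant):
    """True if all segment lifts share one sign and the pooled lift has the other.

    Each argument maps segment name -> (visitors, conversions).
    """
    def pooled_rate(arm):
        visitors = sum(v for v, _ in arm.values())
        conversions = sum(c for _, c in arm.values())
        return conversions / visitors

    segment_lifts = [
        variant[seg][1] / variant[seg][0] - control[seg][1] / control[seg][0]
        for seg in control
    ]
    pooled_lift = pooled_rate(variant) - pooled_rate(control)

    all_up = all(lift > 0 for lift in segment_lifts)
    all_down = all(lift < 0 for lift in segment_lifts)
    return (all_up and pooled_lift < 0) or (all_down and pooled_lift > 0)


# Hypothetical counts: the variant wins in both segments but loses overall
# because most of its traffic is low-converting new visitors.
control = {"new": (2000, 40), "returning": (8000, 800)}
variant = {"new": (8000, 200), "returning": (2000, 220)}
print(lift_signs_conflict(control, variant))
```

When the function returns True, the next step is the mix comparison described above, not a choice between the pooled and segmented numbers.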
What WebmasterID can help detect
WebmasterID's segmentation over first-party events lets you compare results within and across segments, so a composition-driven reversal is visible rather than hidden in the total.
Common mistakes
- Picking the pooled or segmented number that flatters the variant.
- Ignoring that arms have different audience compositions.
- Slicing into segments after the fact to manufacture a win.
Privacy and accuracy notes
Detecting the paradox uses aggregate counts per segment, not individual records. No personal identifiers are required to spot it.
Related pages
- Confounding variables in conversion
A confounding variable is a third factor that affects both the thing you changed and the outcome you measured, producing a spurious association. Confounders are why 'we shipped X and conversions rose' is weak evidence — a campaign, a season, or a price change could be the real cause. Randomised experiments neutralise confounders by design. This page explains the concept and the defence.
- Sample ratio mismatch (SRM)
Sample ratio mismatch (SRM) is when the observed allocation of users to experiment arms diverges from the planned ratio by more than chance allows — for example a 50/50 test that lands far from 50/50. It signals a bug in assignment, logging, or filtering, and a test with SRM should not be trusted regardless of how good the headline result looks.
- Pitfalls of segmenting test results
Segmenting experiment results — by device, country, source — is useful, but slicing a non-significant test until some segment 'wins' is a recipe for false positives. Each extra segment is another comparison; enough slices guarantee a spurious hit. Legitimate segment analysis is pre-planned or corrected for multiplicity. This page separates honest segmentation from data dredging.
- Segmentation for conversion analysis
Segmentation divides visitors into groups — by source, device, geography, or behaviour — so you can compare conversion within comparable cohorts. A single blended conversion rate can hide that one segment converts well and another barely at all. The discipline is choosing segments that answer a question without slicing so finely that each group becomes noise.
Sources and verification notes
- Wikipedia: Simpson’s paradox. Trend reversal on aggregation and lurking variables.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.