One-tailed vs two-tailed tests
A two-tailed test asks whether the variant differs from control in either direction and splits α across both tails. A one-tailed test puts all of α on a single direction, so it is more sensitive to an effect that way — but blind to a move the other way, including the variant being worse. Because variants can hurt as well as help, two-tailed is the conservative default for conversion experiments.
Where α goes
With α = 0.05, a two-tailed test places 0.025 in each tail and declares significance when the variant lands far enough above or below control to fall in either tail. A one-tailed test places the full 0.05 in one tail, so it reaches significance with a smaller observed effect, but only in the direction you chose; a move the other way is treated as 'no effect'. That extra sensitivity is paid for by giving up the ability to detect the opposite outcome.
- Two-tailed: detects a difference in either direction
- One-tailed: more sensitive one way, blind the other
- Switching to one-tailed after seeing data is p-hacking
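The threshold arithmetic can be sketched in Python with the standard library's `statistics.NormalDist`. This is a minimal sketch that assumes a large-sample z statistic (approximately standard normal when there is no true effect); the helper `significant` and its `alternative` labels are hypothetical names chosen for illustration, not part of any WebmasterID API.

```python
from statistics import NormalDist

ALPHA = 0.05
STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

# Two-tailed: alpha is split, 0.025 in each tail.
Z_TWO_TAILED = STD_NORMAL.inv_cdf(1 - ALPHA / 2)
# One-tailed: all of alpha sits in the single pre-chosen tail.
Z_ONE_TAILED = STD_NORMAL.inv_cdf(1 - ALPHA)


def significant(z: float, alternative: str) -> bool:
    """Hypothetical helper: is the z statistic significant at ALPHA?"""
    if alternative == "two-sided":
        # A large move in either direction counts.
        return abs(z) >= Z_TWO_TAILED
    if alternative == "greater":
        # Only an improvement counts; any negative z is 'not significant'.
        return z >= Z_ONE_TAILED
    raise ValueError(f"unsupported alternative: {alternative!r}")


print(f"two-tailed critical |z|: {Z_TWO_TAILED:.3f}")  # 1.960
print(f"one-tailed critical z:   {Z_ONE_TAILED:.3f}")  # 1.645
```

With these thresholds, z = 1.8 clears the one-tailed bar but not the two-tailed one, while z = -3.0 is flagged by the two-tailed test and ignored by a one-tailed test aimed at improvement.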
Why two-tailed is the default
In a conversion test a new design can plausibly reduce conversions, not just raise them. A two-tailed test will catch that harm; a one-tailed test pointed at 'improvement' will not. Choosing one-tailed only to lower the bar for significance is a form of p-hacking. Reserve one-tailed tests for situations where the opposite direction is impossible or genuinely irrelevant, and fix the choice before collecting data.
State the tail choice in the experiment plan.
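Why choosing the tail after seeing the data counts as p-hacking can be checked with a small simulation. This is a sketch under stated assumptions: it simulates A/A tests (no true effect) and assumes the test statistic is a standard normal z; `false_positive_rates` is a hypothetical helper, not a library function.

```python
import random
from statistics import NormalDist

ALPHA = 0.05
Z_TWO_TAILED = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.960
Z_ONE_TAILED = NormalDist().inv_cdf(1 - ALPHA)      # about 1.645


def false_positive_rates(trials=200_000, seed=1):
    """Hypothetical helper: simulate A/A tests (no true effect).

    With no real difference, z is standard normal, so every
    'significant' result counted here is a false positive.
    """
    rng = random.Random(seed)
    planned_hits = peeked_hits = 0
    for _ in range(trials):
        z = rng.gauss(0.0, 1.0)
        # Tail fixed before launch: a two-tailed test.
        if abs(z) >= Z_TWO_TAILED:
            planned_hits += 1
        # Tail chosen after peeking: a one-tailed test in whichever
        # direction the data already leans.
        if z > 0:
            peeked_significant = z >= Z_ONE_TAILED
        else:
            peeked_significant = z <= -Z_ONE_TAILED
        if peeked_significant:
            peeked_hits += 1
    return planned_hits / trials, peeked_hits / trials


planned_rate, peeked_rate = false_positive_rates()
print(f"tail fixed in advance:  {planned_rate:.3f}")  # close to 0.05
print(f"tail picked after data: {peeked_rate:.3f}")   # close to 0.10
```

The pre-registered test rejects about 5% of the time, as intended. Picking the tail after looking roughly doubles that to about 10%, because each tail now carries the full 0.05 and the analyst always uses the one the data already points to.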
How it appears in analytics and logs
A one-tailed test aimed at improvement has no power to flag harm in the other direction: a variant that lowers conversion simply comes back 'not significant', which can hide a real drop.
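The blind spot is easy to see by running the same conversion counts through both tests. A minimal sketch, assuming a pooled two-proportion z-test (a common large-sample choice, not necessarily the test your tooling uses); the counts are made up for illustration and `two_proportion_z_test` is a hypothetical helper.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_control, n_control, conv_variant, n_variant):
    """Hypothetical helper: pooled two-proportion z-test.

    Returns (z, two-sided p-value, one-sided p-value for 'variant is better').
    """
    p_control = conv_control / n_control
    p_variant = conv_variant / n_variant
    # Pooled rate under the null hypothesis of no difference.
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (p_variant - p_control) / se
    cdf = NormalDist().cdf
    p_two_sided = 2 * (1 - cdf(abs(z)))
    p_greater = 1 - cdf(z)  # only looks for an improvement
    return z, p_two_sided, p_greater


# Made-up counts: the variant converts worse (4.2% vs 5.0%).
z, p_two, p_greater = two_proportion_z_test(500, 10_000, 420, 10_000)
print(f"z = {z:.2f}")
print(f"two-tailed p = {p_two:.4f}")
print(f"one-tailed ('greater') p = {p_greater:.4f}")
```

With these counts z is about -2.70, the two-tailed p-value is well below 0.05, and the one-tailed p-value for 'variant is better' is close to 1: the one-tailed analysis reports no win and says nothing about the harm.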
Diagnostic use case
Default to two-tailed so a harmful variant is detected; reserve one-tailed for the rare case where only one direction is possible or relevant.
What WebmasterID can help detect
WebmasterID supplies the conversion counts; whether you test one or two tails is your analysis decision, made before launch.
Common mistakes
- Switching to one-tailed after the data leans your way to reach significance.
- Using one-tailed and missing a variant that actually hurts conversion.
- Not stating the tail choice in the experiment plan.
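One way to make the tail choice part of the plan rather than the analysis is to record it in an object that cannot be changed after launch. A minimal sketch: `ExperimentPlan`, its fields, and the rule that a one-tailed plan must carry a written rationale are hypothetical conventions for illustration, not WebmasterID features. A frozen dataclass only guards against accidental reassignment; it is not a security control.

```python
from dataclasses import dataclass
from statistics import NormalDist


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical pre-registration record, written before any data is collected."""
    name: str
    alpha: float = 0.05
    alternative: str = "two-sided"  # or "greater" / "less"
    rationale: str = ""

    def __post_init__(self):
        if self.alternative not in ("two-sided", "greater", "less"):
            raise ValueError(f"unknown alternative: {self.alternative!r}")
        # Hypothetical rule: one-tailed plans must say why the other
        # direction is impossible or irrelevant.
        if self.alternative != "two-sided" and not self.rationale:
            raise ValueError("a one-tailed plan needs a recorded rationale")


def p_value(z: float, plan: ExperimentPlan) -> float:
    """P-value using the tail fixed in the plan, not one chosen after seeing z."""
    cdf = NormalDist().cdf
    if plan.alternative == "two-sided":
        return 2 * (1 - cdf(abs(z)))
    if plan.alternative == "greater":
        return 1 - cdf(z)
    return cdf(z)  # "less"


plan = ExperimentPlan(name="checkout-redesign")
print(plan.alternative)  # two-sided
try:
    plan.alternative = "greater"  # frozen: reassignment raises
except AttributeError as err:  # FrozenInstanceError subclasses AttributeError
    print("tail is locked:", err)
```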
Privacy and accuracy notes
Tail choice is a property of the test statistic, not of any individual's data.
Related pages
- Statistical significance and p-values
A result is 'statistically significant' when it would be unlikely if there were really no effect. The p-value is the probability of seeing data at least as extreme as yours assuming the null hypothesis is true — it is not the probability the variant is better, and not a measure of how big the effect is. Significance and practical importance are different questions.
- P-value misconceptions
The p-value is one of the most misread numbers in experimentation. It is the probability of seeing data at least as extreme as observed if the null hypothesis were true; it is not the probability that the null is true, not the probability that the result is a fluke, and not a measure of effect size. The American Statistical Association's 2016 statement on p-values addresses these misconceptions directly.
- Guardrail metrics in experiments
Guardrail metrics are the secondary measures you monitor during an experiment to make sure a change that improves the primary metric does not quietly damage something important — load time, retention, refunds, support load. They turn 'did the target go up' into the fuller question 'did the target go up without breaking anything'.
- WebmasterID docs
How conversion events feed your own analysis.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.