P-value misconceptions
The p-value is one of the most misread numbers in experimentation. It is the probability of seeing data at least as extreme as observed if the null hypothesis were true. It is not the probability that the null is true, not the probability that the result is a fluke, and not a measure of effect size. The American Statistical Association issued a formal statement in 2016 warning against exactly these misreadings.
What this means
Formally, a p-value is the probability, computed under a specified statistical model and the null hypothesis, of obtaining a result equal to or more extreme than what was observed. That is a statement about the data given the null — not about the null given the data. Inverting it ('there is a 5% chance the null is true') is a logical error, not a conservative rounding.
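The definition above can be made concrete with a small sketch. The coin-flip scenario, the function name binomial_tail_p_value, and the numbers (60 heads in 100 flips) are illustrative assumptions, not taken from any real experiment. The code simply adds up the probability of every outcome at least as extreme as the observed one, assuming the null (a fair coin) is true.

```python
from math import comb


def binomial_tail_p_value(successes: int, trials: int, null_rate: float) -> float:
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, null_rate).

    This is a statement about the data given the null, computed by summing
    the null-model probability of every outcome at least as extreme as the
    one observed.
    """
    return sum(
        comb(trials, k) * null_rate**k * (1 - null_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical observation: 60 heads in 100 flips, null hypothesis "the coin is fair".
p = binomial_tail_p_value(60, 100, 0.5)
print(f"P(at least 60 heads in 100 flips | fair coin) = {p:.4f}")
```

Note what the function never receives: any information about how plausible the null was to begin with. That is why its output cannot be read as the probability that the null is true.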
What the ASA warned against
The American Statistical Association's 2016 statement set out principles every experimenter should internalise: p-values do not measure the probability that the hypothesis is true or that the data were produced by chance alone; conclusions should not be based only on whether a p-value passes a threshold; and a p-value does not measure the size or importance of an effect. Proper inference needs context, effect sizes, and design quality, not a single number.
In practice this means a 'significant' result on a tiny, meaningless effect can be worthless, and a non-significant result is not proof of no effect. In short, a p-value is:
- Not the probability the null is true
- Not the probability the result is a fluke
- Not a measure of effect size or business value
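The first point can be illustrated with Bayes' rule. The sketch below is a hedged illustration, not a recommended analysis method: the share of tested ideas that truly have no effect (prior_null) and the test's power are hypothetical inputs that a p-value alone never supplies. It computes what fraction of 'significant' results come from true nulls under those assumptions.

```python
def share_of_true_nulls_among_significant(
    prior_null: float, alpha: float, power: float
) -> float:
    """Fraction of significant results that are false positives.

    prior_null: assumed share of experiments where the null is actually true
    alpha:      significance threshold (false-positive rate under the null)
    power:      assumed chance of reaching significance when an effect is real
    """
    false_positives = prior_null * alpha
    true_positives = (1 - prior_null) * power
    return false_positives / (false_positives + true_positives)


# Same threshold and power, two hypothetical base rates of true nulls.
for prior_null in (0.5, 0.9):
    share = share_of_true_nulls_among_significant(prior_null, alpha=0.05, power=0.8)
    print(f"{prior_null:.0%} of ideas are null -> "
          f"{share:.0%} of significant results are false positives")
```

With the threshold fixed at 0.05, the answer changes as the assumed base rate changes, which is exactly why 'p < 0.05' cannot be translated into 'a 5% chance the null is true'.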
How it appears in analytics and logs
A small p-value (0.01, say) means the data would be unusual if there were no effect. It does not tell you the probability that the effect is real, how large the effect is, or whether the result matters commercially.
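As a rough sketch of how this plays out with event counts, the example below runs a pooled two-proportion z-test (one common choice among several, using a normal approximation) on two made-up experiments. All counts and the function name two_proportion_z_test are hypothetical. The first experiment has enormous traffic and a tiny lift; the second has little traffic and a much larger observed lift.

```python
from math import erfc, sqrt


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (absolute difference in rates, z statistic, two-sided p-value).

    Uses the pooled standard error and a normal approximation, which is
    reasonable only when counts are not too small.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail area of the standard normal
    return rate_b - rate_a, z, p_value


# Hypothetical counts: huge sample, tiny lift (5.000% vs 5.075%).
diff_big, z_big, p_big = two_proportion_z_test(100_000, 2_000_000, 101_500, 2_000_000)

# Hypothetical counts: small sample, large observed lift (5% vs 8%).
diff_small, z_small, p_small = two_proportion_z_test(10, 200, 16, 200)

print(f"large test: lift = {diff_big:+.5f}, p = {p_big:.4f}")
print(f"small test: lift = {diff_small:+.5f}, p = {p_small:.4f}")
```

The large test comes out significant on a lift that may not matter commercially, while the small test does not reach significance despite a much bigger observed difference. The p-value reflects sample size as much as effect size, so it has to be read next to the lift itself.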
Diagnostic use case
Read p-values correctly so you do not over- or under-state evidence: a small p-value flags surprise under the null, nothing more, and must be paired with effect size and context.
What WebmasterID can help detect
WebmasterID provides the first-party event counts a significance test consumes. Interpreting the resulting p-value correctly is the analyst's job, and this page is meant to support it.
Common mistakes
- Reading p as 'probability the result is wrong'.
- Treating a p-value just above the threshold as proof of no effect.
- Reporting significance without an effect size.
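One way to avoid the last mistake is to report the observed difference with an interval around it. The sketch below assumes a simple Wald interval for a difference in conversion rates with a 1.96 critical value (roughly 95% coverage under the normal approximation). The counts and the function name lift_with_interval are hypothetical, and other interval methods may suit small samples better.

```python
from math import sqrt


def lift_with_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                       z_crit: float = 1.96):
    """Return (difference in rates, lower bound, upper bound).

    Wald interval with an unpooled standard error; z_crit = 1.96 gives an
    approximate 95% confidence interval.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    diff = rate_b - rate_a
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    return diff, diff - z_crit * se, diff + z_crit * se


# Hypothetical counts: 500 of 10,000 convert in control, 560 of 10,000 in the variant.
diff, low, high = lift_with_interval(500, 10_000, 560, 10_000)
print(f"lift = {diff:+.2%} (approx. 95% CI {low:+.2%} to {high:+.2%})")
```

An interval like this shows both the plausible size of the effect and the uncertainty around it, which a bare 'significant' or 'not significant' label hides.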
Privacy and accuracy notes
P-values are computed from aggregate counts, not personal data. This page is educational and not a substitute for a statistician on high-stakes decisions.
Related pages
- Statistical significance and p-values
A result is 'statistically significant' when it would be unlikely if there were really no effect. The p-value is the probability of seeing data at least as extreme as yours assuming the null hypothesis is true — it is not the probability the variant is better, and not a measure of how big the effect is. Significance and practical importance are different questions.
- Frequentist vs Bayesian experiment analysis
Frequentist and Bayesian are two coherent ways to analyse the same experiment data. Frequentist methods ask how likely the observed data is under a null hypothesis and report p-values and confidence intervals. Bayesian methods combine a prior with the data to report posterior probabilities and credible intervals. Each has assumptions and failure modes; neither is universally 'correct'.
- The peeking problem in A/B tests
The peeking problem is checking an experiment over and over and stopping the moment it crosses significance. Because each look is another chance for noise to cross the threshold, repeated peeking inflates the false-positive rate well above the nominal level. The fixes are a pre-set sample size or a sequential method designed for continuous monitoring.
- Minimum detectable effect (MDE)
The minimum detectable effect (MDE) is the smallest change in your metric that an experiment is set up to detect reliably. It is an input you choose, not an output: a smaller MDE demands more traffic. Setting the MDE to the smallest difference that would actually matter to the business keeps experiments honestly sized.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.