Designing an experiment hypothesis
Before running an A/B test you write a hypothesis: a falsifiable statement linking a specific change to an expected effect on a named metric, for a defined audience, with a rationale. A good hypothesis fixes the success metric in advance, which prevents post-hoc metric shopping. This page covers the structure of a hypothesis and the reasoning behind it.
What a hypothesis contains
A complete experiment hypothesis has four parts: the evidence or rationale that motivates it, the specific change being made, the expected directional effect on one primary metric, and the audience it applies to. Stating the direction (increase or decrease) in advance is what makes the test falsifiable.
Without a committed metric, a flat overall result invites cherry-picking a segment that happened to move — which is how false findings enter a roadmap.
- Rationale: what evidence motivates the change
- Change: the precise variant being shipped
- Primary metric and expected direction
- Audience: who is eligible for the test
Why fixing the metric matters
Choosing the primary metric before data arrives is the core discipline. If you decide what counts as success only after seeing results, every test 'wins' on something, because random variation guarantees some metric moved. Pre-registration of the metric and direction is the standard guard against this.
Connecting to design choices
The hypothesis drives downstream design: it determines the minimum detectable effect worth chasing, the sample size, and the guardrail metrics you watch for harm. A hypothesis that targets a tiny effect on a low-traffic page may be impractical to test at all — better to learn that before launching.
How it appears in analytics and logs
A vague hypothesis ('let's try a new button') gives no way to interpret a flat or negative result. A specific one names the metric and direction in advance, so the readout is unambiguous.
Diagnostic use case
Write each experiment as 'because [evidence], changing [X] will [increase/decrease] [metric] for [audience]' so the test can falsify a claim, not just observe.
What WebmasterID can help detect
WebmasterID's first-party events let you define the success metric for a hypothesis from real funnel steps, so the metric you commit to is one you can actually measure cleanly.
Common mistakes
- Writing 'let's try X' with no metric or expected direction.
- Choosing the success metric after seeing the results.
- Stating a hypothesis you cannot power given available traffic.
Privacy and accuracy notes
Hypotheses describe aggregate behaviour, not individuals. Designing a test needs no personal data — only the metric definition and the audience segment.
Related pages
- Primary vs secondary metrics in tests
Every experiment should name a single primary metric that determines the decision, and a small set of secondary metrics that add context. The distinction matters statistically: testing many metrics inflates the chance one moves by luck, so the decision must rest on the pre-chosen primary. This page explains the roles and the multiple-comparisons risk.
- Minimum detectable effect (MDE)
The minimum detectable effect (MDE) is the smallest change in your metric that an experiment is set up to detect reliably. It is an input you choose, not an output: a smaller MDE demands more traffic. Setting the MDE to the smallest difference that would actually matter to the business keeps experiments honestly sized.
- Guardrail metrics in experiments
Guardrail metrics are the secondary measures you monitor during an experiment to make sure a change that improves the primary metric does not quietly damage something important — load time, retention, refunds, support load. They turn 'did the target go up' into the fuller question 'did the target go up without breaking anything'.
- Event Explorer
Define a hypothesis metric from real funnel events.
Sources and verification notes
- Wikipedia — Statistical hypothesis testingFalsifiable hypotheses and pre-specified metrics.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.