Primary vs secondary metrics in tests
Every experiment should name a single primary metric that determines the decision, and a small set of secondary metrics that add context. The distinction matters statistically: testing many metrics inflates the chance one moves by luck, so the decision must rest on the pre-chosen primary. This page explains the roles and the multiple-comparisons risk.
The roles are different
The primary metric answers a single question: should we ship this change? It is chosen before the test and is the only metric that, on its own, justifies the decision. Secondary metrics explain the why: they show how and where the change helped, and at what cost. They do not get a vote on the binary ship decision.
Why the split controls error
Each metric you test is a chance to find a 'significant' move by luck. Test ten independent metrics at a 5% threshold and, even when nothing changed, you expect 0.5 false positives on average, with roughly a 40% chance (1 − 0.95^10) that at least one metric looks significant. Designating one primary metric in advance keeps the decision honest; secondary metrics are read descriptively, not as independent significance tests.
- Primary: pre-chosen, decides ship/no-ship
- Secondary: context and diagnosis, not a verdict
- Guardrails: a special class watched for harm
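The inflation described above can be checked with a few lines of arithmetic. The sketch below assumes the metrics are statistically independent, which real metrics rarely are, so treat the numbers as an illustration rather than an exact rate for any particular test. The function names are hypothetical.

```python
def familywise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_metrics` tests.

    Assumes every null hypothesis is true and the tests are independent.
    """
    if num_metrics < 0:
        raise ValueError("num_metrics must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_metrics


def expected_false_positives(num_metrics: int, alpha: float = 0.05) -> float:
    """Average number of false positives when nothing truly changed."""
    return num_metrics * alpha


if __name__ == "__main__":
    for m in (1, 5, 10, 20):
        print(
            f"{m:>2} metrics: "
            f"P(at least one false positive) = {familywise_error_rate(m):.3f}, "
            f"expected false positives = {expected_false_positives(m):.2f}"
        )
```

With a single pre-chosen primary metric the decision carries only the 5% risk of that one test, which is the point of naming it in advance.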
Guardrails are a special case
Guardrail metrics sit alongside secondary metrics but with a defensive job: they flag when a change improves the primary at an unacceptable cost elsewhere (latency, revenue, complaints). A change that wins on the primary but breaches a guardrail should not ship unchanged.
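The three roles can be expressed as a small decision rule. This is a minimal sketch, not a prescribed implementation: the `MetricResult` type, its fields, and the `"ship"` / `"hold"` / `"no-ship"` labels are all hypothetical names chosen for illustration, and how a metric comes to be marked as improved or breached is left to whatever analysis you already run.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MetricResult:
    name: str
    role: str               # "primary", "secondary", or "guardrail"
    improved: bool = False  # significant improvement (hypothetical flag)
    breached: bool = False  # only meaningful for guardrails


def ship_decision(results: Iterable[MetricResult]) -> str:
    """Return "ship", "hold", or "no-ship".

    Only the single primary metric can justify shipping. Guardrails can
    block a ship. Secondary metrics are never consulted for the verdict.
    """
    results = list(results)
    primaries = [r for r in results if r.role == "primary"]
    if len(primaries) != 1:
        raise ValueError("exactly one primary metric must be named")

    if not primaries[0].improved:
        return "no-ship"  # a secondary 'win' alone does not count
    if any(r.role == "guardrail" and r.breached for r in results):
        return "hold"     # should not ship unchanged
    return "ship"


if __name__ == "__main__":
    outcome = ship_decision([
        MetricResult("checkout_conversion", "primary", improved=True),
        MetricResult("scroll_depth", "secondary", improved=True),
        MetricResult("page_latency", "guardrail", breached=True),
    ])
    print(outcome)
```

Secondary metrics are deliberately ignored by the rule: they stay in the report as context, but moving one of them cannot change the outcome.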
How it appears in analytics and logs
If a test 'won' on a secondary metric while the primary was flat, that is usually noise from many comparisons — not a result to ship on.
Diagnostic use case
Pick one primary metric to drive the ship decision before launch; treat secondary metrics as supporting evidence and guardrails as checks for harm, not as additional ways to declare victory.
What WebmasterID can help detect
WebmasterID lets you instrument primary and secondary metrics from the same first-party event stream, so the decision metric and its context come from one consistent source.
Common mistakes
- Declaring a win on whichever metric happened to move.
- Running significance tests on every metric and shipping on any hit.
- Forgetting guardrails when the primary metric improves.
Privacy and accuracy notes
Metrics are aggregate rates over a cohort. Separating primary from secondary is a measurement decision and needs no personal identifiers.
Related pages
- Designing an experiment hypothesis
Before running an A/B test you write a hypothesis: a falsifiable statement linking a specific change to an expected effect on a named metric, for a defined audience, with a rationale. A good hypothesis fixes the success metric in advance, which prevents post-hoc metric shopping. This page covers the structure of a hypothesis and the reasoning behind it.
- Guardrail metrics in experiments
Guardrail metrics are the secondary measures you monitor during an experiment to make sure a change that improves the primary metric does not quietly damage something important — load time, retention, refunds, support load. They turn 'did the target go up' into the fuller question 'did the target go up without breaking anything'.
- North star metric
A north star metric is the one measure a team chooses to represent the core value it delivers, used to align decisions. Its value is focus: a single shared metric stops teams optimising in different directions. Its risk is tunnel vision — any single metric can be gamed, so it needs guardrail metrics around it and a clear link to real value.
- P-value misconceptions
The p-value is one of the most misread numbers in experimentation. It is the probability of seeing data at least as extreme as observed if the null hypothesis were true — not the probability the null is true, not the probability of a fluke, and not a measure of effect size. The American Statistical Association issued a formal statement listing exactly these misconceptions.
Sources and verification notes
- Wikipedia, "Multiple comparisons problem": why testing many metrics inflates false positives.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.