Feature flags and experiments
A feature flag is a runtime switch that turns functionality on or off for chosen users without a new deploy. Flags power gradual rollouts, kill switches, and, when the audience is split randomly and outcomes are measured, controlled experiments. Knowing where these uses overlap keeps you from mistaking a rollout (an operational step) for an experiment (a measured comparison).
What this means
A feature flag (feature toggle) decouples deploying code from releasing behaviour: the code ships dark and a flag decides who sees it at runtime. Flags serve several jobs — gradual percentage rollouts, instant kill switches, targeting specific segments, and experimentation. When the flag assigns users randomly and you compare a metric between the on and off groups, the flag is delivering an A/B test.
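The jobs described above can be sketched in a few lines. This is a minimal illustration, not any particular flag service's API: the Flag class, its field names, and the is_enabled and bucket helpers are all hypothetical. It assumes each user has a stable identifier and uses a hash of the flag key and user ID so that the same user always lands in the same bucket.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Flag:
    """Hypothetical flag definition; the field names are illustrative only."""
    key: str
    enabled: bool = True          # kill switch: False turns the feature off for everyone
    rollout_percent: float = 0.0  # share of users (0-100) who see the feature
    allow_segments: set = field(default_factory=set)  # segments that always see it


def bucket(flag_key: str, user_id: str) -> float:
    """Map (flag, user) to a stable number in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).hexdigest()
    return (int(digest[:8], 16) % 10_000) / 100.0


def is_enabled(flag: Flag, user_id: str, segment: Optional[str] = None) -> bool:
    """Decide at runtime whether this user sees the flagged behaviour."""
    if not flag.enabled:
        return False  # kill switch wins over everything else
    if segment is not None and segment in flag.allow_segments:
        return True   # targeted segment
    # Percentage rollout: users whose bucket falls below the threshold are in.
    return bucket(flag.key, user_id) < flag.rollout_percent
```

Because the bucket depends only on the flag key and the user ID, raising rollout_percent from 10 to 50 keeps every user who was already in and adds new ones, which is what a gradual ramp usually wants. Setting enabled to False switches the feature off without a deploy.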
Rollout versus experiment
A rollout and an experiment can use identical flag plumbing but answer different questions. A rollout asks 'can we safely turn this on for everyone?' and ramps the percentage while watching for breakage. An experiment asks 'does this change the metric versus not having it?' and requires random assignment, a control group, a pre-declared metric, and enough sample for a valid comparison.
Conflating them is a common error: ramping a flag to 100% because nothing broke is not evidence the change improved anything. Only a measured comparison against a control group can show that.
- Flag = runtime switch, decoupled from deploy
- Rollout: ramp safely; experiment: measure vs control
- Experiment needs random assignment and a metric
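To make the "measure vs control" step concrete, here is one possible sketch of the comparison for a conversion-rate metric. It uses a two-proportion z-test, which is one common choice and not the only valid analysis. The function and field names are hypothetical, and the sketch assumes users were assigned randomly and that the metric and sample size were fixed before the test started.

```python
import math
from dataclasses import dataclass


@dataclass
class Comparison:
    control_rate: float
    variant_rate: float
    difference: float  # variant_rate - control_rate
    z: float           # test statistic
    p_value: float     # two-sided, normal approximation


def compare_conversion(control_users: int, control_conversions: int,
                       variant_users: int, variant_conversions: int) -> Comparison:
    """Two-proportion z-test on conversion counts from two randomly assigned groups."""
    if control_users <= 0 or variant_users <= 0:
        raise ValueError("both groups need at least one user")

    control_rate = control_conversions / control_users
    variant_rate = variant_conversions / variant_users
    difference = variant_rate - control_rate

    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (control_conversions + variant_conversions) / (control_users + variant_users)
    se = math.sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / variant_users))

    if se == 0:
        # No variation at all (for example zero conversions in both groups).
        return Comparison(control_rate, variant_rate, difference, 0.0, 1.0)

    z = difference / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return Comparison(control_rate, variant_rate, difference, z, p_value)
```

For example, compare_conversion(1000, 100, 1000, 130) compares a 10% control rate with a 13% variant rate. A rollout that merely "did not break" produces none of these numbers, which is why it cannot stand in for the experiment.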
How it appears in analytics and logs
A flag that is rolled out to a growing percentage is operational delivery. The same flag with random assignment and a measured outcome against a held-back group is a controlled experiment. The switch is the same in both cases; the difference lies in how users are assigned and how the outcome is analysed.
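As an illustration of what the analysis side might look like, the sketch below turns a flat event log into per-cohort counts. The event shape (user_id, type, flag, variant) and the event names are assumptions made for this example, not the schema of any specific analytics tool. It also assumes the events are already sorted by time, so a conversion only counts if it comes after the user's first exposure to the flag.

```python
def cohort_summary(events, flag, metric_event):
    """Count exposed users and converted users per cohort for one flag.

    events: iterable of dicts in time order (hypothetical shape).
    flag: the flag key to analyse.
    metric_event: the event type that counts as a conversion.
    """
    cohort_of = {}     # user_id -> cohort from the user's first exposure
    converted = set()  # users who converted after being exposed

    for event in events:
        user = event["user_id"]
        if event["type"] == "flag_exposure" and event.get("flag") == flag:
            cohort_of.setdefault(user, event["variant"])
        elif event["type"] == metric_event and user in cohort_of:
            converted.add(user)

    summary = {}
    for user, cohort in cohort_of.items():
        row = summary.setdefault(cohort, {"users": 0, "converted": 0})
        row["users"] += 1
        if user in converted:
            row["converted"] += 1
    return summary


# Hypothetical sample log: "on" is the variant cohort, "off" is the control.
sample_events = [
    {"user_id": "u1", "type": "flag_exposure", "flag": "new-checkout", "variant": "on"},
    {"user_id": "u1", "type": "purchase"},
    {"user_id": "u2", "type": "flag_exposure", "flag": "new-checkout", "variant": "off"},
    {"user_id": "u3", "type": "flag_exposure", "flag": "new-checkout", "variant": "on"},
    {"user_id": "u4", "type": "purchase"},  # never exposed, so not counted
    {"user_id": "u2", "type": "page_view"},
]

summary = cohort_summary(sample_events, "new-checkout", "purchase")
```

These per-cohort counts describe what each group did. Whether they amount to an experiment still depends on the cohorts having been assigned at random.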
Diagnostic use case
Use flags to ship safely and to deliver experiment variants, but only call it an experiment when assignment is random and a metric is compared with a control.
What WebmasterID can help detect
WebmasterID measures the first-party events that show what each flagged cohort did, which is the data needed to evaluate an experiment built on flags.
Common mistakes
- Treating a successful rollout as proof a change worked.
- Skipping random assignment when a flag delivers an experiment variant.
- Leaving stale flags in place, muddying later analysis.
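The last mistake is the easiest to check for mechanically. As a rough sketch, assuming you keep a simple inventory of flags with a rollout percentage and a last-changed date (the record shape and the 30-day threshold here are hypothetical), a periodic audit could list flags that have sat fully on or fully off long enough to be candidates for removal:

```python
from datetime import date, timedelta


def stale_flags(flags, today, max_age_days=30):
    """Return keys of flags pinned at 0% or 100% for longer than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        f["key"]
        for f in flags
        if f["rollout_percent"] in (0, 100) and f["last_changed"] < cutoff
    )


# Hypothetical inventory for illustration.
inventory = [
    {"key": "new-checkout", "rollout_percent": 100, "last_changed": date(2026, 1, 10)},
    {"key": "beta-search", "rollout_percent": 25, "last_changed": date(2026, 1, 5)},
    {"key": "old-banner", "rollout_percent": 0, "last_changed": date(2026, 5, 30)},
]

candidates = stale_flags(inventory, today=date(2026, 6, 1))
```

A flag that is still ramping, such as one at 25%, is left alone however old it is. Only flags whose state no longer varies are reported.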
Privacy and accuracy notes
Flag assignment and experiment analysis rely on aggregate cohorts, not personal profiling. This page is educational.
Related pages
- Control and variant in experiments
In an experiment the control is the existing version that acts as the baseline, and the variant is the version carrying the one change you are testing. Comparing the two only yields a clean answer when assignment is random and the variant differs from the control in exactly one way. Multiple variants are possible but each must be isolated.
- Holdout groups
A holdout group is a randomly chosen set of users who are intentionally excluded from one or more shipped changes, so their behaviour serves as a long-run baseline. Where an A/B test measures one change briefly, a holdout measures the combined, sustained effect of everything launched, guarding against the slow accumulation of small regressions or overstated wins.
- A/B testing fundamentals
An A/B test randomly assigns visitors to a control (A) or a variant (B), shows each group one version, and compares a pre-chosen metric. Random assignment is what lets you attribute a difference to the change rather than to who happened to see it. The discipline is in deciding the metric and sample size before you start, not after you peek at the numbers.
- Event Explorer
See what each flagged cohort did via events.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.