Switchback experiments
A switchback experiment randomises treatment at the level of time windows (and sometimes regions) rather than users: the entire system runs control for one interval, treatment for the next, alternating on a schedule. It is used where treating some users affects others — marketplaces, pricing, dispatch — so a user-level split would leak between arms. Time becomes the randomisation unit.
Time as the unit of randomisation
Instead of assigning each visitor to a variant, a switchback flips the whole system between control and treatment on a schedule — for example, alternating every 30 minutes, sometimes crossed with region. All traffic in a treatment window sees treatment; all traffic in a control window sees control. Comparing aggregate outcomes across the two sets of windows estimates the effect at the system level.
- Randomise time windows (sometimes × region), not users
- Whole system runs one variant per window
- Designed for interference between users
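The window-level assignment described above can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the function name `build_schedule`, the 30-minute default, and the choice to randomise the order inside each consecutive pair of windows (which keeps the two arms balanced) are all assumptions made for the example.

```python
import random
from datetime import datetime, timedelta


def build_schedule(start, n_windows, window_minutes=30, seed=0):
    """Return a list of (window_start, window_end, variant) tuples.

    Windows are handled in consecutive pairs; each pair holds one control
    and one treatment window in random order (an assumed design choice).
    """
    rng = random.Random(seed)  # fixed seed so the schedule is reproducible
    width = timedelta(minutes=window_minutes)
    schedule = []
    pair = []
    for i in range(n_windows):
        if i % 2 == 0:
            # Start a new pair and shuffle which variant goes first.
            pair = ["control", "treatment"]
            rng.shuffle(pair)
        window_start = start + i * width
        schedule.append((window_start, window_start + width, pair[i % 2]))
    return schedule


schedule = build_schedule(datetime(2024, 1, 1, 0, 0), n_windows=8)
for window_start, window_end, variant in schedule:
    print(window_start.isoformat(), variant)
```

Every request arriving inside a window would be served the variant listed for that window. Crossing the schedule with region would mean building one such schedule per region with a different seed.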
When and why
Switchbacks address interference: when treating some users alters the experience of others, a user-level A/B test violates the no-spillover assumption and biases results. This is common in marketplaces (shared supply), dynamic pricing, and matching systems. The design's own risk is carryover — effects from one window bleeding into the next — handled with burn-in periods and randomised window order. Fewer effective units (windows) than users usually means lower power.
It is the time-based answer to network effects in experiments.
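One way to apply a burn-in period is to discard events that arrive in the first few minutes of each window, when the previous variant may still be influencing the system. The sketch below assumes events are simple `(timestamp, value)` tuples and that the schedule is a list of `(start, end, variant)` tuples; the function name `drop_burn_in` and the five-minute default are hypothetical.

```python
from datetime import datetime, timedelta


def drop_burn_in(events, schedule, burn_in_minutes=5):
    """Keep events that fall inside a window but after its burn-in period.

    events:   list of (timestamp, value)
    schedule: list of (window_start, window_end, variant)
    Returns a list of (timestamp, value, variant).
    """
    burn_in = timedelta(minutes=burn_in_minutes)
    kept = []
    for ts, value in events:
        for start, end, variant in schedule:
            if start <= ts < end:
                # Event belongs to this window; keep it only after burn-in.
                if ts >= start + burn_in:
                    kept.append((ts, value, variant))
                break
    return kept


t0 = datetime(2024, 1, 1, 0, 0)
schedule = [
    (t0, t0 + timedelta(minutes=30), "control"),
    (t0 + timedelta(minutes=30), t0 + timedelta(minutes=60), "treatment"),
]
events = [
    (t0 + timedelta(minutes=2), 1.0),   # inside control burn-in: dropped
    (t0 + timedelta(minutes=10), 2.0),  # kept, control
    (t0 + timedelta(minutes=31), 3.0),  # inside treatment burn-in: dropped
    (t0 + timedelta(minutes=45), 4.0),  # kept, treatment
]
kept = drop_burn_in(events, schedule)
```

A refinement would apply the burn-in only to windows whose variant differs from the previous window, since no switch means no carryover to wash out. Longer burn-ins reduce carryover bias but also discard more data.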
How it appears in analytics and logs
Effects are estimated by comparing treatment windows to control windows; carryover between adjacent windows can bias the estimate if not handled.
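A minimal sketch of that comparison follows, treating each window as one observation. It assumes the metric has already been aggregated to one value per window, and it uses a plain difference in means with a Welch-style standard error. That standard error treats windows as independent, so it ignores autocorrelation and carryover between adjacent windows and is likely optimistic; it is a simplification for illustration, not the estimator from the switchback literature. The function name and input shape are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def window_level_effect(windows):
    """Estimate the effect from per-window aggregates.

    windows: list of dicts with keys 'variant' and 'metric', where 'metric'
    is one aggregated value per window (for example, conversions per window).
    Returns (difference in means, naive standard error).
    """
    treated = [w["metric"] for w in windows if w["variant"] == "treatment"]
    control = [w["metric"] for w in windows if w["variant"] == "control"]
    effect = mean(treated) - mean(control)
    # Naive standard error: assumes windows are independent observations.
    se = sqrt(stdev(treated) ** 2 / len(treated) + stdev(control) ** 2 / len(control))
    return effect, se


windows = [
    {"variant": "treatment", "metric": 12},
    {"variant": "control", "metric": 10},
    {"variant": "treatment", "metric": 14},
    {"variant": "control", "metric": 11},
    {"variant": "treatment", "metric": 13},
    {"variant": "control", "metric": 12},
    {"variant": "treatment", "metric": 15},
    {"variant": "control", "metric": 11},
]
effect, se = window_level_effect(windows)
print(f"effect={effect:.2f}, naive SE={se:.2f}")
```

The unit of analysis here is the window, not the user or the event. Analysing at the event level while assigning at the window level would understate the variance.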
Diagnostic use case
Use switchbacks when treating one user changes the experience of others (shared inventory, pricing, matching), which breaks the independence that a user-level split assumes.
What WebmasterID can help detect
WebmasterID's time-stamped first-party events let you attribute conversions to the treatment window that was active when they occurred.
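The attribution step can be sketched as a lookup from an event timestamp to the window that was active at that moment. The example below assumes conversions are available as plain timestamps and the schedule is a time-sorted list of `(start, end, variant)` tuples; these shapes and the helper name `variant_at` are hypothetical and do not reflect any actual WebmasterID export format.

```python
from bisect import bisect_right
from collections import Counter
from datetime import datetime, timedelta


def variant_at(schedule, ts):
    """Return the variant active at time ts, or None if no window covers it.

    schedule must be sorted by window start time.
    """
    starts = [start for start, _, _ in schedule]
    i = bisect_right(starts, ts) - 1  # last window starting at or before ts
    if i < 0:
        return None
    _, end, variant = schedule[i]
    return variant if ts < end else None


t0 = datetime(2024, 1, 1, 0, 0)
half_hour = timedelta(minutes=30)
schedule = [
    (t0, t0 + half_hour, "control"),
    (t0 + half_hour, t0 + 2 * half_hour, "treatment"),
]
conversions = [
    t0 + timedelta(minutes=5),   # control window
    t0 + timedelta(minutes=40),  # treatment window
    t0 + timedelta(minutes=55),  # treatment window
    t0 + timedelta(minutes=90),  # after the experiment: unattributed
]
counts = Counter(variant_at(schedule, ts) for ts in conversions)
print(counts)
```

Windows are treated as half-open intervals, so an event landing exactly on a boundary belongs to the window that starts there. Events outside every window are left unattributed rather than forced into an arm.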
Common mistakes
- Using user-level A/B tests where treatment spills between users.
- Ignoring carryover between adjacent windows.
- Underestimating the power cost of few time-window units.
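The power cost in the last item can be made concrete with the standard minimum detectable effect formula for a two-arm comparison of means with equal arm sizes. The sketch assumes independent units, which is optimistic for adjacent time windows, and expresses the result in standard deviations of the unit-level metric; the numbers of windows and users are illustrative assumptions, and the window-level and user-level standard deviations are not directly comparable.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sigma, n_per_arm, alpha=0.05, power=0.8):
    """Smallest true difference in means detectable at the given power.

    Two-sided test, equal arm sizes, independent units (an assumption
    that adjacent time windows often violate).
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return (z_alpha + z_power) * sigma * sqrt(2 / n_per_arm)


# Two weeks of 30-minute windows, split evenly between the arms.
windows_per_arm = 14 * 48 // 2
mde_windows = minimum_detectable_effect(sigma=1.0, n_per_arm=windows_per_arm)

# A hypothetical user-level test with 100,000 users per arm.
mde_users = minimum_detectable_effect(sigma=1.0, n_per_arm=100_000)

print(f"windows: {mde_windows:.3f} sd, users: {mde_users:.3f} sd")
```

With 336 windows per arm the detectable effect is roughly 0.22 standard deviations of the window-level metric, and it shrinks only with the square root of the number of windows. Shorter windows give more units but raise the share of each window exposed to carryover.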
Privacy and accuracy notes
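Switchbacks randomise time windows or regions, not identifiable individuals, so assignment needs no user identifier and analysis can often run on window-level aggregates, which can reduce per-user data needs.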
Related pages
- Network effects in experiments
Standard A/B tests assume each user's outcome depends only on their own assigned variant — the no-interference (SUTVA) assumption. Network effects break it: in social products, marketplaces, or anything with sharing, a treated user changes the experience of untreated users, so control is 'contaminated' and the measured effect is biased. Cluster, switchback, or ego-network designs reduce the leakage.
- Randomization unit
The randomization unit is the thing you randomly assign to control or treatment: a user, a session, a device, a cookie, or a cluster. The choice must match how you analyse and how users experience the change. Mismatches cause two classic failures — a user flipping variants between sessions (inconsistent experience) and analysing at a finer grain than you assigned (understated variance, false significance).
- Interleaving experiments
Interleaving compares two ranking algorithms by merging their results into a single list shown to the same user, then crediting whichever ranker contributed the items that were clicked. Because each user sees both rankers' picks side by side, within-user comparison removes between-user noise, making interleaving far more sensitive than splitting users between two whole rankings — widely documented for search and recommendation evaluation.
- Event Explorer
Time-stamped events that can be attributed to the treatment window active when they occurred.
Sources and verification notes
- Bojinov, Simchi-Levi, Zhao: Design and Analysis of Switchback Experiments (working paper). Peer-reviewed methodology for switchback design and estimation.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.