Interleaving experiments
Interleaving compares two ranking algorithms by merging their results into a single list shown to the same user, then crediting whichever ranker contributed the items that were clicked. Because each user sees picks from both rankers in one combined list, the comparison happens within a user and removes between-user noise. That makes interleaving far more sensitive than splitting users between two whole rankings, as reported in the search and recommendation evaluation literature.
Blend two rankings, read the clicks
Given ranker A and ranker B, interleaving constructs one result list by alternating or team-drafting items from each, tracking which ranker contributed each position. The user sees a single combined list. Clicks are then attributed to the contributing ranker, and the side that accrues more clicks is preferred. Because the comparison happens within each user's single session, it controls for the user-to-user variation that dominates a between-user split.
- Merge both rankers' results into one list
- Credit clicks to the contributing ranker
- Within-user comparison cancels between-user noise
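The merge-and-credit steps can be sketched in Python. This is a minimal illustration of team-draft interleaving, one of the two merge styles mentioned above, not a reference implementation. The function names `team_draft_interleave` and `credit_clicks`, the "A"/"B" team labels, and the assumption that items are hashable, non-None identifiers are all hypothetical choices made for the example.

```python
import random
from typing import Hashable, List, Optional, Sequence, Tuple


def team_draft_interleave(
    ranking_a: Sequence[Hashable],
    ranking_b: Sequence[Hashable],
    length: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[Hashable], List[str]]:
    """Merge two rankings; return (merged items, contributing team per position)."""
    rng = rng or random.Random()
    merged: List[Hashable] = []
    teams: List[str] = []
    seen = set()
    picks = {"A": 0, "B": 0}
    sources = {"A": ranking_a, "B": ranking_b}

    while len(merged) < length:
        # Highest-ranked item from each ranker that is not already in the list.
        candidates = {
            team: next((item for item in ranking if item not in seen), None)
            for team, ranking in sources.items()
        }
        available = [team for team, item in candidates.items() if item is not None]
        if not available:
            break  # both rankings are exhausted
        if len(available) == 1:
            team = available[0]
        elif picks["A"] != picks["B"]:
            # The team with fewer picks so far drafts next.
            team = "A" if picks["A"] < picks["B"] else "B"
        else:
            # Equal picks: a coin flip decides who drafts first this round.
            team = rng.choice(["A", "B"])
        item = candidates[team]
        merged.append(item)
        teams.append(team)
        seen.add(item)
        picks[team] += 1
    return merged, teams


def credit_clicks(teams: Sequence[str], clicked_positions: Sequence[int]) -> Tuple[int, int]:
    """Count clicks per team, given the 0-based positions that were clicked."""
    a_clicks = sum(1 for pos in clicked_positions if teams[pos] == "A")
    b_clicks = sum(1 for pos in clicked_positions if teams[pos] == "B")
    return a_clicks, b_clicks


if __name__ == "__main__":
    merged, teams = team_draft_interleave(
        ["d1", "d2", "d3"], ["d2", "d4", "d1"], length=4, rng=random.Random(7)
    )
    print(merged, teams)
    print(credit_clicks(teams, clicked_positions=[0, 2]))
```

The `teams` list must be logged alongside the displayed list, because without it a click cannot be attributed to a ranker afterwards. The coin flip on ties keeps either ranker from systematically owning the top position.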
Sensitivity and limits
The within-user design is its strength: interleaving has been reported to reach reliable conclusions with far less traffic than equivalent A/B tests, which is why it is common for search and recommendation tuning. Its scope is the limit — it measures relative ranking preference via clicks, not downstream outcomes like conversion or revenue, so teams pair it with an A/B test on business metrics before launch.
Clicks are a proxy; confirm the winner improves the outcome that matters.
How it appears in analytics and logs
In logs, a consistent click preference for one ranker's contributed items indicates that users prefer that ranker. The signal typically emerges with less traffic than a between-user test would need.
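One way to read "consistent click preference" from logged data is to count, per session, which ranker earned more clicks and then check whether the split could plausibly be chance. The sketch below uses an exact two-sided sign test for that check. The test choice, the `(a_clicks, b_clicks)` input shape, and the function names `sign_test_p_value` and `preference_summary` are assumptions made for illustration, not something the method prescribes.

```python
from math import comb
from typing import Dict, Iterable, Tuple


def sign_test_p_value(wins: int, decided: int) -> float:
    """Exact two-sided sign test p-value under a 50/50 null hypothesis."""
    if decided == 0:
        return 1.0
    tail = min(wins, decided - wins)
    # Probability of a split at least this lopsided, doubled for two sides.
    p = 2 * sum(comb(decided, i) for i in range(tail + 1)) / 2 ** decided
    return min(1.0, p)


def preference_summary(sessions: Iterable[Tuple[int, int]]) -> Dict[str, float]:
    """Summarise per-session (a_clicks, b_clicks) pairs into wins, ties and a p-value."""
    wins_a = wins_b = ties = 0
    for a_clicks, b_clicks in sessions:
        if a_clicks > b_clicks:
            wins_a += 1
        elif b_clicks > a_clicks:
            wins_b += 1
        else:
            ties += 1  # tied sessions carry no preference and are excluded from the test
    decided = wins_a + wins_b
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        "p_value": sign_test_p_value(wins_a, decided),
    }


if __name__ == "__main__":
    print(preference_summary([(2, 0), (1, 0), (0, 1), (1, 1)]))
```

A small p-value only says the click preference is unlikely to be noise. It says nothing about conversion or revenue, which still need their own test.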
Diagnostic use case
Use interleaving to compare two search or recommendation rankers quickly, when a small ranking improvement would be hard to detect with a user-split A/B test.
What WebmasterID can help detect
WebmasterID's first-party click and result-interaction events provide the signal that interleaving credits to each ranker.
Common mistakes
- Treating a click-preference winner as proof of better conversion.
- Using interleaving outside the ranking problems it was designed for.
- Skipping a follow-up A/B test on the real business metric.
Privacy and accuracy notes
Interleaving compares rankers using aggregate click preference; no extra personal data beyond normal interaction logging is required.
Related pages
- Search relevance testing
Search relevance testing improves how an internal site search ranks results: query understanding, synonyms, ranking signals, and zero-result handling. It is measured with operational metrics (zero-result rate, click-through on results, search refinements) and outcome metrics (search-to-conversion). Ranking variants are compared with A/B tests on outcomes, or with interleaving for sensitive within-user comparison of rankers.
- Recommendation testing
Recommendation testing compares the algorithms that suggest products or content — related items, 'you may also like', personalised feeds. It is judged on engagement (recommendation click-through), attributed downstream conversion or revenue, and guardrails like diversity and coverage. A central pitfall is the feedback loop: a recommender shapes the very clicks used to train and evaluate it, so offline and online evaluation must be designed carefully.
- Switchback experiments
A switchback experiment randomises treatment at the level of time windows (and sometimes regions) rather than users: the entire system runs control for one interval, treatment for the next, alternating on a schedule. It is used where treating some users affects others — marketplaces, pricing, dispatch — so a user-level split would leak between arms. Time becomes the randomisation unit.
- Event Explorer
Click events that credit each ranker in interleaving.
Sources and verification notes
- Chapelle, Joachims, Radlinski, Yue: Large-scale validation and analysis of interleaved search evaluation (ACM TOIS). Peer-reviewed validation of interleaving's sensitivity vs A/B.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.