Recommendation testing
Recommendation testing compares the algorithms that suggest products or content — related items, 'you may also like', personalised feeds. It is judged on engagement (recommendation click-through), attributed downstream conversion or revenue, and guardrails like diversity and coverage. A central pitfall is the feedback loop: a recommender shapes the very clicks used to train and evaluate it, so offline and online evaluation must be designed carefully.
What to measure
Engagement signals include recommendation impressions and click-through on recommended items. Outcome signals are the conversions or revenue attributable to a recommendation click. Guardrails matter: a recommender that maximises clicks can collapse into showing the same popular items, hurting diversity and long-term discovery, so track diversity and coverage (the share of the catalogue that ever gets shown) alongside engagement.
- Engagement: recommendation impressions and click-through
- Outcome: attributed conversion or revenue
- Guardrails: diversity, coverage, catalogue exposure
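As a rough illustration of the three metric groups, the sketch below computes click-through, coverage, and a simple exposure-diversity score from a list of shown and clicked item IDs. The function name, the input shape, and the choice of normalised Shannon entropy as the diversity measure are all assumptions made for this example; other diversity definitions are equally reasonable.

```python
import math
from collections import Counter


def recommendation_metrics(impressions, clicks, catalogue_size):
    """Summarise engagement and guardrail metrics for one recommender.

    impressions: list of item IDs shown in recommendation slots (hypothetical input shape)
    clicks: list of item IDs clicked from those slots
    catalogue_size: number of items that could have been recommended
    """
    shown = Counter(impressions)
    total = sum(shown.values())

    # Engagement: clicks per recommendation impression.
    ctr = len(clicks) / total if total else 0.0

    # Guardrail: share of the catalogue that was shown at least once.
    coverage = len(shown) / catalogue_size if catalogue_size else 0.0

    # Guardrail: Shannon entropy of exposure, scaled to [0, 1] by the
    # entropy of perfectly even exposure across the whole catalogue.
    entropy = -sum((n / total) * math.log(n / total) for n in shown.values()) if total else 0.0
    max_entropy = math.log(catalogue_size) if catalogue_size > 1 else 0.0
    diversity = entropy / max_entropy if max_entropy else 0.0

    return {"ctr": ctr, "coverage": coverage, "exposure_diversity": diversity}
```

A model that raises `ctr` while `coverage` and `exposure_diversity` fall is showing the narrowing pattern described above.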
The feedback-loop trap
A recommender influences which items users see and click, and those clicks often become its next training and evaluation data — a self-reinforcing loop that can make a model look better than it is and entrench popularity bias. Online A/B tests on incremental conversion are the cleaner judge, because they compare against a control that the new model did not shape. Watch for click-through gains that merely cannibalise clicks elsewhere rather than adding incremental value.
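Interleaving merges two recommenders' results into a single list shown to the same user and credits whichever recommender contributed the clicked items. Because the comparison is within-user, it can separate two recommenders sensitively before you commit to a full A/B test.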
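One common way to interleave is team-draft interleaving, sketched below under simplifying assumptions: the two recommenders alternate picks like team captains, a coin flip decides who goes first whenever they are level, and each click is credited to the recommender that contributed the clicked item. The function names and data shapes are illustrative, not taken from any particular library.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings into one list and record which ranker picked each item."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    team = {}          # item -> "A" or "B"
    interleaved = []

    while len(interleaved) < length:
        # The ranker with fewer picks goes next; a coin flip breaks ties.
        if picks["A"] < picks["B"]:
            side = "A"
        elif picks["B"] < picks["A"]:
            side = "B"
        else:
            side = rng.choice(["A", "B"])

        # Take that ranker's highest-ranked item not already in the list.
        item = next((i for i in rankings[side] if i not in team), None)
        if item is None:
            # This ranker is exhausted; let the other one fill the slot.
            side = "B" if side == "A" else "A"
            item = next((i for i in rankings[side] if i not in team), None)
            if item is None:
                break  # both rankings exhausted

        team[item] = side
        picks[side] += 1
        interleaved.append(item)

    return interleaved, team


def credit_clicks(team, clicked_items):
    """Count clicks per ranker based on which ranker contributed each clicked item."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team:
            wins[team[item]] += 1
    return wins
```

Aggregating `credit_clicks` over many sessions gives a preference signal between the two recommenders. It measures clicks rather than incremental conversion, so it complements an A/B test and does not replace one.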
How it appears in analytics and logs
High recommendation click-through with flat overall conversion can mean the recommender shifts clicks around rather than adding incremental conversions.
Diagnostic use case
A/B test recommender variants on attributed conversion, compared against a control so that the measured lift is incremental, rather than on click-through alone. Add guardrails so a high-engagement model does not narrow what users see.
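A minimal sketch of that decision, assuming conversion counts per arm and a coverage figure per arm are already available: a pooled two-proportion z-test on conversion, followed by a guardrail check on coverage. The function names, the 0.05 significance level, and the tolerated coverage drop are hypothetical choices for illustration, not recommended thresholds.

```python
import math


def two_proportion_z_test(conv_control, n_control, conv_variant, n_variant):
    """Return (absolute lift, z statistic, two-sided p-value) for variant vs control."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (p_v - p_c) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_v - p_c, z, p_value


def ship_decision(lift, p_value, coverage_control, coverage_variant,
                  alpha=0.05, max_coverage_drop=0.05):
    """Combine the primary metric with a coverage guardrail (thresholds are assumptions)."""
    if coverage_control - coverage_variant > max_coverage_drop:
        return "hold: coverage guardrail breached"
    if lift > 0 and p_value < alpha:
        return "ship"
    return "no significant lift"


lift, z, p = two_proportion_z_test(500, 10_000, 600, 10_000)
decision = ship_decision(lift, p, coverage_control=0.40, coverage_variant=0.39)
```

Checking the guardrail first means a variant that narrows exposure is held back even when its conversion lift is significant.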
What WebmasterID can help detect
WebmasterID's first-party recommendation-slot click and downstream conversion events let you attribute outcomes to each recommender variant.
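The sketch below shows one way such events could be rolled up per variant once exported. It is not WebmasterID's API or schema: the event fields (`visitor`, `ts`, `type`, `variant`, `item`, `value`), the event type names, and the 30-minute last-click window are all assumptions made for this example.

```python
from collections import defaultdict


def attribute_conversions(events, window_seconds=1800):
    """Attribute each conversion to the variant behind the visitor's last
    recommendation click on the same item, if it fell within the window.

    events: iterable of dicts using a hypothetical schema:
      {"visitor": str, "ts": int seconds, "type": "rec_click" | "conversion",
       "item": str, "variant": str (clicks only), "value": float (conversions only)}
    """
    last_click = {}  # (visitor, item) -> (timestamp, variant)
    stats = defaultdict(lambda: {"clicks": 0, "conversions": 0, "revenue": 0.0})

    for event in sorted(events, key=lambda e: e["ts"]):
        key = (event["visitor"], event["item"])
        if event["type"] == "rec_click":
            last_click[key] = (event["ts"], event["variant"])
            stats[event["variant"]]["clicks"] += 1
        elif event["type"] == "conversion" and key in last_click:
            clicked_at, variant = last_click.pop(key)
            # Only credit the click if the conversion followed soon enough.
            if event["ts"] - clicked_at <= window_seconds:
                stats[variant]["conversions"] += 1
                stats[variant]["revenue"] += event.get("value", 0.0)

    return dict(stats)
```

Attributed totals like these describe what followed a recommendation click. Whether those conversions are incremental still has to come from comparing variants against a control.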
Common mistakes
- Judging recommenders on click-through instead of incremental conversion.
- Ignoring feedback loops that entrench popularity bias.
- Dropping diversity and coverage guardrails.
Privacy and accuracy notes
Recommenders can rely on behavioural profiles; keep inputs first-party and within consent, and avoid building identifying profiles.
Related pages
- Personalization and conversion
Personalization shows different content to different visitors based on segment, behaviour, or context. It is often assumed to lift conversion, but assumption is not evidence: personalization adds complexity and can backfire, so it must be tested like any other change, against a holdout, on a metric chosen in advance.
- Interleaving experiments
Interleaving compares two ranking algorithms by merging their results into a single list shown to the same user, then crediting whichever ranker contributed the items that were clicked. Because each user sees both rankers' picks side by side, within-user comparison removes between-user noise, making interleaving far more sensitive than splitting users between two whole rankings — widely documented for search and recommendation evaluation.
- Guardrail metrics in experiments
Guardrail metrics are the secondary measures you monitor during an experiment to make sure a change that improves the primary metric does not quietly damage something important — load time, retention, refunds, support load. They turn 'did the target go up' into the fuller question 'did the target go up without breaking anything'.
- Event Explorer
Recommendation-slot clicks and downstream conversions.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.