Multi-armed bandit testing
A multi-armed bandit is an adaptive allocation strategy: instead of holding a fixed split, it sends more traffic to variants that look better as data accumulates. It aims to minimise 'regret' (conversions lost by showing inferior options), but the moving allocation complicates classic inference. Bandits suit ongoing optimisation more than one-off learning. This page explains the trade-off honestly.
Explore versus exploit
The bandit framing comes from a gambler facing slot machines ('one-armed bandits') with unknown payouts. Every pull is a choice between exploring (learning which arm is best) and exploiting (playing the arm that looks best so far). Adaptive allocation tilts toward exploitation as evidence grows, reducing conversions lost to weak variants.
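One simple way to make the explore/exploit choice concrete is epsilon-greedy selection. The sketch below is illustrative only: the function names, the 10% exploration rate, and the simulated conversion rates are assumptions for the example, not a description of any particular product's algorithm.

```python
import random

def choose_arm(successes, trials, epsilon=0.1, rng=random):
    """Pick an arm index with epsilon-greedy selection (hypothetical helper)."""
    if rng.random() < epsilon:
        # Explore: pick any arm uniformly at random.
        return rng.randrange(len(trials))
    # Exploit: pick the arm with the best observed rate so far.
    # Untried arms get rate 1.0 so each one is sampled at least once.
    rates = [s / t if t else 1.0 for s, t in zip(successes, trials)]
    return max(range(len(rates)), key=rates.__getitem__)

def simulate(true_rates, pulls=5000, epsilon=0.1, seed=0):
    """Run a simulated test; true_rates are assumed and unknown in practice."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    trials = [0] * len(true_rates)
    for _ in range(pulls):
        arm = choose_arm(successes, trials, epsilon, rng)
        trials[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
    return successes, trials

if __name__ == "__main__":
    successes, trials = simulate([0.05, 0.20])
    print("pulls per arm:", trials)
```

With a small epsilon, most pulls go to whichever arm currently looks best, while the remaining pulls keep testing the alternatives in case the early picture was wrong.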
What you gain and give up
The gain is lower regret: fewer users see the losing variant while the test runs, which matters when traffic is valuable. The cost is statistical: because allocation changes with the data, the clean confidence intervals of a fixed A/B test no longer apply directly, and estimating a precise effect size for each arm is harder.
- Lower regret during the test
- Adaptive split, not a fixed ratio
- Harder to extract a clean effect estimate
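The regret side of this trade-off can be illustrated with a little arithmetic. The sketch below assumes the true conversion rates are known, which they never are during a real test; the rates and traffic counts are made-up numbers chosen only to show the calculation.

```python
def expected_regret(true_rates, pulls_per_arm):
    """Expected conversions lost versus always serving the best arm."""
    best = max(true_rates)
    return sum((best - rate) * n for rate, n in zip(true_rates, pulls_per_arm))

# Hypothetical true conversion rates: A = 4%, B = 5%.
rates = [0.04, 0.05]

# Same 10,000 visitors, two allocations (both illustrative).
fixed_split = expected_regret(rates, [5000, 5000])  # about 50 lost conversions
adaptive = expected_regret(rates, [1500, 8500])     # about 15 lost conversions

if __name__ == "__main__":
    print(f"fixed 50/50: {fixed_split:.1f}, adaptive: {adaptive:.1f}")
```

The adaptive allocation loses fewer conversions in this example, but arm A ends up with far fewer observations, which is why its effect estimate is less precise.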
When a bandit fits
Bandits suit short-lived or continuous-optimisation problems such as headlines, promotions, and layouts, where you mainly want to serve the best option soon. When the goal is a durable, well-estimated understanding of an effect (for example, to inform strategy), a fixed-horizon A/B test is usually the cleaner instrument.
How it appears in analytics and logs
Shifting traffic shares are expected behaviour for a bandit, not a bug. But because the split adapts, you cannot read per-arm results as you would in a fixed 50/50 A/B test: the arms receive different sample sizes, collected at different times.
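To see the shifting shares in your own data, one option is to compute each arm's share of exposures per period. This is a minimal sketch assuming exposure records are available as (period, arm) pairs; that record shape and the sample log are hypothetical, not a real export format.

```python
from collections import Counter, defaultdict

def allocation_by_period(exposures):
    """Return {period: {arm: share}} from (period, arm) exposure records."""
    counts = defaultdict(Counter)
    for period, arm in exposures:
        counts[period][arm] += 1
    shares = {}
    for period, arm_counts in sorted(counts.items()):
        total = sum(arm_counts.values())
        shares[period] = {arm: n / total for arm, n in sorted(arm_counts.items())}
    return shares

# Hypothetical log: day 1 is an even split, day 2 has tilted toward B.
log = (
    [("day1", "A")] * 5 + [("day1", "B")] * 5
    + [("day2", "A")] * 2 + [("day2", "B")] * 8
)
shares = allocation_by_period(log)

if __name__ == "__main__":
    for period, split in shares.items():
        print(period, split)
```

A share that drifts steadily toward one arm is the bandit working as designed; a share that jumps abruptly or never moves may be worth investigating.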
Diagnostic use case
Use a bandit when the goal is to maximise conversions during the test (exploit the winner fast), rather than to obtain a clean, fixed-horizon estimate of the effect size.
What WebmasterID can help detect
WebmasterID's first-party per-variant conversion events give a bandit the aggregate signal it needs, and let you audit the realised allocation over time.
Common mistakes
- Reading bandit arms as if they were a fixed 50/50 A/B test.
- Using a bandit when you need a precise effect-size estimate.
- Ignoring that early allocation is noisy before evidence accrues.
Privacy and accuracy notes
A bandit allocates by performance, not by identity. It can run on aggregate per-arm rates without storing personal data about who saw which variant.
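As an illustration of allocating from aggregates alone, the sketch below uses Thompson sampling, one common bandit technique (the source does not say which algorithm is used). It needs only per-arm conversion and exposure counts; the function name and the uniform Beta(1, 1) prior are assumptions for the example.

```python
import random

def thompson_choose(conversions, exposures, rng=random):
    """Pick an arm by drawing once from each arm's Beta posterior."""
    draws = []
    for conv, exp in zip(conversions, exposures):
        # Beta(1, 1) is a uniform prior; add observed successes and failures.
        draws.append(rng.betavariate(1 + conv, 1 + exp - conv))
    # Serve the arm whose sampled conversion rate is highest.
    return max(range(len(draws)), key=draws.__getitem__)

if __name__ == "__main__":
    # Hypothetical aggregate counts: no user identifiers are involved.
    conversions = [40, 55]
    exposures = [1000, 1000]
    picks = [thompson_choose(conversions, exposures) for _ in range(1000)]
    print("share sent to arm 1:", picks.count(1) / len(picks))
```

The only inputs are two counters per arm, so nothing about who saw which variant has to be stored for the allocation step itself.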
Related pages
- Contextual bandit optimisation
A contextual bandit extends the bandit idea by conditioning the choice of variant on context — features available at decision time, such as device or referrer. It learns a policy that maps context to the option likely to convert, allowing per-segment personalisation. This raises the same inference caveats as bandits plus risks around the context features used. This page covers both.
- A/B testing fundamentals
An A/B test randomly assigns visitors to a control (A) or a variant (B), shows each group one version, and compares a pre-chosen metric. Random assignment is what lets you attribute a difference to the change rather than to who happened to see it. The discipline is in deciding the metric and sample size before you start, not after you peek at the numbers.
- Bayesian A/B testing
Bayesian A/B testing treats the conversion rate of each arm as an unknown with a probability distribution. It combines a prior belief with observed data to produce a posterior, from which you can state things like 'the probability that B beats A is high' and quantify the expected loss of choosing wrong. It is an alternative framing to the frequentist p-value, with different assumptions rather than a guarantee of more truth.
- Traffic allocation in experiments
Traffic allocation decides what fraction of eligible users enter an experiment and how that fraction divides among variants. A 50/50 split between two arms maximises statistical power for a fixed sample; ramping exposure limits blast radius. Allocation is a deliberate trade-off between speed, risk, and the number of variants. This page explains the levers.
Sources and verification notes
- Wikipedia, "Multi-armed bandit": explore/exploit trade-off and regret minimisation.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.