Pitfalls of parsing user agents with regex
Writing your own regular expressions to parse user-agent strings is fragile: strings carry overlapping legacy tokens, Chromium browsers share Chrome and Safari tokens, and new browsers appear constantly. Hand-rolled patterns produce false matches and silently rot. A maintained user-agent parser, Client Hints, or feature detection are more durable approaches.
What this means
A user-agent string is not a clean, structured field. It carries legacy tokens (Mozilla, KHTML), shared engine names (every Chromium browser has Chrome and Safari tokens), and a long tail of niche and embedded clients. A regex that looks reasonable for the top browsers quietly mismatches the rest.
The classic failures: matching Safari before Chrome and labelling Chrome as Safari; matching a substring that also appears in another browser; and bucketing every new or niche browser as Other because the pattern never anticipated it.
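The Safari-before-Chrome failure can be reproduced in a few lines. The sketch below is illustrative only: the rule list `NAIVE_RULES` and the function `naive_browser` are hypothetical names, and the user-agent strings are representative samples rather than strings taken from any particular log.

```python
import re

# Representative desktop Chrome and desktop Safari user-agent strings.
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
SAFARI_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15"
)

# Hypothetical naive rule set: the generic Safari token is tested first.
NAIVE_RULES = [
    ("Safari", re.compile(r"Safari")),
    ("Chrome", re.compile(r"Chrome")),
]


def naive_browser(ua: str) -> str:
    """Return the first rule name whose pattern appears anywhere in the string."""
    for name, pattern in NAIVE_RULES:
        if pattern.search(ua):
            return name
    return "Other"


print(naive_browser(CHROME_UA))     # Safari (wrong: this is Chrome)
print(naive_browser(SAFARI_UA))     # Safari
print(naive_browser("curl/8.4.0"))  # Other
```

The Chrome string is labelled Safari because it also carries a Safari token and that rule runs first. No error is raised, which is why this kind of bug tends to go unnoticed.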
Why DIY regex rots
User agents change continuously: new browsers launch, version numbers increment, and User-Agent reduction trims detail from the string. A static pattern set cannot keep up without ongoing maintenance, and the breakage is silent: numbers still appear, they are just wrong. Order-of-match bugs are especially common because overlapping tokens make the sequence of checks significant.
Prefer a maintained, community-updated user-agent parser that already handles the overlap and the long tail. Better still, where you only need OS family or form factor, read low-entropy Client Hints; and where you need a capability, use feature detection instead of parsing at all.
- Match order matters: check specific tokens before generic ones
- Shared Chromium tokens cause Chrome-as-Safari false matches
- Static patterns rot silently as new browsers appear
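A minimal sketch of the "specific before generic" ordering follows. It is not a substitute for a maintained parser: the rule list is deliberately short, the names `ORDERED_RULES` and `classify` are hypothetical, and the sample strings are representative only.

```python
import re
from typing import Optional, Tuple

# Order matters: browser-specific tokens first, shared tokens last.
# This list is intentionally incomplete; it only illustrates the ordering.
ORDERED_RULES = [
    ("Edge", re.compile(r"\bEdg/(\d+)")),
    ("Opera", re.compile(r"\bOPR/(\d+)")),
    ("Firefox", re.compile(r"\bFirefox/(\d+)")),
    ("Chrome", re.compile(r"\bChrome/(\d+)")),
    # Safari is checked last and requires its Version/ token as well.
    ("Safari", re.compile(r"\bVersion/(\d+).*\bSafari/")),
]


def classify(ua: str) -> Tuple[str, Optional[int]]:
    """Return (browser family, major version) for the first matching rule."""
    for name, pattern in ORDERED_RULES:
        match = pattern.search(ua)
        if match:
            return name, int(match.group(1))
    return "Other", None


EDGE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
)
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
SAFARI_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15"
)

print(classify(EDGE_UA))    # ('Edge', 120)
print(classify(CHROME_UA))  # ('Chrome', 120)
print(classify(SAFARI_UA))  # ('Safari', 17)
```

Even with the order corrected, anything the list did not anticipate still falls through to Other. Chrome on iOS, for example, identifies itself with a CriOS token rather than Chrome. That gap is the maintenance burden a maintained parser takes on.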
How it appears in analytics and logs
Misclassified browser/OS data in analytics often traces back to brittle regex — for example Chrome counted as Safari, or a new browser bucketed as Other — because the pattern did not account for shared tokens and evolving strings.
Diagnostic use case
Avoid the common failure modes of DIY user-agent regex and choose a more robust approach to extracting browser, OS, or device context.
What WebmasterID can help detect
WebmasterID classifies user agents with maintained logic and matches stable, specific product tokens, avoiding the false positives that naive regex produces.
Common mistakes
- Matching the Safari token before the Chrome token and mislabelling Chrome.
- Treating the user agent as a structured field instead of overlapping tokens.
- Shipping a static regex set with no plan to maintain it.
Privacy and accuracy notes
This is about parsing technique, not visitor data. Whatever the method, keep extraction coarse — browser/OS family and form factor — and avoid assembling high-entropy detail into identifiers.
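Where OS family and form factor are all that is needed, the low-entropy Client Hints headers `Sec-CH-UA-Platform` and `Sec-CH-UA-Mobile` can be read directly, with no user-agent parsing. The sketch below assumes the request headers are available as a plain mapping. `coarse_context` is a hypothetical helper name. Not every browser sends these headers, so missing values come back as `None`.

```python
from typing import Dict, Mapping, Optional


def coarse_context(headers: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Derive OS family and form factor from low-entropy Client Hints only."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}

    platform = lowered.get("sec-ch-ua-platform")
    mobile = lowered.get("sec-ch-ua-mobile")

    # Sec-CH-UA-Platform arrives as a quoted string, e.g. "Windows".
    os_family = platform.strip().strip('"') if platform else None

    # Sec-CH-UA-Mobile is a boolean written as ?1 (mobile) or ?0 (not mobile).
    if mobile is None:
        form_factor = None
    else:
        form_factor = "mobile" if mobile.strip() == "?1" else "non-mobile"

    # Only these two coarse values are kept; nothing high-entropy is requested.
    return {"os_family": os_family or None, "form_factor": form_factor}


print(coarse_context({
    "Sec-CH-UA-Platform": '"Android"',
    "Sec-CH-UA-Mobile": "?1",
}))
# {'os_family': 'Android', 'form_factor': 'mobile'}
```

Returning only two coarse fields keeps the stored result at the level of family and form factor, in line with the note above.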
Frequently asked questions
- Is it ever fine to parse user agents with regex?
- For a narrow, well-tested match on a single stable product token it can be fine. For general browser/OS/device classification, a maintained parser, Client Hints, or feature detection are far more reliable than hand-rolled patterns.
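As an illustration of what "narrow and well-tested" might look like, the sketch below matches a single product token, `Edg/`, and checks it against near-miss strings. The function name `edge_major` is hypothetical, and the sample strings are representative rather than exhaustive.

```python
import re
from typing import Optional

# One stable product token, anchored at a word boundary, with its major version.
EDG_TOKEN = re.compile(r"\bEdg/(\d+)")


def edge_major(ua: str) -> Optional[int]:
    """Return the major version if the Edg/ token is present, else None."""
    match = EDG_TOKEN.search(ua)
    return int(match.group(1)) if match else None


EDGE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0"
)
# Near misses: a legacy-style "Edge/" token and a plain Chrome string.
LEGACY_EDGE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)
CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)

print(edge_major(EDGE_UA))         # 120
print(edge_major(LEGACY_EDGE_UA))  # None
print(edge_major(CHROME_UA))       # None
```

Keeping the near-miss strings as fixtures is what makes the match "well-tested": if the pattern is ever loosened, the negative cases fail.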
Related pages
- How to parse user agents safely
Parsing user agents by hand with regular expressions is fragile and breaks as strings evolve. The safer approach is to use a maintained UA library, store a coarse category rather than each visitor's raw string, and treat the result as a hint, not an identity. This page sets out a privacy-safe parsing approach.
- User agent sniffing pitfalls
User-agent sniffing means changing site behaviour based on substrings in the User-Agent header. It is fragile: it misfires on new or unexpected browsers, breaks as user agents are reduced, and is easily defeated by spoofing. Feature detection and Client Hints are more robust approaches for most cases.
- User agent history and evolution
Modern user-agent strings are stuffed with historical tokens — Mozilla/5.0, AppleWebKit, KHTML, like Gecko, Safari — that no longer mean what they say. They accumulated as browsers copied each other's tokens to pass server-side sniffing. Understanding this history explains why today's strings are misleading and why feature detection and Client Hints are preferred.
- WebmasterID docs
How user-agent classification is kept robust against token overlap.
Sources and verification notes
- MDN: Browser detection using the user agent (discouraged). Documents why UA parsing is fragile; specific regex failures described from common practice.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.