How to block the SimilarWeb crawler
SimilarWeb operates a crawler that gathers public web data for its market-intelligence and traffic-estimation products. It is a declared crawler with a documented robots.txt token, so operators who do not want their pages crawled for competitive-analytics datasets can disallow it. This page shows the token to target and the rule to use.
What this means
SimilarWeb gathers publicly available web data to power its traffic-estimation and market-intelligence products. Part of that data collection involves a crawler that fetches public pages. If you do not want your content used for competitive-analytics datasets, you can ask the crawler to stay out via robots.txt.
A block does not remove your site from estimates SimilarWeb derives from other data sources; it only asks the crawler to stop fetching your pages directly. robots.txt is a request honoured by compliant crawlers, not an access-control mechanism.
How to block it
Target the SimilarWeb crawler token in its own user-agent group. Match on the stable token rather than a full version string, because the version component changes over time.
User-agent: SimilarWebBot
Disallow: /
Place this group alongside your other rules. Because robots.txt is advisory, verify in your logs that requests carrying the token stop after the change; if they continue, the source may be a non-compliant client impersonating the token, which a firewall rule would address instead.
- robots.txt token to target: SimilarWebBot
- User agent contains the token plus a self-identifying SimilarWeb URL
- robots.txt is a request to compliant crawlers, not enforcement
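As a rough sanity check before deploying, the rule above can be run through a robots.txt parser. The sketch below uses Python's standard-library urllib.robotparser. The `User-agent: *` group, the `ExampleBot` name, and the example.com URL are placeholders standing in for your other rules, and Python's matching approximates RFC 9309 rather than reproducing SimilarWeb's own parser, so treat the result as an indication only.

```python
from urllib.robotparser import RobotFileParser

# Rule from this page plus a placeholder catch-all group (assumption:
# your real file will contain different "other rules").
ROBOTS_TXT = """\
User-agent: SimilarWebBot
Disallow: /

User-agent: *
Disallow:
"""


def is_blocked(robots_txt: str, token: str, url: str) -> bool:
    """Return True if the given user-agent token may not fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the bare token, not a full user-agent string: the parser
    # only compares the part before the first "/".
    return not parser.can_fetch(token, url)


if __name__ == "__main__":
    url = "https://example.com/pricing"  # placeholder URL
    print("SimilarWebBot blocked:", is_blocked(ROBOTS_TXT, "SimilarWebBot", url))
    print("ExampleBot blocked:", is_blocked(ROBOTS_TXT, "ExampleBot", url))
```

This only confirms that the file says what you intend; whether the crawler honours it still has to be checked in your logs.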
How it appears in analytics and logs
A request carrying the SimilarWebBot token presents itself as SimilarWeb's crawler fetching a URL for analytics datasets, not as a human visit. Treat it as bot traffic. The user agent is only a claim, so the token alone does not prove the request came from SimilarWeb; either way, sustained activity from the token is crawl coverage, not audience.
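To see whether requests carrying the token tail off after the rule is in place, one option is to count them per day in the access log. The following is a minimal sketch that assumes the common "combined" log format, with the user agent as the last quoted field. The sample lines, IP addresses, and the user-agent string are invented for illustration and are not SimilarWeb's actual user agent.

```python
import re
from collections import Counter

TOKEN = "similarwebbot"  # matched case-insensitively against the user agent

# Assumes combined log format: [day/month/year:time zone] ... "user agent"
LINE_RE = re.compile(r'\[(?P<day>[^:\]]+):[^\]]*\].*"(?P<ua>[^"]*)"\s*$')


def crawler_hits_per_day(lines):
    """Count requests per day whose user agent contains the token."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and TOKEN in match.group("ua").lower():
            hits[match.group("day")] += 1
    return hits


# Hypothetical log lines; the user-agent strings are placeholders.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Oct/2025:13:55:36 +0000] "GET /a HTTP/1.1" 200 512 "-" "SimilarWebBot/0.0 (+https://example.invalid/bot)"',
    '203.0.113.5 - - [10/Oct/2025:13:56:02 +0000] "GET /b HTTP/1.1" 200 734 "-" "SimilarWebBot/0.0 (+https://example.invalid/bot)"',
    '198.51.100.7 - - [11/Oct/2025:09:12:44 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for day, count in crawler_hits_per_day(SAMPLE_LOG).items():
        print(day, count)
```

A count that drops to zero after the change suggests the rule is being honoured. Because the match relies on the self-reported user agent, hits that persist may come from a client impersonating the token.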
Diagnostic use case
Stop SimilarWeb from crawling your site for its market-intelligence datasets, and confirm in your logs whether the crawler has honoured the rule.
What WebmasterID can help detect
WebmasterID classifies the SimilarWeb crawler server-side and surfaces its activity on the bot-intelligence surface, so you can confirm whether a robots.txt block is being respected without parsing raw server logs.
Common mistakes
- Assuming a robots.txt block removes your site from all SimilarWeb estimates — it only stops direct crawling.
- Counting SimilarWeb crawl hits as human sessions in analytics.
- Expecting robots.txt to enforce the block rather than request compliance.
Privacy and accuracy notes
Blocking SimilarWeb relies only on the request user-agent token. No human identity is involved — a crawler is not a person. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.
Related pages
- How to block AhrefsBot in robots.txt
AhrefsBot is the crawler Ahrefs uses to build its backlink and SEO index. This page gives the robots.txt rule to disallow it and notes that Ahrefs documents support for both robots.txt rules and the crawl-delay directive, so you can slow rather than fully block it.
- How to block SemrushBot in robots.txt
SemrushBot is the crawler Semrush uses to build its SEO datasets. Semrush documents several specialised sub-bots under related tokens, so this page covers the base disallow rule and explains why you may need to target multiple tokens to cover the activity you care about.
- robots.txt vs a firewall/WAF
robots.txt and a firewall/WAF solve different problems: robots.txt politely asks compliant crawlers what to skip, while a firewall or WAF actually blocks requests at the network or edge layer. This page contrasts the two, explains when each is appropriate, and warns against using robots.txt for jobs only enforcement can do.
- Bot intelligence
Deterministic categorisation of crawlers, search bots, and automation.
Sources and verification notes
- SimilarWeb Help Center: how data is collected
SimilarWeb documents data collection; the exact token should be confirmed against current crawler docs.
- Robots Exclusion Protocol (RFC 9309)
Defines how a user-agent group and Disallow rule are interpreted.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.