Robots & crawl control

How to block the SimilarWeb crawler

SimilarWeb operates a crawler that gathers public web data for its market-intelligence and traffic-estimation products. It is a declared crawler with a documented robots.txt token, so operators who do not want their pages crawled for competitive-analytics datasets can disallow it. This page shows the token to target and the rule to use.

Partially verified

What this means

Part of SimilarWeb's data collection is a crawler that fetches public pages. If you do not want your content collected for competitive-analytics datasets, you can ask that crawler to stay out via robots.txt.

A block does not remove your site from estimates SimilarWeb derives from other data sources; it only asks the crawler to stop fetching your pages directly. robots.txt is a request honoured by compliant crawlers, not an access-control mechanism.

How to block it

Target the SimilarWeb crawler token in its own user-agent group. Match on the stable token rather than a full version string, because the version component changes over time.

User-agent: SimilarWebBot
Disallow: /

Place this group alongside your other rules. Because robots.txt is advisory, verify in your logs that requests carrying the token stop after the change; if they continue, the source may be a non-compliant client impersonating the token, which a firewall rule would address instead.
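The log check described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: it assumes access logs in Combined Log Format with English month abbreviations, and the function name count_token_hits, the sample lines, and the placeholder user-agent strings are all hypothetical (the real SimilarWeb user agent is not reproduced here).

```python
import re
from datetime import datetime, timezone

# Assumption: Combined Log Format, i.e. timestamp in [..] and the
# user agent as the last quoted field on the line.
LINE_RE = re.compile(r'\[(?P<ts>[^\]]+)\].*"(?P<ua>[^"]*)"\s*$')
TOKEN = "similarwebbot"  # matched case-insensitively against the user agent


def count_token_hits(lines, cutover):
    """Return (hits_before, hits_after) for requests carrying the token."""
    before = after = 0
    for line in lines:
        m = LINE_RE.search(line)
        if not m or TOKEN not in m.group("ua").lower():
            continue  # unparseable line or a different client
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        if ts < cutover:
            before += 1
        else:
            after += 1
    return before, after


# Hypothetical sample lines; the user-agent strings are placeholders.
SAMPLE = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "placeholder SimilarWebBot/0.0"',
    '203.0.113.5 - - [03/Jun/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "placeholder similarwebbot/0.0"',
    '198.51.100.7 - - [03/Jun/2026:11:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
CUTOVER = datetime(2026, 6, 2, tzinfo=timezone.utc)  # when robots.txt changed

before, after = count_token_hits(SAMPLE, CUTOVER)
print(f"before={before} after={after}")
```

Compliant crawlers may take some time to re-read robots.txt, so a nonzero count shortly after the change is not by itself evidence of impersonation.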

robots.txt token to target: SimilarWebBot
User agent contains the token plus a self-identifying SimilarWeb URL
robots.txt is a request to compliant crawlers, not enforcement
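Before deploying, the group can be sanity-checked offline with Python's standard urllib.robotparser module. The sketch below assumes the token is SimilarWebBot as stated above; the catch-all group, the example.com URL, and the SomeOtherBot name are illustrative only. It shows how a standards-following parser would read the rule, not how SimilarWeb's own crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: the SimilarWebBot group plus a catch-all group.
ROBOTS_TXT = """\
User-agent: SimilarWebBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Pass the bare token, not a full user-agent string: the parser matches
# on the product token before any "/version" suffix.
blocked = not parser.can_fetch("SimilarWebBot", "https://example.com/pricing")
others_ok = parser.can_fetch("SomeOtherBot", "https://example.com/pricing")

print("SimilarWebBot blocked:", blocked)
print("Other crawlers allowed:", others_ok)
```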

How it appears in analytics and logs

A request carrying the SimilarWebBot token is SimilarWeb's crawler fetching a URL for analytics datasets, not a human visit. Treat it as bot traffic. The user agent is a self-reported claim and can be spoofed, but either way sustained activity from the token reflects crawl coverage, not audience.
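As a rough illustration of keeping this traffic out of audience numbers, the sketch below partitions hit records by the token. The record shape (dicts with a user_agent key), the function name split_hits, and the placeholder user-agent strings are assumptions made for the example, not a real analytics schema.

```python
TOKEN = "similarwebbot"  # matched case-insensitively


def split_hits(hits):
    """Partition hit records into (token_hits, other_hits)."""
    token_hits, other_hits = [], []
    for hit in hits:
        ua = hit.get("user_agent", "").lower()
        (token_hits if TOKEN in ua else other_hits).append(hit)
    return token_hits, other_hits


# Hypothetical records; the user-agent strings are placeholders.
hits = [
    {"path": "/", "user_agent": "placeholder SimilarWebBot/0.0"},
    {"path": "/pricing", "user_agent": "Mozilla/5.0 (X11; Linux x86_64)"},
]
token_hits, other_hits = split_hits(hits)
print(len(token_hits), "token hits;", len(other_hits), "other hits")
```

The second bucket is not guaranteed to be human; it only excludes requests that announce the token.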

Diagnostic use case

Stop SimilarWeb from crawling your site for its market-intelligence datasets, and confirm in your logs whether the crawler has honoured the rule.

What WebmasterID can help detect

WebmasterID classifies the SimilarWeb crawler server-side and reports its activity in the bot-intelligence view, so you can confirm whether a robots.txt block is being respected without parsing raw server logs.

Common mistakes

Assuming a robots.txt block removes your site from all SimilarWeb estimates — it only stops direct crawling.
Counting SimilarWeb crawl hits as human sessions in analytics.
Expecting robots.txt to enforce the block rather than request compliance.

Privacy and accuracy notes

Blocking SimilarWeb relies only on the request user-agent token. No human identity is involved — a crawler is not a person. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.


Sources and verification notes

SimilarWeb Help Center: how data is collected. SimilarWeb documents data collection; the exact token should be confirmed against current crawler docs.
Robots Exclusion Protocol (RFC 9309). Defines how a user-agent group and Disallow rule are interpreted.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.