SimilarWeb crawler
SimilarWeb is a digital-intelligence company whose crawler fetches publicly accessible web pages as one input to its market-research, traffic-estimation, and competitive-analytics products. It is a data-collection crawler, not a search engine: it gathers signals about websites rather than building a public search index. SimilarWeb publishes a self-identifying crawler user-agent and a page describing the bot so operators can recognise and control it.
What this means
SimilarWeb sells digital market intelligence: estimated traffic, audience, and competitive metrics for websites and apps. Its crawler is one of many inputs into those estimates, fetching publicly accessible pages to read their structure and public signals.
This is not a search engine. SimilarWeb does not publish a consumer search index of your pages; it aggregates data for its analytics customers. Treat it as a research/data-collection crawler distinct from Googlebot or Bingbot.
How it identifies itself
SimilarWeb documents a self-identifying crawler user-agent containing a SimilarWeb token and a URL pointing at its bot documentation. Match on the stable token rather than a full version string, which changes over time.
As with any user-agent, the string is a claim and can be copied. Where authenticity matters, corroborate with request patterns rather than trusting the string alone.
- Operator: SimilarWeb (digital market intelligence)
- User agent contains a SimilarWeb token plus a bot-info URL
- Purpose: data collection for analytics, not public search
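A minimal sketch of matching on the stable token rather than a full version string. It assumes the token is the literal string SimilarWeb, as used in the robots.txt example on this page; confirm the exact value in SimilarWeb's bot documentation. The function name and both sample user-agent strings are invented for illustration and are not SimilarWeb's real strings.

```python
import re

# Assumed token; confirm the exact value in SimilarWeb's bot documentation.
SIMILARWEB_TOKEN = re.compile(r"similarweb", re.IGNORECASE)


def claims_similarweb(user_agent: str) -> bool:
    """Return True if the user-agent string claims to be SimilarWeb's crawler.

    This checks the claim only; anyone can copy the string.
    """
    return bool(SIMILARWEB_TOKEN.search(user_agent or ""))


# Invented sample strings, for illustration only.
samples = [
    "Mozilla/5.0 (compatible; SimilarWeb/0.0; +https://example.invalid/bot-info)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
]
results = [claims_similarweb(ua) for ua in samples]
```

Matching case-insensitively on the token alone keeps the check working when the version number or surrounding text changes.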
robots.txt considerations
SimilarWeb states its crawler honours robots.txt. To disallow it, target the documented SimilarWeb token:
User-agent: SimilarWeb
Disallow: /
Use the exact token published in SimilarWeb's bot documentation. robots.txt is a request honoured by compliant crawlers, not an access-control mechanism, and blocking the crawler does not remove your site from any estimates already derived from other inputs.
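One way to sanity-check such a rule before deploying it is Python's standard-library robots.txt parser. This is a sketch that again assumes the token is the literal string SimilarWeb; the example.com URL and the SomeOtherBot name are placeholders. It shows only how a compliant parser reads the rule, not how SimilarWeb's crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The rule under test; the token is assumed, confirm it in the bot documentation.
ROBOTS_TXT = """\
User-agent: SimilarWeb
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler identifying as SimilarWeb should be refused everywhere.
blocked = not parser.can_fetch("SimilarWeb", "https://example.com/pricing")

# Crawlers that match no group remain allowed by default.
others_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/pricing")
```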
How it appears in analytics and logs
A request carrying SimilarWeb's crawler identity means a web-intelligence platform fetched your page as one input to its analytics. It is data-collection bot traffic, not a human visit and not a search-index crawl; sustained activity reflects market-research coverage, not audience.
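To see which pages the crawler reached, and to keep those hits out of human session counts, access-log lines can be tallied by user-agent. The sketch below assumes logs in the common "combined" access-log format and the same assumed SimilarWeb token; the log lines, IP addresses, and user-agent strings are invented samples.

```python
import re
from collections import Counter

# Combined log format: "METHOD path PROTO" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_paths(lines, token="similarweb"):
    """Count requests per path whose user-agent contains the (assumed) token."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and token in match.group("ua").lower():
            counts[match.group("path")] += 1
    return counts


# Invented sample lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /pricing HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; SimilarWeb/0.0; +https://example.invalid/bot-info)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:03 +0000] "GET /about HTTP/1.1" 200 300 "-" '
    '"Mozilla/5.0 (compatible; SimilarWeb/0.0; +https://example.invalid/bot-info)"',
]
hits = crawler_paths(sample_log)
```

Lines that do not match the token are left out of the tally, so the result covers crawler fetches only and can be reported separately from audience figures.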
Diagnostic use case
Recognise SimilarWeb's web-intelligence crawler in logs, separate it from search indexing and SEO link crawlers, and decide robots.txt policy for a market-research data collector.
What WebmasterID can help detect
WebmasterID classifies the SimilarWeb crawler server-side as a web-intelligence data collector and reports its activity on the bot-intelligence surface, so you can see which pages it reached without parsing raw logs.
Common mistakes
- Treating SimilarWeb's crawler as a search engine that indexes your pages for end users.
- Counting market-intelligence crawl hits as human sessions in analytics.
- Guessing the robots.txt token instead of using SimilarWeb's documented one.
Privacy and accuracy notes
Identification uses only the request user-agent. A crawler is not a person, and no visitor identity is involved. WebmasterID records the fetch as a bot event, separate from human analytics, and never attaches it to a profile.
Related pages
- Web intelligence and traffic crawlers — overview
Web-intelligence and traffic crawlers fetch public pages to build market-research, traffic-estimation, and internet-measurement datasets rather than to power consumer search. This overview explains how to recognise them, why they are distinct from search and SEO crawlers, and how to set policy. They build private analytics or research datasets, so their crawling reflects measurement coverage rather than audience.
- Netcraft survey crawler
Netcraft is a security and internet-research company known for its long-running Web Server Survey, which measures the software, hosting, and configuration of public web servers across the internet. Its crawler fetches public endpoints to record server signals rather than to index page content for search. It appears in logs as periodic survey probes associated with Netcraft's research and anti-phishing operations.
- Search crawlers vs SEO crawlers
Search-engine crawlers like Googlebot and Bingbot build the indexes that determine search visibility. Third-party SEO crawlers like AhrefsBot and SemrushBot feed analysis tools and do not affect rankings directly. Distinguishing them matters for crawl-budget reasoning and for deciding what to allow or limit.
- Bot intelligence
Deterministic categorisation of crawlers, search bots, and data collectors.
Sources and verification notes
- SimilarWeb: crawler / bot information. Self-identifying crawler user-agent and robots.txt guidance documented.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.