CCBot — Common Crawl crawler
CCBot is the crawler operated by Common Crawl to build its open, freely available web dataset. That dataset is widely reused as a training source by many AI projects. Common Crawl documents the crawler and its robots.txt token, and CCBot honours robots.txt.
What this means
Common Crawl publishes the corpus that CCBot collects free of charge, and that corpus has become one of the most widely reused training sources across the AI ecosystem.
That reuse matters: allowing CCBot does not just feed one product. Because the resulting dataset is openly redistributed, your content may reach many downstream consumers. Disallowing CCBot asks Common Crawl not to include your pages in that corpus.
How CCBot identifies itself
CCBot uses the robots.txt user-agent token CCBot. Its user-agent string contains the CCBot token together with a self-identifying URL pointing at Common Crawl. Match on the stable token rather than a full version string.
The user-agent string is self-reported and can be spoofed by any client, so for requests that must be trusted, rely on Common Crawl's published guidance rather than on the user agent alone.
- robots.txt token: CCBot
- User agent contains the CCBot token plus a Common Crawl URL
- Feeds an open dataset reused widely as a training source
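Matching on the stable token can be sketched in Python as follows. This is a minimal illustration, not Common Crawl's or WebmasterID's implementation: the function name `claims_ccbot` is hypothetical, and the sample strings in the usage note are illustrative rather than copied from Common Crawl's documentation. A match only shows that a request claims to be CCBot; it does not verify the sender.

```python
import re

# Match the stable "CCBot" token case-insensitively, without pinning a version.
# The lookbehind/lookahead stop the pattern from matching inside a longer word.
_CCBOT_TOKEN = re.compile(r"(?<![A-Za-z0-9])CCBot(?![A-Za-z0-9])", re.IGNORECASE)


def claims_ccbot(user_agent: str) -> bool:
    """Return True if the user-agent string claims to be CCBot.

    Hypothetical helper: a True result is an unverified claim, because
    any client can send this string.
    """
    return bool(_CCBOT_TOKEN.search(user_agent or ""))
```

For example, `claims_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)")` is true, while a browser string such as `"Mozilla/5.0 (X11; Linux x86_64)"` is not matched.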
robots.txt considerations
CCBot honours robots.txt. To disallow it site-wide:
User-agent: CCBot
Disallow: /
This asks Common Crawl to exclude your site from its dataset going forward. robots.txt is a request honoured by compliant crawlers, not an enforcement boundary, and it does not retroactively remove content already published in past dataset releases.
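Before deploying a rule, it can help to check that it parses the way you intend. The sketch below uses Python's standard-library `urllib.robotparser` on the site-wide rule for CCBot. The helper name `is_allowed`, the `example.com` URLs, and the `ExampleBot` agent are assumptions for illustration. The check shows how a standards-following parser reads the file, not how any particular crawler behaves.

```python
from urllib.robotparser import RobotFileParser

# The site-wide disallow rule for CCBot, one directive per line.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper: parses the text in memory instead of fetching it.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # CCBot is disallowed everywhere; agents with no matching group are not.
    print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/any/page"))
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/any/page"))
```

Parsing the text in memory keeps the check independent of the live site, so a rule can be tested before it is published.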
How it appears in analytics and logs
A request carrying the CCBot token is Common Crawl's crawler fetching a URL for its public dataset — a bot event, not a human visit. Because that dataset is reused for AI training, allowing CCBot effectively makes content available to many downstream consumers.
Diagnostic use case
Confirm whether CCBot has fetched a page for the Common Crawl dataset, and set robots.txt policy for it if you do not want your content in that open corpus.
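If you do have raw access logs, the same check can be approximated by hand. The sketch below assumes the common "combined" access-log format used by many web servers; your server's format may differ. The names `ccbot_fetches`, `Fetch`, and `SAMPLE_LOG`, along with the sample log lines, are made up for illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple

# Assumed "combined" log format:
# ip ident user [time] "METHOD path PROTO" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


class Fetch(NamedTuple):
    time: str
    path: str
    status: int


def ccbot_fetches(lines: Iterable[str], path: str) -> Iterator[Fetch]:
    """Yield requests for `path` whose user agent contains the CCBot token."""
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not fit the assumed format
        if "ccbot" in m["agent"].lower() and m["path"] == path:
            yield Fetch(m["time"], m["path"], int(m["status"]))


# Illustrative log lines, not real traffic.
SAMPLE_LOG = [
    '192.0.2.10 - - [10/Jun/2026:12:00:00 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.11 - - [10/Jun/2026:12:00:05 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [10/Jun/2026:12:00:09 +0000] "GET /about HTTP/1.1" 200 2048 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]

if __name__ == "__main__":
    for fetch in ccbot_fetches(SAMPLE_LOG, "/pricing"):
        print(fetch)
```

As with any user-agent match, a hit here records a request that claims to be CCBot; it is not proof of origin.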
What WebmasterID can help detect
WebmasterID classifies CCBot server-side as an AI/dataset crawler and reports its activity on the bot-intelligence and AI-visibility surfaces, so you can see Common Crawl coverage of your pages without parsing logs.
Common mistakes
- Assuming that blocking CCBot retroactively removes your content from older Common Crawl releases. A block applies only going forward.
- Counting CCBot dataset crawls as human traffic.
- Overlooking that one open dataset feeds many downstream AI consumers.
Privacy and accuracy notes
Detection uses only the request user-agent. No human identity is involved — a crawler is not a person. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.
Related pages
- GPTBot — OpenAI's web crawler
GPTBot is the crawler OpenAI uses to fetch publicly available web content that may be used to help train its foundation models. It is a declared, well-documented crawler with a stable robots.txt token, and OpenAI publishes both documentation and an IP range list so operators can identify and control it.
- AI2Bot — Allen Institute for AI crawler
AI2Bot is the crawler operated by the Allen Institute for AI (AI2) to gather web data for its datasets and research. AI2 documents the crawler and its robots.txt token. Where a specific is not clearly covered it is marked partially verified rather than guessed.
- Bot intelligence
Deterministic categorisation of crawlers, search bots, and automation.
Sources and verification notes
- Common Crawl, CCBot documentation: documents CCBot, its token, and robots.txt handling.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.