CCBot — Common Crawl crawler

CCBot is the crawler operated by Common Crawl to build its open, freely available web dataset. That dataset is widely reused as a training source by many AI projects. Common Crawl documents the crawler and its robots.txt token, and CCBot honours robots.txt.

What this means

Common Crawl publishes the corpus that CCBot collects free of charge, which is why it has become one of the most widely reused training sources across the AI ecosystem.

That reuse matters: allowing CCBot does not just feed one product. Because the resulting dataset is openly redistributed, your content may reach many downstream consumers. Disallowing CCBot asks Common Crawl not to include your pages in that corpus.

How CCBot identifies itself

CCBot uses the robots.txt user-agent token CCBot. Its user-agent string contains the CCBot token together with a self-identifying URL pointing at Common Crawl. Match on the stable token rather than a full version string.

The user agent is a claim and can be copied, so for requests that must be trusted, rely on Common Crawl's published guidance rather than the user agent alone.

robots.txt token: CCBot
User agent contains the CCBot token plus a Common Crawl URL
Feeds an open dataset reused widely as a training source
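
As a minimal sketch of matching on the stable token, the Python below checks whether a user-agent string claims to be CCBot without depending on a version number. The function name claims_ccbot and the sample strings are hypothetical and chosen for illustration, and the case-insensitive, whole-token match is an assumption rather than a rule Common Crawl publishes. A match only shows what the request claims, so it is not proof of origin.

```python
import re

# Assumption: treat "CCBot" as a whole product token, matched
# case-insensitively, and ignore any version suffix such as "/2.0".
CCBOT_TOKEN = re.compile(r"(?<![A-Za-z0-9])CCBot(?![A-Za-z0-9])", re.IGNORECASE)


def claims_ccbot(user_agent: str) -> bool:
    """Return True if the user-agent string claims to be CCBot.

    This inspects the claim only; a user agent can be copied by anyone.
    """
    return bool(CCBOT_TOKEN.search(user_agent or ""))


# Illustrative sample strings (not taken from real traffic).
print(claims_ccbot("CCBot/2.0 (https://commoncrawl.org/faq/)"))
print(claims_ccbot("Mozilla/5.0 (X11; Linux x86_64)"))
```

Matching the token rather than the full string keeps the check working if the version or the embedded URL changes.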

robots.txt considerations

CCBot honours robots.txt. To disallow it site-wide:

User-agent: CCBot
Disallow: /

This asks Common Crawl to exclude your site from its dataset going forward. robots.txt is a request honoured by compliant crawlers, not an enforcement boundary, and it does not retroactively remove content already published in past dataset releases.
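
One way to sanity-check a rule like this before deploying it is to parse it with Python's standard-library urllib.robotparser and ask whether a given agent may fetch a URL. This is a sketch: the helper name is_allowed and the example.com URLs are hypothetical, and the standard-library parser only approximates how any particular crawler interprets robots.txt.

```python
from urllib.robotparser import RobotFileParser

# The site-wide disallow rule for CCBot, one directive per line.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if robots_txt permits agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# CCBot is disallowed everywhere; an agent with no matching group is not.
print(is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page"))
print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

A check like this confirms only that the file says what you intended. Whether the rule is honoured depends on the crawler.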

How it appears in analytics and logs

A request carrying the CCBot token is Common Crawl's crawler fetching a URL for its public dataset — a bot event, not a human visit. Because that dataset is reused for AI training, allowing CCBot effectively makes content available to many downstream consumers.

Diagnostic use case

Confirm whether CCBot has fetched a page for the Common Crawl dataset, and set robots.txt policy for it if you do not want your content in that open corpus.
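
If you do want to check raw logs yourself, the sketch below counts requests per path whose user agent carries the CCBot token. It assumes the Apache/Nginx "combined" access-log format; the function name ccbot_hits and the sample lines are hypothetical, and a different log format would need a different pattern. As with any user-agent check, a hit shows a request that claims to be CCBot.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: "combined" log format, i.e.
# ip - user [time] "METHOD path proto" status bytes "referer" "user-agent"
LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def ccbot_hits(lines: Iterable[str]) -> Counter:
    """Count requests per path whose user agent contains the CCBot token."""
    hits: Counter = Counter()
    for line in lines:
        match = LINE.search(line)
        if match and "ccbot" in match.group("ua").lower():
            hits[match.group("path")] += 1
    return hits


# Illustrative sample lines (documentation IP range, made-up requests).
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2026:00:00:01 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.20 - - [01/Jan/2026:00:00:02 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [01/Jan/2026:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]

for path, count in ccbot_hits(SAMPLE).items():
    print(path, count)
```

Keeping these hits in their own counter, separate from page-view totals, avoids mixing crawler fetches into human traffic figures.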

What WebmasterID can help detect

WebmasterID classifies CCBot server-side as an AI/dataset crawler and surfaces its activity on the bot-intelligence and AI-visibility surfaces, so you can see Common Crawl coverage of your pages without parsing logs.

Common mistakes

Assuming blocking CCBot retroactively removes your content from older Common Crawl releases — it applies going forward.
Counting CCBot dataset crawls as human traffic.
Overlooking that one open dataset feeds many downstream AI consumers.

Privacy and accuracy notes

Detection uses only the request user-agent. No human identity is involved — a crawler is not a person. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.

Sources and verification notes

Common Crawl, CCBot documentation. Documents CCBot, its token, and robots.txt handling.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.