How to block CCBot (Common Crawl)
CCBot is the crawler operated by Common Crawl, a non-profit that publishes a large open web-crawl dataset reused by many downstream projects, including some AI training pipelines. This page gives the robots.txt rule to disallow CCBot and explains why blocking it affects that dataset specifically.
What CCBot is
CCBot is the crawler for Common Crawl, a non-profit that builds and freely publishes a large corpus of web data. Because the corpus is open, it is reused widely, both for research and, in some cases, as an input to AI training pipelines. Blocking CCBot therefore reduces your presence in that upstream dataset rather than blocking any single AI company directly.
The rule
Common Crawl documents that CCBot honours robots.txt. To disallow it site-wide:
User-agent: CCBot
Disallow: /
As with all robots.txt rules, this is a request that a compliant crawler honours; it cannot force compliance, and it does not remove data already collected and published in earlier crawl snapshots.
- Token: CCBot
- Affects the Common Crawl open dataset
- Does not retroactively delete past snapshots
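Before publishing the rule, it can help to check that it parses the way you expect. The following is a minimal sketch using Python's standard-library urllib.robotparser; the example.com URL and the SomeOtherBot name are placeholders chosen for illustration, and the standard-library parser is only an approximation of how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, on two separate lines as robots.txt requires.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are hypothetical placeholders.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")

print("CCBot allowed:", ccbot_allowed)
print("Other crawler allowed:", other_allowed)
```

With only this rule in the file, the check reports CCBot as disallowed and leaves other user agents unaffected, which matches the point that the rule targets Common Crawl alone.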
How it appears in analytics and logs
A request whose User-Agent header carries the CCBot token is Common Crawl fetching a URL for its public dataset. Blocking it asks that crawler to stop; it does not retroactively remove already-published crawl data.
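If you have raw access logs, a short script can show which paths were requested with the CCBot token. This is a sketch under stated assumptions: the log is in Combined Log Format (user agent as the last quoted field), and the sample lines, IP addresses, and user-agent strings below are illustrative rather than taken from real traffic. A user-agent string can be set by any client, so a match shows the token was present, not who sent the request.

```python
import re
from collections import Counter

# Assumes Combined Log Format: "METHOD PATH PROTO" STATUS SIZE "REFERER" "USER-AGENT"
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def ccbot_paths(lines):
    """Count requested paths for log lines whose user agent contains 'CCBot'."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "ccbot" in match.group("agent").lower():
            hits[match.group("path")] += 1
    return hits


# Illustrative sample lines (hypothetical addresses and user-agent strings).
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 42 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '203.0.113.5 - - [01/Jan/2025:00:00:02 +0000] "GET /article HTTP/1.1" 200 5120 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '198.51.100.7 - - [01/Jan/2025:00:00:03 +0000] "GET /article HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

hits = ccbot_paths(SAMPLE)
for path, count in hits.items():
    print(path, count)
```

After a block is in place, a compliant crawler would be expected to keep requesting /robots.txt while requests for other paths stop appearing.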
Diagnostic use case
Keep your content out of future Common Crawl snapshots by disallowing CCBot, which reduces one common upstream source for downstream reuse.
What WebmasterID can help detect
WebmasterID classifies CCBot as a crawler distinct from human traffic, so you can see its activity and confirm a block took effect for the compliant crawler.
Common mistakes
- Assuming blocking CCBot blocks every AI crawler; it only affects Common Crawl.
- Expecting it to delete data already published in past crawl archives.
- Misspelling the token; it must be spelled exactly CCBot.
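The misspelling mistake can be caught mechanically. The sketch below scans the User-agent lines of a robots.txt file and flags tokens that look like CCBot but are not. The similarity threshold of 0.7 and the function name are arbitrary choices made for illustration, and the comparison ignores letter case on the assumption that crawlers match the token case-insensitively.

```python
from difflib import SequenceMatcher

EXPECTED = "CCBot"


def suspicious_tokens(robots_txt, expected=EXPECTED, threshold=0.7):
    """Return User-agent tokens that resemble `expected` but do not equal it."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "user-agent":
            continue
        token = value.strip()
        if token.lower() == expected.lower():
            continue  # correctly spelled (case ignored by assumption)
        ratio = SequenceMatcher(None, token.lower(), expected.lower()).ratio()
        if ratio >= threshold:  # threshold is an illustrative choice
            found.append(token)
    return found


# Hypothetical file with one misspelled token and one unrelated token.
EXAMPLE = """\
User-agent: CC-Bot
Disallow: /

User-agent: GPTBot
Disallow: /
"""

flagged = suspicious_tokens(EXAMPLE)
print(flagged)
```

Here the hyphenated CC-Bot entry is flagged as a likely misspelling, while the unrelated GPTBot entry is left alone.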
Privacy and accuracy notes
Blocking CCBot is a publishing-policy choice expressed in a public file. It involves no visitor data.
Related pages
- Writing an AI crawler policy for robots.txt
An AI crawler policy is a deliberate decision about which AI-related tokens you allow and which you disallow in robots.txt. This page offers a structured way to make and document those choices, while staying realistic: robots.txt is a request to compliant crawlers, not a legal or technical guarantee.
- How to block GPTBot in robots.txt
If you do not want OpenAI's training crawler fetching your site, you can disallow GPTBot in robots.txt. This page gives the exact rule, clarifies that it does not affect ChatGPT-User or OAI-SearchBot, and is honest about the limits of robots-based blocking.
- Bot intelligence
See CCBot activity separate from human traffic.
Sources and verification notes
- Common Crawl: CCBot and robots.txt. Documents the CCBot token and robots.txt handling.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.