
How to block CCBot (Common Crawl)

CCBot is the crawler operated by Common Crawl, a non-profit that publishes a large open web-crawl dataset reused by many downstream projects, including some AI training pipelines. This page gives the robots.txt rule to disallow CCBot and explains why blocking it affects that dataset specifically.

Verified against primary sources

What CCBot is

Common Crawl builds its corpus from pages fetched by CCBot and publishes that corpus freely. Because the corpus is open, it is reused widely: for research and, in some cases, as an input to AI training pipelines. Blocking CCBot therefore reduces your presence in that upstream dataset; it does not block any single AI company directly.

The rule

Common Crawl documents that CCBot honours robots.txt. To disallow it site-wide:

User-agent: CCBot
Disallow: /

As with all robots.txt rules, this is a request rather than an enforcement mechanism: a compliant crawler honours it, but the file itself cannot force compliance. It also does not remove data already collected and published in earlier crawl snapshots.
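
One way to sanity-check the rule before deploying it is to parse it locally. The sketch below uses Python's standard urllib.robotparser module. The example.com URL and the second crawler name (ExampleBot) are illustrative assumptions, and Python's parser is only an approximation of how any particular crawler reads the file.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, one directive per line.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/any/page"  # hypothetical URL
    print("CCBot allowed:", is_allowed(ROBOTS_TXT, "CCBot", page))
    # ExampleBot is a made-up crawler name with no matching rule.
    print("ExampleBot allowed:", is_allowed(ROBOTS_TXT, "ExampleBot", page))
```

With only this rule in the file, CCBot is disallowed everywhere while crawlers without a matching group remain allowed by default.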

How it appears in analytics and logs

A request whose User-Agent header carries the CCBot token indicates Common Crawl fetching a URL for its public dataset. Blocking it asks that crawler to stop; it does not retroactively remove already-published crawl data.
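
To look for such requests in raw server logs, a minimal sketch follows. It assumes the common "combined" access-log format, where the user agent is the last quoted field; the sample lines, IP addresses, and the exact user-agent string are hypothetical and shown for illustration only. A matching token shows what the client claimed to be, not proof of who sent the request.

```python
import re

# Assumption: "combined" access-log format, with the user agent as the
# last double-quoted field on each line.
LINE_RE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def ccbot_paths(log_lines):
    """Return the request paths of lines whose user agent contains 'CCBot'."""
    paths = []
    for line in log_lines:
        match = LINE_RE.search(line.strip())
        if match and "ccbot" in match.group("agent").lower():
            paths.append(match.group("path"))
    return paths


# Hypothetical sample lines, not taken from a real server.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2026:10:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 42 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.11 - - [01/Jan/2026:10:00:05 +0000] "GET /about HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [01/Jan/2026:10:00:09 +0000] "GET /blog/post HTTP/1.1" 200 9001 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]

if __name__ == "__main__":
    for path in ccbot_paths(SAMPLE):
        print(path)
```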

Diagnostic use case

Disallow CCBot to keep your content out of future Common Crawl snapshots. This removes one common upstream source from which your pages could be reused downstream.

What WebmasterID can help detect

WebmasterID classifies CCBot as a crawler distinct from human traffic, so you can see its activity and confirm a block took effect for the compliant crawler.
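
Independently of any tool, the same confirmation can be approximated by hand from a list of CCBot requests. The sketch below is not WebmasterID's method; the function name, the input shape, and the 24-hour grace period are all assumptions chosen for illustration. Fetches of /robots.txt are ignored because a compliant crawler has to read that file to learn about the rule.

```python
from datetime import datetime, timedelta, timezone


def requests_after_block(ccbot_hits, block_deployed_at, grace=timedelta(hours=24)):
    """Return CCBot hits that suggest the block is not (yet) being followed.

    ccbot_hits: iterable of (timestamp, path) tuples for CCBot requests.
    grace: arbitrary placeholder allowing for a cached copy of robots.txt.
    """
    cutoff = block_deployed_at + grace
    return [
        (ts, path)
        for ts, path in ccbot_hits
        if ts > cutoff and path != "/robots.txt"
    ]


if __name__ == "__main__":
    deployed = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)  # hypothetical
    hits = [
        (datetime(2026, 1, 1, 13, 0, tzinfo=timezone.utc), "/blog/post"),   # within grace
        (datetime(2026, 1, 3, 9, 0, tzinfo=timezone.utc), "/robots.txt"),   # expected
        (datetime(2026, 1, 3, 9, 5, tzinfo=timezone.utc), "/blog/post"),    # unexpected
    ]
    for ts, path in requests_after_block(hits, deployed):
        print(ts.isoformat(), path)
```

An empty result is consistent with the block being followed; it does not prove it, since the crawler may simply not have visited during the period covered by the logs.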

Common mistakes

Privacy and accuracy notes

Blocking CCBot is a publishing-policy choice expressed in a public file. It involves no visitor data.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.