How to block CCBot (Common Crawl)

CCBot is the crawler operated by Common Crawl, a non-profit that publishes a large open web-crawl dataset reused by many downstream projects, including some AI training pipelines. This page gives the robots.txt rule to disallow CCBot and explains why blocking it affects that dataset specifically.

What CCBot is

CCBot is the crawler for Common Crawl, a non-profit that builds and freely publishes a large corpus of web data. Because the corpus is open, it is reused widely: for research and, in some cases, as an input to AI training pipelines. Blocking CCBot therefore reduces your presence in that upstream dataset; it does not block any individual AI company's own crawler.

The rule

Common Crawl documents that CCBot honours robots.txt. To disallow it site-wide:

User-agent: CCBot
Disallow: /

As with all robots.txt rules, this is a request that a compliant crawler honours; the file itself cannot force compliance. It also does not remove data already collected and published in earlier crawl snapshots.

Token: CCBot
Affects the Common Crawl open dataset
Does not retroactively delete past snapshots
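One way to sanity-check the rule before publishing it is to run it through a robots.txt parser. The following is a minimal sketch using Python's standard-library urllib.robotparser; the helper name is_allowed, the second user agent SomeOtherBot, and the example.com URL are illustrative assumptions, and the result shows only how this particular parser reads the file, not how CCBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The site-wide rule from this page, as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser would let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# CCBot is matched by the group above; the hypothetical SomeOtherBot is not.
ccbot_allowed = is_allowed(ROBOTS_TXT, "CCBot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page")

print("CCBot allowed:", ccbot_allowed)
print("SomeOtherBot allowed:", other_allowed)
```

Under these assumptions the check reports CCBot as disallowed and the other agent as allowed, which illustrates that the rule targets Common Crawl only.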

How it appears in analytics and logs

A request whose User-Agent header carries the CCBot token indicates Common Crawl fetching a URL for its public dataset. Blocking it asks that crawler to stop future fetches; it does not remove crawl data that has already been published.
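To see such requests in raw server logs, a simple scan for the token in the user-agent field is often enough. The sketch below assumes the common "combined" access-log format, where the user agent is the last quoted field; the sample lines, IP addresses, and the function name count_ccbot_requests are made up for illustration, and the user-agent string shown is an assumption about what CCBot sends. A token match alone does not prove the request came from Common Crawl, since user-agent strings can be spoofed.

```python
import re

# Assumption: "combined" log format, with the user agent as the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_ccbot_requests(log_lines):
    """Count log lines whose user-agent field contains the CCBot token."""
    count = 0
    for line in log_lines:
        match = UA_FIELD.search(line)
        if match and "ccbot" in match.group(1).lower():
            count += 1
    return count


# Hypothetical sample lines; real logs would be read from a file instead.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.11 - - [01/Jan/2025:00:00:02 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '192.0.2.10 - - [01/Jan/2025:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 200 26 "-" "CCBot/2.0 (https://commoncrawl.org/faq/)"',
]

ccbot_hits = count_ccbot_requests(SAMPLE_LOG)
print("CCBot requests:", ccbot_hits)
```

Running the same count before and after publishing the rule gives a rough indication of whether the compliant crawler has stopped fetching pages other than robots.txt.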

Diagnostic use case

Keep your content out of future Common Crawl snapshots by disallowing CCBot, which removes one common upstream source for downstream reuse.

What WebmasterID can help detect

WebmasterID classifies CCBot as a crawler distinct from human traffic, so you can see its activity and confirm a block took effect for the compliant crawler.

Common mistakes

Assuming blocking CCBot blocks every AI crawler — it only affects Common Crawl.
Expecting it to delete data already published in past crawl archives.
Misspelling the token: it must be spelled CCBot, so variants such as CC-Bot or CCBots are not the documented token.
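The misspelling mistake in particular can be caught with a small lint pass over robots.txt. This is a rough sketch, not a full robots.txt parser: the function name find_suspect_tokens, the 0.8 similarity threshold, and the sample file (including the hypothetical ExampleBot) are all assumptions chosen for illustration. It compares tokens case-insensitively, on the assumption that crawlers match user-agent tokens without regard to case.

```python
import difflib

EXPECTED_TOKEN = "CCBot"
SIMILARITY_THRESHOLD = 0.8  # arbitrary cut-off chosen for this sketch


def find_suspect_tokens(robots_txt: str) -> list[str]:
    """Return User-agent values that look like misspellings of CCBot."""
    suspects = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "user-agent":
            continue
        token = value.strip()
        if token.lower() == EXPECTED_TOKEN.lower():
            continue  # spelled correctly (ignoring case)
        ratio = difflib.SequenceMatcher(
            None, token.lower(), EXPECTED_TOKEN.lower()
        ).ratio()
        if ratio >= SIMILARITY_THRESHOLD:
            suspects.append(token)
    return suspects


# Hypothetical file with a misspelled token and an unrelated crawler.
SAMPLE_ROBOTS = """\
User-agent: CC-Bot
Disallow: /

User-agent: ExampleBot
Disallow: /private/
"""

suspects = find_suspect_tokens(SAMPLE_ROBOTS)
print("Possible misspellings of CCBot:", suspects)
```

Here the near-miss CC-Bot is flagged while the unrelated ExampleBot group is left alone; the threshold may need tuning for other files.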

Privacy and accuracy notes

Blocking CCBot is a publishing-policy choice expressed in a public file. It involves no visitor data.

Sources and verification notes

Common Crawl: CCBot and robots.txt. Documents the CCBot token and robots.txt handling.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.