Robots & crawl control

How to block Diffbot in robots.txt

Diffbot operates a crawler that extracts structured data from web pages to build its Knowledge Graph and power data-extraction APIs. This page gives the robots.txt rule to disallow the Diffbot token and notes that Diffbot documents its crawler and robots.txt support.

Verified against primary sources

What Diffbot is

Diffbot runs a crawler that reads public web pages and extracts structured fields (articles, products, discussions), feeding its Knowledge Graph and developer APIs. Companies use that data for market intelligence, lead data, and AI training inputs. Disallowing Diffbot asks it to stop fetching your pages for that pipeline.

Diffbot publishes documentation describing its crawler, the robots.txt token, and how it honours disallow rules. Use that documentation rather than guessing the user-agent string.

The rule

To disallow Diffbot site-wide, target its token:

User-agent: Diffbot
Disallow: /

The user-agent string contains the Diffbot token plus a self-identifying URL. Match the stable token rather than the full string. Because Diffbot data is sometimes used as an AI-training input, operators who want to limit AI-related crawling often pair this rule with a broader AI-crawler policy. robots.txt is a request, not enforcement: it works only when the crawler chooses to honour it.

Token: Diffbot
Operator: Diffbot (Knowledge Graph and extraction APIs)
Often paired with broader AI-crawler policy
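
Before deploying the rule, it can be checked locally. The following is a minimal sketch using Python's standard-library urllib.robotparser; the helper name is_allowed, the agent name SomeOtherBot, and the example.com URL are illustrative assumptions, and the result describes how Python's parser reads the rule, not how Diffbot itself interprets it.

```python
from urllib.robotparser import RobotFileParser

# The site-wide disallow rule for the Diffbot token, as it would appear in robots.txt.
RULES = """\
User-agent: Diffbot
Disallow: /
"""


def is_allowed(rules: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL and second agent name, used only for illustration.
    print(is_allowed(RULES, "Diffbot", "https://example.com/page"))
    print(is_allowed(RULES, "SomeOtherBot", "https://example.com/page"))
```

Python's parser compares against the part of the user-agent before the first slash, so pass the bare token rather than a full user-agent string when testing this way.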

How it appears in analytics and logs

A request carrying the Diffbot token is Diffbot's extraction crawler fetching a URL to turn it into structured data. It is a bot event and does not represent an audience or referral.
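
As a rough illustration of separating these bot events from other traffic, the sketch below classifies access-log lines by the Diffbot token. It assumes the combined log format (user agent as the last quoted field); the sample lines, the ExampleAgent user-agent string, and the function names are made up for the example and are not Diffbot's actual user-agent.

```python
import re
from collections import Counter

# Case-insensitive match on the stable token only, not the full user-agent string.
DIFFBOT_TOKEN = re.compile(r"diffbot", re.IGNORECASE)

# Assumption: combined log format, where the user agent is the last quoted field.
LAST_QUOTED_FIELD = re.compile(r'"([^"]*)"\s*$')


def extract_user_agent(log_line: str) -> str:
    """Return the last quoted field of a log line, or an empty string if none."""
    match = LAST_QUOTED_FIELD.search(log_line)
    return match.group(1) if match else ""


def classify(log_line: str) -> str:
    """Label a log line 'diffbot' if its user agent carries the token, else 'other'."""
    user_agent = extract_user_agent(log_line)
    return "diffbot" if DIFFBOT_TOKEN.search(user_agent) else "other"


def summarize(log_lines) -> Counter:
    """Count log lines per label."""
    return Counter(classify(line) for line in log_lines)


# Hypothetical sample lines; the user-agent strings are invented for illustration.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" '
    '"ExampleAgent Diffbot (+https://example.invalid/bot)"',
    '203.0.113.9 - - [01/Jan/2026:10:00:05 +0000] "GET /b HTTP/1.1" 200 734 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

Matching on the token alone keeps the check working if the rest of the user-agent string changes, at the cost of trusting a self-reported header.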

Diagnostic use case

Disallow Diffbot when you do not want your pages parsed into Diffbot's Knowledge Graph or extraction datasets, which third parties query through its APIs.

What WebmasterID can help detect

WebmasterID classifies Diffbot by its token as a data/extraction crawler, separate from human analytics, so you can verify whether a disallow rule reduced its requests.
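
The same before-and-after check can be approximated from your own logs. This is a minimal sketch, not WebmasterID's method: it assumes you already have (date, user-agent) pairs parsed from logs, and the function name, sample events, and dates are hypothetical.

```python
from datetime import date


def counts_around_rule(events, rule_date):
    """Return (before, after) counts of Diffbot-token requests around rule_date.

    events is an iterable of (date, user_agent) pairs; requests on rule_date
    itself are counted as 'after'.
    """
    before = after = 0
    for day, user_agent in events:
        if "diffbot" not in user_agent.lower():
            continue  # ignore traffic that does not carry the token
        if day < rule_date:
            before += 1
        else:
            after += 1
    return before, after


# Hypothetical events; the user-agent strings are invented for illustration.
EVENTS = [
    (date(2026, 1, 1), "ExampleAgent Diffbot (+https://example.invalid/bot)"),
    (date(2026, 1, 2), "ExampleAgent Diffbot (+https://example.invalid/bot)"),
    (date(2026, 1, 3), "Mozilla/5.0 (X11; Linux x86_64)"),
    (date(2026, 1, 9), "ExampleAgent Diffbot (+https://example.invalid/bot)"),
]

if __name__ == "__main__":
    print(counts_around_rule(EVENTS, date(2026, 1, 5)))
```

Compare windows of equal length on either side of the rule date, since raw counts over unequal periods can mislead.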

Common mistakes

Assuming Diffbot is a search engine: blocking it does not affect search-engine indexing.
Misspelling the token: it must be exactly Diffbot.
Inventing IP ranges instead of confirming the effect in logs.

Privacy and accuracy notes

Blocking Diffbot is a publishing-policy choice in a public file. It involves no visitor data and is not access control.

Sources and verification notes

Diffbot crawler and robots.txt documentation: documents the Diffbot token and robots.txt handling.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.