How to block Diffbot in robots.txt
Diffbot operates a crawler that extracts structured data from web pages to build its Knowledge Graph and power data-extraction APIs. This page gives the robots.txt rule to disallow the Diffbot token and notes that Diffbot documents its crawler and robots.txt support.
What Diffbot is
Diffbot runs a crawler that reads public web pages and extracts structured fields — articles, products, discussions — feeding its Knowledge Graph and developer APIs. Companies use that data for market intelligence, lead data, and AI training inputs. Blocking Diffbot removes your pages from that pipeline.
Diffbot publishes documentation describing its crawler, the robots.txt token, and how it honours disallow rules. Use that documentation rather than guessing the user-agent string.
The rule
To disallow Diffbot site-wide, target its token:
User-agent: Diffbot
Disallow: /
The full user-agent string contains the Diffbot token plus a self-identifying URL. Match on the stable token rather than on the full string. Because Diffbot data is sometimes used as an AI-training input, operators who want to limit AI-related crawling often pair this rule with a broader AI-crawler policy. robots.txt is a request to compliant crawlers, not enforcement.
- Token: Diffbot
- Operator: Diffbot (Knowledge Graph and extraction APIs)
- Often paired with broader AI-crawler policy
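The rule can be sanity-checked locally before it is published. The sketch below uses Python's standard-library urllib.robotparser, which is a generic parser and not Diffbot's own implementation, so it only shows how a typical compliant parser would read the file. The helper name is_disallowed and the example.com URL are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page: each directive on its own line.
DIFFBOT_RULE = """\
User-agent: Diffbot
Disallow: /
"""


def is_disallowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if robots_txt asks the given token not to fetch url.

    Hypothetical helper for local checks; it reflects Python's parser only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # placeholder URL

    # The Diffbot token is disallowed; an unrelated token is unaffected.
    print(is_disallowed(DIFFBOT_RULE, "Diffbot", page))
    print(is_disallowed(DIFFBOT_RULE, "SomeOtherBot", page))

    # A misspelled token creates a group that does not apply to Diffbot.
    misspelled = "User-agent: Difbot\nDisallow: /\n"
    print(is_disallowed(misspelled, "Diffbot", page))

    # Both directives squeezed onto one line leave no Disallow rule to apply.
    one_line = "User-agent: Diffbot Disallow: /\n"
    print(is_disallowed(one_line, "Diffbot", page))
```

Only the bare token is passed to the parser here, which mirrors the advice to match the token rather than the full user-agent string.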
How it appears in analytics and logs
A request carrying the Diffbot token is Diffbot's extraction crawler fetching a URL to turn it into structured data. It is a bot event and does not represent an audience or referral.
Diagnostic use case
Disallow Diffbot when you do not want your pages parsed into Diffbot's Knowledge Graph or extraction datasets, which third parties query through its APIs.
What WebmasterID can help detect
WebmasterID classifies Diffbot by its token as a data/extraction crawler, separate from human analytics, so you can verify whether a disallow rule reduced its requests.
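If you prefer to confirm the effect from raw server logs, a per-day count of requests carrying the token is usually enough. The sketch below assumes the common "combined" access-log format, where the user-agent is the last quoted field. The function name daily_token_hits is hypothetical, and the sample lines use placeholder user-agent strings rather than Diffbot's real string.

```python
import re
from collections import Counter

# Day portion of a combined-log timestamp, e.g. [01/Jun/2026:10:00:00 +0000]
DAY_RE = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}):")
# Last quoted field on the line, which is the user-agent in combined format.
UA_RE = re.compile(r'"([^"]*)"\s*$')


def daily_token_hits(log_lines, token="Diffbot"):
    """Count requests per day whose user-agent contains the token.

    Matching is case-insensitive and limited to the user-agent field,
    so a URL path that happens to contain the token is not counted.
    """
    needle = token.lower()
    hits = Counter()
    for line in log_lines:
        ua = UA_RE.search(line)
        if not ua or needle not in ua.group(1).lower():
            continue
        day = DAY_RE.search(line)
        hits[day.group(1) if day else "unknown"] += 1
    return hits


# Invented sample data: IPs are documentation addresses and the
# user-agent strings are placeholders, not real crawler strings.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jun/2026:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "-" "placeholder-agent Diffbot (+https://example.invalid/bot)"',
    '203.0.113.5 - - [01/Jun/2026:10:00:05 +0000] "GET /b HTTP/1.1" 200 734 "-" "placeholder-agent Diffbot (+https://example.invalid/bot)"',
    '198.51.100.7 - - [01/Jun/2026:10:01:00 +0000] "GET /diffbot-review HTTP/1.1" 200 901 "-" "placeholder-browser"',
    '203.0.113.5 - - [02/Jun/2026:09:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "placeholder-agent Diffbot (+https://example.invalid/bot)"',
]

if __name__ == "__main__":
    for day, count in daily_token_hits(SAMPLE_LOG).items():
        print(day, count)
```

Compare the daily counts from before and after the rule was published. Crawlers commonly cache robots.txt, so any drop may lag the change.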
Common mistakes
- Assuming Diffbot is a search engine. Blocking it does not affect search-engine indexing.
- Misspelling the token — it must be exactly Diffbot.
- Inventing IP ranges instead of confirming the effect in logs.
Privacy and accuracy notes
Blocking Diffbot is a publishing-policy choice in a public file. It involves no visitor data and is not access control.
Related pages
- Diffbot — knowledge-graph crawler
Diffbot is a crawler operated by Diffbot that extracts structured data from web pages to build and maintain a knowledge graph. Diffbot documents its crawler and robots.txt token. It identifies itself with the Diffbot token plus a self-identifying URL.
- How to block ImagesiftBot in robots.txt
ImagesiftBot is the crawler associated with ImageSift, a project that indexes images found on the public web. This page gives the robots.txt rule to disallow the ImagesiftBot token and notes that ImageSift documents the crawler and robots.txt support.
- Writing an AI crawler policy for robots.txt
An AI crawler policy is a deliberate decision about which AI-related tokens you allow and which you disallow in robots.txt. This page offers a structured way to make and document those choices, while staying realistic: robots.txt is a request to compliant crawlers, not a legal or technical guarantee.
- Bot intelligence
See whether a Diffbot disallow changed its crawl activity.
Sources and verification notes
- Diffbot: crawler and robots.txt documentation
Documents the Diffbot token and robots.txt handling.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.