Diffbot — knowledge-graph crawler
Diffbot is the web crawler operated by the company of the same name. It extracts structured data from web pages to build and maintain a knowledge graph. Diffbot documents the crawler and its robots.txt token, and the crawler identifies itself with the Diffbot token plus a self-identifying URL.
What this means
Diffbot is a crawler that extracts structured data from web pages to build a knowledge graph — a machine-readable model of entities and their relationships. Rather than simply storing page text, it parses pages into structured records.
Diffbot documents the crawler and the robots.txt token operators can use to control it. Allowing Diffbot lets it extract structured data from your public pages; disallowing it asks the crawler to stay out.
How Diffbot identifies itself
Diffbot uses the robots.txt user-agent token Diffbot. Its user-agent string contains that token together with a self-identifying URL pointing at Diffbot's documentation. Match on the stable token rather than a full version string.
The user-agent string is self-reported and can be copied by any client. Where authenticity matters, follow Diffbot's published guidance, and do not rely on IP ranges that Diffbot has not published.
- robots.txt token: Diffbot
- Extracts structured data to build a knowledge graph
- User agent contains the token plus a Diffbot URL
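As a minimal sketch of matching on the stable token rather than a full version string, the check below does a case-insensitive search for the token in the User-Agent header. The function name `claims_diffbot` and the sample strings are hypothetical, chosen for illustration only; they are not Diffbot's actual user-agent string. A match only shows that the request claims to be Diffbot, not that it is.

```python
import re
from typing import Optional

# Assumption: a case-insensitive substring match on the documented
# robots.txt token is enough to recognise the claim.
DIFFBOT_TOKEN = re.compile(r"diffbot", re.IGNORECASE)


def claims_diffbot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header claims to be Diffbot.

    This checks the self-reported token only; it does not verify
    that the request really came from Diffbot.
    """
    if not user_agent:
        return False
    return DIFFBOT_TOKEN.search(user_agent) is not None


# Illustrative strings only (placeholder version and URL).
bot_ua = "Mozilla/5.0 (compatible; Diffbot/x.y; +https://example.invalid/bot)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
```

Because the pattern ignores the version segment, the same check should keep working if the version in the string changes.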
robots.txt considerations
Diffbot honours robots.txt. To disallow it site-wide:
User-agent: Diffbot
Disallow: /
This targets only Diffbot. robots.txt is a request honoured by compliant crawlers, not an access-control boundary.
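One way to sanity-check such a rule before deploying it is to parse it with Python's standard-library `urllib.robotparser` and ask what it permits. This is a sketch under stated assumptions: the helper name `is_allowed`, the `OtherBot` agent, and the example.com URL are hypothetical, and the result reflects how Python's parser reads the file, which may differ in edge cases from how any particular crawler reads it.

```python
from urllib.robotparser import RobotFileParser

# The site-wide disallow rule for the Diffbot token.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical URL and second agent, for illustration only.
diffbot_ok = is_allowed(ROBOTS_TXT, "Diffbot", "https://example.com/page")
other_ok = is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/page")
```

Here `diffbot_ok` is `False` and `other_ok` is `True`, which matches the statement that the rule targets only Diffbot.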
How it appears in analytics and logs
A request carrying the Diffbot token is Diffbot's crawler fetching a URL to extract structured data — a bot event, not a human visit. Treat sustained Diffbot activity as data-extraction crawl coverage rather than audience growth.
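If you work from raw logs, the separation described above could be sketched as follows. The code assumes you have already reduced each log entry to a (path, user agent) pair; the function name `split_hits` and the sample records are hypothetical. Hits carrying the token are tallied per path, and everything else is only counted, since a request without the token is not necessarily human either.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def split_hits(
    records: Iterable[Tuple[str, Optional[str]]],
) -> Tuple[Counter, int]:
    """Split (path, user_agent) records into Diffbot-claimed hits and the rest.

    Returns a per-path Counter of hits whose user agent contains the
    Diffbot token, plus a count of all other hits.
    """
    diffbot_paths: Counter = Counter()
    other_hits = 0
    for path, user_agent in records:
        if "diffbot" in (user_agent or "").lower():
            diffbot_paths[path] += 1
        else:
            other_hits += 1
    return diffbot_paths, other_hits


# Hypothetical sample records (placeholder user-agent strings).
BOT_UA = "Mozilla/5.0 (compatible; Diffbot/x.y; +https://example.invalid/bot)"
sample = [
    ("/pricing", BOT_UA),
    ("/pricing", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
    ("/docs", BOT_UA),
    ("/pricing", BOT_UA),
]
diffbot_paths, other_hits = split_hits(sample)
```

Checking whether a path appears in `diffbot_paths` indicates whether a request claiming to be Diffbot fetched that page in the sampled period.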
Diagnostic use case
Confirm whether Diffbot has crawled a given page for structured-data extraction, then decide on a robots.txt policy for the crawler.
What WebmasterID can help detect
WebmasterID classifies Diffbot server-side by its token and surfaces its activity on the bot-intelligence surface, so you can see structured-data crawl coverage of your pages without parsing logs.
Common mistakes
- Trusting the Diffbot user agent without verification where authenticity matters.
- Counting Diffbot extraction crawls as human traffic.
- Expecting robots.txt to enforce access rather than request compliance.
Privacy and accuracy notes
Detection uses only the request user-agent. No human identity is involved — a crawler is not a person. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.
Related pages
- AI2Bot — Allen Institute for AI crawler
AI2Bot is the crawler operated by the Allen Institute for AI (AI2) to gather web data for its datasets and research. AI2 documents the crawler and its robots.txt token. Where a specific is not clearly covered it is marked partially verified rather than guessed.
- CCBot — Common Crawl crawler
CCBot is the crawler operated by Common Crawl to build its open, freely available web dataset. That dataset is widely reused as a training source by many AI projects. Common Crawl documents the crawler and its robots.txt token, and CCBot honours robots.txt.
- Bot intelligence
Deterministic categorisation of crawlers, search bots, and automation.
Sources and verification notes
- Diffbot — crawler documentation
Documents the Diffbot crawler, its token, and robots.txt handling.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.