AI crawlers

Diffbot — knowledge-graph crawler

Diffbot is the crawler operated by the company of the same name. It extracts structured data from web pages to build and maintain a knowledge graph. The company documents the crawler and its robots.txt token, and the crawler identifies itself with the Diffbot token plus a self-identifying URL.

Verified against primary sources

What this means

Diffbot is a crawler that extracts structured data from web pages to build a knowledge graph — a machine-readable model of entities and their relationships. Rather than simply storing page text, it parses pages into structured records.

Diffbot documents the crawler and the robots.txt token operators can use to control it. Allowing Diffbot lets it extract structured data from your public pages; disallowing it asks the crawler to stay out.

How Diffbot identifies itself

Diffbot uses the robots.txt user-agent token Diffbot. Its user-agent string contains that token together with a self-identifying URL pointing at Diffbot's documentation. Match on the stable token rather than a full version string.

The user-agent header is a self-reported claim that any client can copy. Where authenticity matters, follow the verification guidance Diffbot itself publishes, and do not assume or invent IP ranges for it.

robots.txt token: Diffbot
Extracts structured data to build a knowledge graph
User agent contains the token plus a Diffbot URL
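As a minimal sketch of matching on the stable token rather than a full version string, the check below looks for the word "Diffbot" anywhere in the user-agent header, ignoring case. The function name and both sample strings are illustrative assumptions, not copied from Diffbot's documentation, and a match only means the request claims to be Diffbot.

```python
import re

# Assumption: the token appears as a standalone word in the User-Agent header.
DIFFBOT_TOKEN = re.compile(r"\bdiffbot\b", re.IGNORECASE)


def claims_diffbot(user_agent: str) -> bool:
    """Return True if the header claims to be Diffbot (a claim, not proof)."""
    return bool(DIFFBOT_TOKEN.search(user_agent or ""))


# Illustrative strings only; real headers may differ.
sample_bot = "Mozilla/5.0 (compatible; Diffbot/0.1; +http://www.diffbot.com)"
sample_browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0"

bot_result = claims_diffbot(sample_bot)
browser_result = claims_diffbot(sample_browser)
```

Matching on the token alone keeps the check stable if the version number or the surrounding browser-style prefix changes.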

robots.txt considerations

Diffbot honours robots.txt. To disallow it site-wide:

User-agent: Diffbot
Disallow: /

This targets only Diffbot. robots.txt is a request honoured by compliant crawlers, not an access-control boundary.
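One way to sanity-check the rule before deploying it is Python's standard-library urllib.robotparser. The sketch below parses the two-line rule and asks whether two crawler tokens may fetch a page. The example.com URL and the "ExampleBot" token are hypothetical, and the parser only reports what a compliant crawler should do; it does not show what Diffbot actually does.

```python
from urllib.robotparser import RobotFileParser

# The site-wide disallow rule for Diffbot, one directive per line.
ROBOTS_LINES = [
    "User-agent: Diffbot",
    "Disallow: /",
]

rules = RobotFileParser()
rules.parse(ROBOTS_LINES)

# Hypothetical URL used only for the check.
page = "https://example.com/products/widget"

# Pass the bare token, not a full user-agent string.
diffbot_allowed = rules.can_fetch("Diffbot", page)
other_allowed = rules.can_fetch("ExampleBot", page)  # hypothetical crawler

print("Diffbot allowed:", diffbot_allowed)
print("ExampleBot allowed:", other_allowed)
```

Because the file contains no wildcard group, crawlers other than Diffbot are unaffected by this rule.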

How it appears in analytics and logs

A request carrying the Diffbot token is Diffbot's crawler fetching a URL to extract structured data — a bot event, not a human visit. Treat sustained Diffbot activity as data-extraction crawl coverage rather than audience growth.
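For sites that do inspect raw logs, the following sketch tallies requests that carry the Diffbot token per path so they can be kept out of human traffic counts. It assumes a combined-log-style line in which the user agent is the last quoted field. The sample lines, the addresses (reserved documentation ranges), and the function name are all illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: the request line is the first quoted field and the
# user agent is the last quoted field on each log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+)[^"]*".*"(?P<ua>[^"]*)"$'
)


def diffbot_hits_by_path(lines):
    """Count requests per path whose user agent contains the Diffbot token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line.strip())
        if match and "diffbot" in match.group("ua").lower():
            hits[match.group("path")] += 1
    return hits


# Illustrative log lines; addresses come from reserved documentation ranges.
sample_log = [
    '203.0.113.5 - - [24/Jun/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (compatible; Diffbot/0.1; +http://www.diffbot.com)"',
    '198.51.100.7 - - [24/Jun/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" '
    '200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/126.0"',
    '203.0.113.5 - - [24/Jun/2026:10:01:00 +0000] "GET /about HTTP/1.1" '
    '200 256 "-" "Mozilla/5.0 (compatible; Diffbot/0.1; +http://www.diffbot.com)"',
]

hits = diffbot_hits_by_path(sample_log)
```

The resulting counts describe crawl coverage by a client claiming to be Diffbot; they say nothing about whether the claim is authentic.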

Diagnostic use case

Confirm whether Diffbot has crawled a page for structured-data extraction and set robots.txt policy for it.

What WebmasterID can help detect

WebmasterID classifies Diffbot server-side by its token and reports its activity on the bot-intelligence surface, so you can see structured-data crawl coverage of your pages without parsing logs.

Common mistakes

Trusting the Diffbot user agent without verification where authenticity matters.
Counting Diffbot extraction crawls as human traffic.
Expecting robots.txt to enforce access rather than request compliance.

Privacy and accuracy notes

Detection uses only the request user-agent. No human identity is involved — a crawler is not a person. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.


Sources and verification notes

Diffbot crawler documentation: documents the Diffbot crawler, its token, and robots.txt handling.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.