AI crawler disclosure and transparency

A transparent AI crawler is one you can identify and reason about: it declares a stable robots.txt token, carries a self-identifying user-agent pointing at operator documentation, publishes its network source so you can verify it, and states what it fetches content for. Disclosure is what separates a crawler you can set policy on from an undeclared scraper you can only detect by behaviour.

Verified against primary sources

What disclosure looks like

A disclosed AI crawler gives you four things you can check. First, a stable robots.txt token so you can address it specifically. Second, a self-identifying user-agent that includes a URL pointing back to the operator's crawler documentation. Third, a published network source — an IP range list or a verified-bot signal — so you can confirm a request really came from the operator. Fourth, a stated purpose: training, search, or real-time fetching.

OpenAI's and Anthropic's crawler docs are examples of this pattern: a named token, a documented user-agent, and a published way to verify. When all four elements of disclosure are present, you can set and enforce a deliberate policy.

Why transparency matters for policy

You can only govern what you can identify. A documented token lets robots.txt and edge rules target a crawler precisely; a published source lets you verify the request is genuine rather than a spoof; a stated purpose lets you decide whether being crawled aligns with what you want.

Without disclosure, none of that is possible. An undeclared scraper offers no token to target, no source to verify, and no purpose to weigh — so the only signal left is behaviour, which is slower and less certain to act on.

Documented token: lets you target the crawler in robots.txt and edge rules
Self-identifying user-agent with an operator URL: lets you trace a request back to the operator's documentation
Published source: lets you verify a request is genuine
Stated purpose: training, search, or real-time fetch
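As a rough illustration of targeting a crawler by its documented token, the sketch below parses a small robots.txt with Python's standard `urllib.robotparser` and checks what two tokens may fetch. The robots.txt content and the `allowed` helper are assumptions made for this example, not a recommended policy; GPTBot and ClaudeBot stand in for any declared token.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only: one declared crawler is
# restricted by its token, every other client is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits the given token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    print(allowed("GPTBot", "https://example.com/page"))     # False
    print(allowed("ClaudeBot", "https://example.com/page"))  # True
```

This only expresses the policy. A crawler has to read and honour the file for the rule to have any effect.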

Disclosure is not enforcement

Even when a transparent crawler honours robots.txt, you are still relying on its cooperation. Disclosure tells you who is fetching and lets you express policy; it does not by itself stop a non-compliant client. The two work together: use the disclosure to set clear robots.txt and edge policy, and back any policy you must enforce with rules that act on a verified source.
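One way to act on a verified source is to compare the connecting IP address against the ranges the operator publishes. The following is a minimal sketch under stated assumptions: `ExampleBot`, `PUBLISHED_RANGES`, and `is_verified_source` are hypothetical names, and the CIDR blocks are reserved documentation ranges rather than any real operator's list.

```python
import ipaddress

# Placeholder data: reserved documentation ranges, NOT a real operator's list.
# In practice these values would be loaded from the operator's published source.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def is_verified_source(token: str, remote_ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """Return True if remote_ip falls inside a range published for token."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # a malformed address can never be verified
    for cidr in ranges.get(token, []):
        network = ipaddress.ip_network(cidr)
        # Only compare addresses and networks of the same IP version.
        if ip.version == network.version and ip in network:
            return True
    return False


if __name__ == "__main__":
    print(is_verified_source("ExampleBot", "192.0.2.10"))    # True
    print(is_verified_source("ExampleBot", "198.51.100.7"))  # False
```

An edge rule built on a check like this acts on where the request came from, not on what the user-agent claims.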

Treat the absence of disclosure as a signal in itself. Sustained crawler-like traffic with no token and no verifiable source warrants closer scrutiny than a fully declared crawler, even if both fetch the same pages.

How it appears in analytics and logs

A request that carries a well-documented token and comes from a verifiable source belongs to a disclosed crawler you can govern with robots.txt and edge rules. Traffic with a vague or absent identifier and no published source is undeclared and can only be judged on behaviour.
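The distinction above could be applied to log entries roughly as follows. This is a sketch, not a description of how any particular product classifies traffic: the `Request` type, the `KNOWN_CRAWLERS` registry, the label strings, and the placeholder IP range are all assumptions for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    user_agent: str
    remote_ip: str


# Hypothetical registry: token -> published CIDR ranges (documentation range).
KNOWN_CRAWLERS = {"ExampleBot": ["192.0.2.0/24"]}


def classify(req: Request, registry=KNOWN_CRAWLERS) -> str:
    """Label a request as 'disclosed', 'unverified-claim', or 'undeclared'."""
    ua = req.user_agent.lower()
    for token, cidrs in registry.items():
        if token.lower() not in ua:
            continue
        try:
            ip = ipaddress.ip_address(req.remote_ip)
        except ValueError:
            return "unverified-claim"
        if any(ip in ipaddress.ip_network(cidr) for cidr in cidrs):
            return "disclosed"
        # The token is claimed but the source does not match: possible spoof.
        return "unverified-claim"
    # No known token at all. This bucket also contains ordinary browsers,
    # so only behaviour can separate scrapers from human visitors here.
    return "undeclared"


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
    print(classify(Request(ua, "192.0.2.5")))                         # disclosed
    print(classify(Request(ua, "203.0.113.9")))                       # unverified-claim
    print(classify(Request("python-requests/2.31", "203.0.113.9")))   # undeclared
```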

Diagnostic use case

Assess how transparent an AI crawler is before setting policy: check for a documented token, a self-identifying user-agent, a published source for verification, and a stated purpose, then allow or restrict accordingly.
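The four checks can be recorded as a simple checklist before you decide on policy. The sketch below is one possible shape, with a hypothetical `Disclosure` type whose field names mirror the four checks; how you fill in each field is a manual judgement based on the operator's documentation.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Disclosure:
    """Hypothetical record of the four disclosure checks for one crawler."""
    documented_token: bool
    self_identifying_user_agent: bool
    published_source: bool
    stated_purpose: bool

    def missing(self) -> list[str]:
        """Names of the checks this crawler fails."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def is_fully_disclosed(self) -> bool:
        return not self.missing()


if __name__ == "__main__":
    declared = Disclosure(True, True, True, True)
    partial = Disclosure(True, True, False, True)
    print(declared.is_fully_disclosed())  # True
    print(partial.missing())              # ['published_source']
```

A crawler with missing checks is not automatically hostile; the list simply shows which parts of your policy cannot yet be targeted or verified.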

What WebmasterID can help detect

WebmasterID classifies declared AI crawlers by token server-side and flags traffic that resembles a crawler but lacks a clear identifier, so you can see which AI activity is disclosed versus undeclared on the bot-intelligence surface.

Common mistakes

Trusting a self-identifying user-agent without verifying the published source.
Assuming a disclosed crawler's robots.txt compliance is the same as enforcement.
Ignoring undeclared crawler-like traffic because it lacks a token to react to.
Setting one policy for all AI crawlers regardless of their stated purpose.

Privacy and accuracy notes

Transparency here concerns the crawler operator's disclosures, not visitor identity. Verifying a crawler keys on its published token and network source; no human data is involved.

Frequently asked questions

How do I tell a transparent AI crawler from an undeclared scraper?

A transparent crawler declares a stable token, carries a user-agent linking to operator docs, publishes a network source you can verify, and states its purpose. An undeclared scraper offers none of these, so you can only judge it by request behaviour.

Sources and verification notes

OpenAI, bots documentation: example of a disclosed crawler with a token, user-agent, and published source.
Anthropic, crawler support article: documents ClaudeBot's token and how operators can identify and block it.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.