AI crawler disclosure and transparency
A transparent AI crawler is one you can identify and reason about: it declares a stable robots.txt token, carries a self-identifying user-agent pointing at operator documentation, publishes its network source so you can verify it, and states what it fetches content for. Disclosure is what separates a crawler you can set policy on from an undeclared scraper you can only detect by behaviour.
What disclosure looks like
A disclosed AI crawler gives you four things you can check. First, a stable robots.txt token so you can address it specifically. Second, a self-identifying user-agent that includes a URL pointing back to the operator's crawler documentation. Third, a published network source — an IP range list or a verified-bot signal — so you can confirm a request really came from the operator. Fourth, a stated purpose: training, search, or real-time fetching.
OpenAI's and Anthropic's crawler docs are examples of this pattern: a named token, a documented user-agent, and, where the operator publishes one, a way to verify requests. When all four are present, you can set and enforce a deliberate policy.
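As a rough illustration of the third check, a request's source address can be compared against the ranges an operator publishes. The sketch below is a minimal example under stated assumptions: `ExampleBot`, `PUBLISHED_RANGES`, and `is_verified_source` are hypothetical names, and the CIDR blocks are reserved documentation ranges (RFC 5737), not any real crawler's addresses. In practice the list would be loaded from the operator's own published source.

```python
import ipaddress

# Placeholder data: RFC 5737 documentation ranges, NOT real crawler ranges.
# Replace with the list the crawler operator actually publishes.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/24"],
}


def is_verified_source(token: str, remote_ip: str) -> bool:
    """Return True if remote_ip falls inside a range published for token."""
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Malformed address in the log line: treat as unverified.
        return False
    ranges = PUBLISHED_RANGES.get(token, [])
    return any(addr in ipaddress.ip_network(cidr) for cidr in ranges)


if __name__ == "__main__":
    print(is_verified_source("ExampleBot", "192.0.2.10"))
    print(is_verified_source("ExampleBot", "203.0.113.5"))
```

A token with no published range always returns False here, which keeps the check conservative: a missing source is never treated as proof.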
Why transparency matters for policy
You can only govern what you can identify. A documented token lets robots.txt and edge rules target a crawler precisely; a published source lets you verify the request is genuine rather than a spoof; a stated purpose lets you decide whether being crawled aligns with what you want.
Without disclosure, none of that is possible. An undeclared scraper offers no token to target, no source to verify, and no purpose to weigh — so the only signal left is behaviour, which is slower and less certain to act on.
- Documented token — so you can target it in robots.txt and edge rules
- Self-identifying user-agent with an operator URL
- Published source — so you can verify a request is genuine
- Stated purpose — training, search, or real-time fetch
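To make the first point concrete, the sketch below shows how a documented token lets robots.txt address one crawler without affecting others. It uses Python's standard `urllib.robotparser` to evaluate a sample policy. The tokens `ExampleTrainingBot` and `ExampleSearchBot` and the `example.com` URLs are made up for illustration; substitute the tokens an operator actually documents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: block one declared token, allow another,
# and apply a default rule to everything else.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /private/
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    for agent in ("ExampleTrainingBot", "ExampleSearchBot", "SomeOtherBot"):
        allowed = parser.can_fetch(agent, "https://example.com/articles/1")
        print(agent, "allowed" if allowed else "disallowed")
```

This only expresses policy. Whether a client honours it is a separate question, which is why verification and edge rules still matter.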
Disclosure is not enforcement
A transparent crawler that honours robots.txt is still relying on cooperation. Disclosure tells you who is fetching and lets you express policy; it does not by itself stop a non-compliant client. The two work together: use the disclosure to set clear robots.txt and edge policy, and back any policy you must enforce with edge rules that act on the verified source rather than on the user-agent alone.
Treat the absence of disclosure as a signal in itself. Sustained crawler-like traffic with no token and no verifiable source warrants closer scrutiny than a fully declared crawler, even if both fetch the same pages.
How it appears in analytics and logs
A request that carries a well-documented token and comes from a verifiable source belongs to a disclosed crawler you can govern with robots.txt and edge rules. Traffic with a vague or absent identifier and no published source is undeclared and can only be judged on behaviour.
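The distinction above can be sketched as a small log classifier. This is an illustrative outline, not a description of any particular product's logic: `DECLARED_TOKENS`, `Request`, and `classify` are hypothetical names, the tokens are invented, and the CIDR block is a reserved documentation range standing in for an operator's published list. A label of `no-declared-token` says only that no known token was present. Whether the traffic is crawler-like still has to be judged on behaviour.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical registry: token -> published CIDR ranges (placeholders only).
# An empty list means the token is documented but no source is published.
DECLARED_TOKENS = {
    "ExampleBot": ["192.0.2.0/24"],
    "OtherBot": [],
}


@dataclass
class Request:
    user_agent: str
    remote_ip: str


def classify(req: Request) -> str:
    """Label a request by how well its claimed identity can be checked."""
    ua = req.user_agent.lower()
    for token, cidrs in DECLARED_TOKENS.items():
        if token.lower() not in ua:
            continue
        if not cidrs:
            return "declared-unverifiable"
        try:
            addr = ipaddress.ip_address(req.remote_ip)
        except ValueError:
            return "claimed-unverified"
        if any(addr in ipaddress.ip_network(c) for c in cidrs):
            return "disclosed-verified"
        # Token present but source does not match: possible spoof.
        return "claimed-unverified"
    return "no-declared-token"


if __name__ == "__main__":
    ua = "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/bot)"
    print(classify(Request(ua, "192.0.2.7")))
    print(classify(Request(ua, "203.0.113.9")))
```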
Diagnostic use case
Assess how transparent an AI crawler is before setting policy: check for a documented token, a self-identifying user-agent, a published source for verification, and a stated purpose, then allow or restrict accordingly.
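One way to record this assessment is as a four-item checklist per crawler. The sketch below is a minimal, assumed representation: the `Disclosure` type, its field names, and the `full` / `partial` / `none` levels are invented for illustration and are not a standard scoring scheme.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Disclosure:
    """The four disclosure signals, recorded as observed for one crawler."""
    documented_token: bool
    self_identifying_user_agent: bool
    published_source: bool
    stated_purpose: bool


def assess(d: Disclosure) -> tuple[str, list[str]]:
    """Return a coarse disclosure level and the names of missing signals."""
    missing = [name for name, present in asdict(d).items() if not present]
    if not missing:
        level = "full"
    elif len(missing) == len(asdict(d)):
        level = "none"
    else:
        level = "partial"
    return level, missing


if __name__ == "__main__":
    # A crawler that names itself and its purpose but publishes no source.
    level, missing = assess(Disclosure(True, True, False, True))
    print(level, missing)
```

Keeping the missing signals alongside the level makes the resulting policy decision reviewable: a reader can see which check failed, not just the outcome.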
What WebmasterID can help detect
WebmasterID classifies declared AI crawlers by token server-side and flags traffic that resembles a crawler but lacks a clear identifier, so you can see which AI activity is disclosed versus undeclared on the bot-intelligence surface.
Common mistakes
- Trusting a self-identifying user-agent without verifying the published source.
- Assuming a disclosed crawler's robots.txt compliance is the same as enforcement.
- Ignoring undeclared crawler-like traffic because it lacks a token to react to.
- Setting one policy for all AI crawlers regardless of their stated purpose.
Privacy and accuracy notes
Transparency here concerns the crawler operator's disclosures, not visitor identity. Verifying a crawler keys on its published token and network source; no human data is involved.
Frequently asked questions
- How do I tell a transparent AI crawler from an undeclared scraper?
  A transparent crawler declares a stable token, carries a user-agent linking to operator docs, publishes a network source you can verify, and states its purpose. An undeclared scraper offers none of these, so you can only judge it by request behaviour.
Related pages
- Verifying AI crawlers
Any client can copy a user-agent string, so a token alone is a claim, not proof. Some vendors, such as OpenAI for GPTBot, publish IP ranges or verification guidance; many do not. Verify before trusting, and never invent IP ranges to fill the gap.
- Undeclared AI scrapers and how they appear
Some AI scrapers do not declare a recognisable token. They appear with generic user agents, browser-like strings, or forged identities. They cannot be identified by a clean token, so the honest approach is to describe the pattern, verify what you can, and categorise conservatively.
- Documenting your AI crawler policy
An AI crawler policy is a written record of which AI tokens you allow, throttle, or block, and why. Documenting it — alongside your robots.txt and edge rules — keeps decisions consistent as the crawler landscape changes, makes intent reviewable, and prevents the silent drift that happens when rules accrete without rationale. It is governance, not enforcement.
- Bot intelligence
See which AI activity is disclosed by token versus undeclared.
Sources and verification notes
- OpenAI, bots documentation. Example of a disclosed crawler: token, user-agent, and published source.
- Anthropic, crawler support article. Documents ClaudeBot's token and how operators can identify and block it.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.