AI crawler disclosure and transparency

A transparent AI crawler is one you can identify and reason about: it declares a stable robots.txt token, carries a self-identifying user-agent pointing at operator documentation, publishes its network source so you can verify it, and states what it fetches content for. Disclosure is what separates a crawler you can set policy on from an undeclared scraper you can only detect by behaviour.

Verified against primary sources

What disclosure looks like

A disclosed AI crawler gives you four things you can check. First, a stable robots.txt token so you can address it specifically. Second, a self-identifying user-agent that includes a URL pointing back to the operator's crawler documentation. Third, a published network source — an IP range list or a verified-bot signal — so you can confirm a request really came from the operator. Fourth, a stated purpose: training, search, or real-time fetching.
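The four checks above can be recorded as a small checklist. This is a minimal sketch, not a WebmasterID feature: the class name, field names, and example values are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlerDisclosure:
    """What an operator has published about one crawler (hypothetical structure)."""
    robots_token: Optional[str]    # stable robots.txt token
    docs_url: Optional[str]        # URL in the user-agent pointing at operator docs
    network_source: Optional[str]  # where the IP ranges or verified-bot signal is published
    purpose: Optional[str]         # "training", "search", or "real-time fetching"

    def missing(self) -> list[str]:
        # Field names whose value is absent or empty, in declaration order.
        return [name for name, value in vars(self).items() if not value]

    def is_fully_disclosed(self) -> bool:
        return not self.missing()


# Example values are placeholders, not a real crawler.
partial = CrawlerDisclosure("ExampleAIBot", None, None, "training")
print(partial.missing())
```

A crawler with an empty `missing()` list is one you can set deliberate policy on; anything else tells you which disclosure to look for before deciding.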

OpenAI's and Anthropic's crawler docs are examples of this pattern: a named token, a documented user-agent, and a published way to verify. When all four elements, including a stated purpose, are present, you can set and enforce a deliberate policy.

Why transparency matters for policy

You can only govern what you can identify. A documented token lets robots.txt and edge rules target a crawler precisely; a published source lets you verify the request is genuine rather than a spoof; a stated purpose lets you decide whether being crawled aligns with what you want.
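To illustrate how a documented token makes precise targeting possible, the sketch below parses a robots.txt policy with Python's standard `urllib.robotparser` and checks what a named crawler may fetch. The token `ExampleAIBot` and the paths are assumptions made up for the example; substitute the token an operator actually documents.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep one named AI crawler out of /drafts/,
# leave everything else open to all clients.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /drafts/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(token: str, path: str) -> bool:
    """Return True if the robots.txt policy lets `token` fetch `path`."""
    return parser.can_fetch(token, path)


print(allowed("ExampleAIBot", "/drafts/post-1"))
print(allowed("ExampleAIBot", "/blog/"))
print(allowed("OtherBot", "/drafts/post-1"))
```

The rule only bites because the crawler has a token to name. An undeclared scraper falls through to the `*` group, which is why robots.txt alone cannot express a policy for it.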

Without disclosure, none of that is possible. An undeclared scraper offers no token to target, no source to verify, and no purpose to weigh — so the only signal left is behaviour, which is slower and less certain to act on.

Disclosure is not enforcement

Even when a transparent crawler honours robots.txt, you are still relying on its cooperation. Disclosure tells you who is fetching and lets you express policy; it does not by itself stop a non-compliant client. The two work together: use the disclosure to set clear robots.txt and edge policy, and back any policy you must enforce with rules that act on a verified source.
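A rule that acts on a verified source might look like the sketch below, which checks the request IP against published ranges before trusting the claimed token. It is an illustration under stated assumptions: the ranges are reserved documentation blocks standing in for an operator's real list, and the function names and the "allow" / "challenge" / "block" outcomes are hypothetical.

```python
import ipaddress
from typing import Optional

# Placeholder ranges (reserved documentation blocks). In practice these
# would be loaded from the operator's published list (assumption).
PUBLISHED_RANGES = ["192.0.2.0/24", "2001:db8::/32"]
NETWORKS = [ipaddress.ip_network(cidr) for cidr in PUBLISHED_RANGES]


def is_verified_source(remote_ip: str) -> bool:
    """True if remote_ip parses and falls inside a published range."""
    try:
        addr = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False
    return any(addr in net for net in NETWORKS)


def edge_decision(claimed_token: Optional[str], remote_ip: str,
                  blocked_tokens: set[str]) -> str:
    """Decide what to do with a request based on token and verified source."""
    if not claimed_token:
        return "allow"      # not claiming to be a declared crawler
    if not is_verified_source(remote_ip):
        return "challenge"  # token claimed but source unverified: possible spoof
    return "block" if claimed_token in blocked_tokens else "allow"


print(edge_decision("ExampleAIBot", "192.0.2.10", {"ExampleAIBot"}))
print(edge_decision("ExampleAIBot", "198.51.100.7", {"ExampleAIBot"}))
```

Keying enforcement on the verified source rather than the user-agent string means a client that merely copies the token is not treated as the genuine crawler.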

Treat the absence of disclosure as a signal in itself. Sustained crawler-like traffic with no token and no verifiable source warrants closer scrutiny than a fully declared crawler, even if both fetch the same pages.

How it appears in analytics and logs

A request that carries a well-documented token and arrives from a verifiable source belongs to a disclosed crawler you can govern with robots.txt and edge rules. Traffic with a vague or absent identifier and no published source is undeclared and can only be judged on behaviour.
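As a rough sketch of that distinction applied to log entries, the function below labels a crawler-like request from its user-agent string and a source-verification result. The token list, the labels, and the function name are hypothetical and do not describe how WebmasterID classifies traffic.

```python
import re

# Hypothetical list of tokens you have confirmed in operator documentation.
KNOWN_TOKENS = ("ExampleAIBot",)

# Matches a URL embedded in a user-agent, e.g. "+https://example.com/bot".
URL_IN_UA = re.compile(r"https?://[^\s;)]+")


def classify(user_agent: str, source_verified: bool) -> str:
    """Label a crawler-like request as disclosed, partial, or undeclared."""
    ua = user_agent or ""
    has_token = any(token.lower() in ua.lower() for token in KNOWN_TOKENS)
    has_docs_url = bool(URL_IN_UA.search(ua))
    if has_token and has_docs_url and source_verified:
        return "disclosed"
    if has_token:
        return "partial"      # token present, but docs URL or source check missing
    return "undeclared"       # no known token: judge on behaviour only


ua = "Mozilla/5.0 (compatible; ExampleAIBot/1.0; +https://example.com/bot)"
print(classify(ua, source_verified=True))
print(classify(ua, source_verified=False))
```

The "partial" label is a deliberate middle ground in this sketch: a request that names a known token but fails source verification may be a spoof, so it deserves different handling from both a confirmed crawler and anonymous traffic.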

Diagnostic use case

Assess how transparent an AI crawler is before setting policy: check for a documented token, a self-identifying user-agent, a published source for verification, and a stated purpose, then allow or restrict accordingly.

What WebmasterID can help detect

WebmasterID classifies declared AI crawlers by token server-side and flags traffic that resembles a crawler but lacks a clear identifier, so you can see which AI activity is disclosed versus undeclared on the bot-intelligence surface.

Common mistakes

Privacy and accuracy notes

Transparency here concerns the crawler operator's disclosures, not visitor identity. Verifying a crawler keys on its published token and network source; no human data is involved.

Frequently asked questions

How do I tell a transparent AI crawler from an undeclared scraper?
A transparent crawler declares a stable token, carries a user-agent linking to operator docs, publishes a network source you can verify, and states its purpose. An undeclared scraper offers none of these, so you can only judge it by request behaviour.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.