Documenting your AI crawler policy

An AI crawler policy is a written record of which AI crawler user-agent tokens you allow, throttle, or block, and why. Documenting it alongside your robots.txt and edge rules keeps decisions consistent as the crawler landscape changes, makes intent reviewable, and prevents the silent drift that happens when rules accrete without rationale. It is governance, not enforcement.

Why write it down

robots.txt and WAF rules express what you do, but not why. Over time, rules accumulate: someone blocks a crawler during an incident, someone else allows another for a campaign, and the rationale is lost. A documented policy captures intent so future edits are deliberate.

It also makes review possible. A policy you can read in one place can be checked against your actual config and against new crawlers as they appear, instead of being reverse-engineered from scattered directives.

What to record per token

For each AI token, record the decision (allow, throttle, block), the rationale, the date, and the source you relied on, ideally the operator's own documentation. Note the token's purpose (training, search, real-time fetch) so related tokens are not conflated, and link to where the rule lives (robots.txt, CDN, WAF).

Keep it honest about uncertainty: if a crawler is undocumented, say so, and mark the decision as provisional rather than implying confidence you do not have.

Decision, rationale, date, and source per token
Token purpose noted so related tokens are not conflated
Pointer to where the rule is actually enforced
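
The fields above can be kept as structured data, which makes the record easier to review and to compare against config. The following is a minimal sketch, not a prescribed schema: the class and field names (PolicyEntry, enforced_in, provisional) and the token name ExampleTrainingBot are hypothetical, and the date and rationale are illustrative only.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional, Tuple


class Decision(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"
    BLOCK = "block"


class Purpose(Enum):
    TRAINING = "training"
    SEARCH = "search"
    REALTIME_FETCH = "real-time fetch"
    UNKNOWN = "unknown"


@dataclass(frozen=True)
class PolicyEntry:
    """One documented decision for one crawler token (hypothetical schema)."""
    token: str                    # user-agent token the rule targets
    decision: Decision            # allow, throttle, or block
    purpose: Purpose              # kept separate so related tokens are not conflated
    rationale: str                # why the decision was made
    decided_on: date              # when it was made
    source: Optional[str]         # operator documentation relied on; None if undocumented
    enforced_in: Tuple[str, ...]  # where the rule lives, e.g. ("robots.txt", "WAF")
    provisional: bool = False     # True when the decision rests on incomplete information

    def __post_init__(self) -> None:
        # An undocumented crawler must be marked provisional rather than
        # recorded with implied confidence.
        if self.source is None and not self.provisional:
            raise ValueError(
                f"{self.token}: no source recorded, so the entry must be provisional"
            )


# Illustrative entry for a hypothetical, undocumented crawler.
entry = PolicyEntry(
    token="ExampleTrainingBot",
    decision=Decision.BLOCK,
    purpose=Purpose.TRAINING,
    rationale="No operator documentation found; blocked pending review.",
    decided_on=date(2026, 1, 15),
    source=None,
    enforced_in=("robots.txt",),
    provisional=True,
)
```

Making the entry refuse a missing source unless it is marked provisional is one way to keep the record honest about uncertainty; a looser schema that only warns would also be reasonable.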

Keeping policy and config in sync

A policy only helps if it matches reality. Periodically reconcile it against your robots.txt and edge rules, and against the tokens you actually observe in traffic. Drift in either direction, whether a documented block that is not enforced or an enforced block with no recorded rationale, is a signal to revisit the rule.

Treat the policy as living: when a new token appears or an operator changes its crawlers, update the record and the config together, so intent and enforcement never diverge for long.
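
The reconciliation described above reduces to a few set comparisons. This sketch assumes you have already extracted three inputs by whatever means suit your stack: the documented decisions, the set of tokens your config actually blocks, and the set of tokens seen in traffic. The function name find_drift, the report keys, and all token names are hypothetical.

```python
from typing import Dict, List, Set


def find_drift(
    documented: Dict[str, str],
    enforced_blocks: Set[str],
    observed: Set[str],
) -> Dict[str, List[str]]:
    """Compare documented decisions with enforced blocks and observed tokens.

    documented      maps token -> "allow" | "throttle" | "block"
    enforced_blocks tokens blocked in robots.txt or at the edge
    observed        tokens actually seen in traffic
    """
    documented_blocks = {t for t, d in documented.items() if d == "block"}
    return {
        # The policy says block, but no rule enforces it.
        "documented_block_not_enforced": sorted(documented_blocks - enforced_blocks),
        # A rule blocks the token, but the policy has no matching block decision.
        "enforced_block_without_rationale": sorted(enforced_blocks - documented_blocks),
        # The token shows up in traffic but has no policy entry at all.
        "observed_but_undocumented": sorted(observed - set(documented)),
    }


# Illustrative inputs using made-up token names.
documented = {"ExampleTrainingBot": "block", "ExampleSearchBot": "allow"}
enforced_blocks = {"ExampleLegacyBot"}
observed = {"ExampleSearchBot", "ExampleNewBot"}

report = find_drift(documented, enforced_blocks, observed)
for kind, tokens in report.items():
    print(kind, tokens)
```

Each non-empty list is a prompt to update the record and the config together, not a verdict on which side is wrong.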

How it appears in analytics and logs

If your robots.txt allows a token that your documented policy says to block, that mismatch is drift, and you can catch it by comparing the two. A per-token rationale also tells you whether a rule is still wanted or merely inherited.
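
A comparison of this kind can be sketched with Python's standard urllib.robotparser module. The function name robots_mismatches, the token names, and the sample robots.txt are hypothetical. The check only tests one path (the site root by default) and only sees robots.txt, so a block enforced at the CDN or WAF would still be reported as a mismatch and needs checking by hand.

```python
from typing import Dict, List, Tuple
from urllib.robotparser import RobotFileParser


def robots_mismatches(
    policy: Dict[str, str], robots_txt: str, path: str = "/"
) -> List[Tuple[str, str]]:
    """Return (token, reason) pairs where robots.txt disagrees with the policy.

    policy maps token -> "allow" | "throttle" | "block". Throttle decisions are
    skipped here, since this sketch only compares allow and block outcomes.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    mismatches = []
    for token, decision in sorted(policy.items()):
        allowed = parser.can_fetch(token, path)
        if decision == "block" and allowed:
            mismatches.append((token, "documented block, robots.txt allows"))
        elif decision == "allow" and not allowed:
            mismatches.append((token, "documented allow, robots.txt disallows"))
    return mismatches


# Illustrative robots.txt: one token is disallowed, everything else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleTrainingBot
Disallow: /

User-agent: *
Allow: /
"""

# The policy documents a block for both tokens, but only one is in robots.txt.
policy = {"ExampleTrainingBot": "block", "ExampleSearchBot": "block"}

result = robots_mismatches(policy, ROBOTS_TXT)
for token, reason in result:
    print(token, "->", reason)
```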

Diagnostic use case

Maintain a single, reviewable record of your AI-crawler decisions so robots.txt and WAF rules stay aligned with intent and new tokens are handled consistently.

What WebmasterID can help detect

On its bot-intelligence surface, WebmasterID shows the AI tokens actually hitting your site. You can reconcile these against your documented policy to find rules that no longer match observed traffic.

Common mistakes

Keeping rules in robots.txt and WAF with no recorded rationale.
Letting documented policy drift out of sync with actual config.
Implying certainty about crawlers whose behaviour is undocumented.
Conflating related tokens (training vs search) in a single policy line.

Privacy and accuracy notes

A crawler policy concerns token-level decisions, not visitor data. It records intent about automated traffic and involves no human identity.

Sources and verification notes

Google, robots.txt specification: robots.txt expresses crawl rules; a policy document records the rationale behind them.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.