Crawl diagnostics

Diagnosing a blocked crawler

When a crawler is not reaching your pages, the block can come from several layers: a robots.txt Disallow, a server-side 403, a WAF or bot-management rule, or an IP filter. Confirming which layer is responsible — rather than guessing — is the key to fixing it without opening doors you meant to keep shut.

Where a block can come from

Crawler access can be cut off at different layers, and each behaves differently. A robots.txt Disallow asks compliant crawlers not to fetch certain paths, but it is a request, not enforcement. A server-side 403 is an enforced refusal: the server understood the request and declines to fulfil it, and re-sending credentials will not change the outcome. A WAF or bot-management rule may return a 403 (or a challenge) to traffic it classifies as automated. An IP-level firewall block stops the connection at the network layer, so no HTTP response is returned at all.

Knowing which layer is responsible tells you where to make the change.

robots.txt Disallow — request to compliant crawlers
403 / WAF rule — enforced server-side refusal or challenge
IP firewall block — connection refused at the network layer

How to confirm the cause

Start with robots.txt: does a Disallow match the path and the crawler's token? Next check the response: is the crawler getting a 403 or a challenge page rather than a 200? If so, inspect WAF and bot-management logs for the rule that fired. If connections fail entirely, look at IP-level filtering.
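The robots.txt step can be sketched with Python's standard-library parser. This is a minimal illustration, not a definitive check: the `ExampleBot` token, the rules, and the URLs are hypothetical, and the standard-library parser may not interpret every rule (for example wildcard patterns) the way a given crawler operator does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed locally so no network call is needed.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_disallowed(robots_txt: str, token: str, url: str) -> bool:
    """Return True if the robots.txt text disallows `url` for the crawler `token`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, url)


if __name__ == "__main__":
    for token, url in [
        ("ExampleBot", "https://example.com/private/report.html"),
        ("ExampleBot", "https://example.com/public/page.html"),
        ("OtherBot", "https://example.com/private/report.html"),
    ]:
        print(token, url, "disallowed" if is_disallowed(ROBOTS_TXT, token, url) else "allowed")
```

If this reports the path as allowed for the crawler's token, robots.txt is unlikely to be the cause and the next layer is worth checking.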

Before allowlisting anything, verify the crawler is genuinely who it claims to be using the operator's published verification method — do not trust the user-agent alone, since it can be spoofed.

Check robots.txt for a matching Disallow
Check for 403 / challenge responses in logs
Verify the crawler before allowlisting it
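Verification methods differ between operators, so the operator's own documentation takes precedence. As one illustration, assuming the operator publishes a reverse-DNS check, a forward-confirmed reverse DNS lookup might look like the sketch below. The function name, the hostname suffix, and the IP address are hypothetical; the lookups are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def _reverse_lookup(ip: str) -> str:
    """Resolve an IP address to its primary hostname (PTR lookup)."""
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(hostname: str) -> List[str]:
    """Resolve a hostname to the list of IP addresses it maps to."""
    return [info[4][0] for info in socket.getaddrinfo(hostname, None)]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Forward-confirmed reverse DNS check.

    1. The IP's reverse DNS name must sit under one of the operator's domains.
    2. That hostname must resolve back to the same IP.
    """
    try:
        hostname = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Require a full label boundary so "notbot.example.com" does not match "bot.example.com".
    suffixes = ["." + s.strip(".").lower() for s in allowed_suffixes]
    if not any(hostname.endswith(s) for s in suffixes):
        return False

    try:
        return ip in forward(hostname)
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical values: substitute the suffixes the operator actually documents.
    print(verify_crawler_ip("192.0.2.10", ["bot.example.com"]))
```

The user-agent string plays no part in this check, which is the point: only a request that passes verification would be a candidate for allowlisting.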

Operator checklist

Identify the token and path. Rule out robots.txt, then 403/WAF, then IP filtering, in order. Verify a legitimate crawler before allowlisting. Remember robots.txt does not protect content — use real access controls for anything that must stay private.
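The ordering in the checklist can be expressed as a small decision function. This is a sketch under stated assumptions: the `Observation` fields and the returned labels are hypothetical names chosen for illustration, and the inputs would come from your own robots.txt check and a test request.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    """What was observed for one crawler token and path (hypothetical structure)."""
    robots_disallowed: bool        # a Disallow rule matches the token and path
    http_status: Optional[int]     # None when no HTTP response was received at all
    challenge_page: bool = False   # the response was a bot challenge rather than the page


def likely_layer(obs: Observation) -> str:
    """Walk the layers in checklist order and name the first one that explains the block."""
    if obs.robots_disallowed:
        return "robots.txt"
    if obs.http_status == 403 or obs.challenge_page:
        return "403/waf"
    if obs.http_status is None:
        return "ip-filter"
    return "no-block-found"


if __name__ == "__main__":
    print(likely_layer(Observation(robots_disallowed=False, http_status=403)))
```

The result only points at where to look next; a "403/waf" answer still needs the WAF or bot-management logs to identify the specific rule.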

How it appears in analytics and logs

A blocked crawler shows up as missing crawl coverage, robots.txt Disallow matches, or 403 responses. An IP-level block may leave no entry in web server logs at all, because the connection never reaches the HTTP layer. The layer matters: robots.txt is a request to compliant crawlers, while a 403, WAF rule, or IP block is enforced.
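To see where 403s cluster, access logs can be tallied by crawler token. The sketch below assumes logs in the common "combined" format; the sample lines, the `ExampleBot` and `OtherBot` tokens, and the function name are hypothetical, and the pattern would need adjusting for other log formats.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the request, status, size, referer and user-agent fields of a combined-format line.
LOG_PATTERN = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical sample lines using documentation-range addresses.
SAMPLE_LOG = [
    '192.0.2.10 - - [24/Jun/2026:10:00:00 +0000] "GET /docs/ HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"',
    '192.0.2.10 - - [24/Jun/2026:10:00:05 +0000] "GET /docs/a HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"',
    '192.0.2.10 - - [24/Jun/2026:10:00:09 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "Mozilla/5.0 (compatible; ExampleBot/1.0)"',
    '192.0.2.20 - - [24/Jun/2026:10:01:00 +0000] "GET /docs/ HTTP/1.1" 403 153 "-" "OtherBot/2.1"',
]


def count_403s(lines: Iterable[str], tokens: Iterable[str]) -> Counter:
    """Count 403 responses per crawler token, matching tokens case-insensitively."""
    tokens = list(tokens)
    counts: Counter = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None or match.group("status") != "403":
            continue
        agent = match.group("agent").lower()
        for token in tokens:
            if token.lower() in agent:
                counts[token] += 1
    return counts


if __name__ == "__main__":
    print(count_403s(SAMPLE_LOG, ["ExampleBot", "OtherBot"]))
```

Because the tally relies on the user-agent string, it shows who claims to be receiving 403s; it is a diagnostic signal, not proof of identity.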

Diagnostic use case

Confirm why a legitimate crawler is not reaching pages, isolating whether robots.txt, a 403, a WAF, or an IP filter is responsible before making changes.

What WebmasterID can help detect

WebmasterID can show which crawlers reach which paths and where they receive 403s, helping you localise a block to robots.txt, an edge rule, or an IP filter.

Common mistakes

Assuming robots.txt is blocking when a WAF 403 is the real cause.
Allowlisting a crawler by trusting its user-agent instead of verifying it.
Relying on robots.txt to protect private content from non-compliant clients.

Privacy and accuracy notes

Diagnosis uses request-level signals — user-agent tokens, status codes, robots rules — not visitor identity or raw IP addresses. WebmasterID reports crawler activity without exposing individual visitors.

Sources and verification notes

Google Search Central: robots.txt introduction. Documents that robots.txt is a crawling request, not access control.
MDN: 403 Forbidden.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.