WebmasterID
Crawl diagnostics

Diagnosing a blocked crawler

When a crawler is not reaching your pages, the block can come from several layers: a robots.txt Disallow, a server-side 403, a WAF or bot-management rule, or an IP filter. Confirming which layer is responsible — rather than guessing — is the key to fixing it without opening doors you meant to keep shut.

Verified against primary sources

Where a block can come from

Crawler access can be cut off at different layers, and they behave differently. A robots.txt Disallow asks compliant crawlers not to fetch certain paths, but it is a request, not enforcement. A server-side 403 is an enforced refusal: the server understood the request but refuses to fulfil it, and retrying with the same credentials will not help. A WAF or bot-management rule may return a 403 (or a challenge) to traffic it classifies as automated. An IP-level firewall block drops or rejects the connection before any HTTP response is sent.

Knowing which layer is responsible tells you where to make the change.

How to confirm the cause

Start with robots.txt: does a Disallow match the path and the crawler's token? Next check the response: is the crawler getting a 403 or a challenge page rather than a 200? If so, inspect WAF and bot-management logs for the rule that fired. If connections fail entirely, look at IP-level filtering.
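
The order of checks above can be sketched in Python. This is a minimal illustration, not a definitive tool: the function name diagnose_block, the crawler token ExampleBot, and the returned labels are all hypothetical, and the challenge-page flag is assumed to come from your own inspection of the response body. The robots.txt matching uses the standard library's urllib.robotparser, whose matching rules may differ in detail from those of a particular crawler.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def diagnose_block(robots_txt: str, token: str, url: str,
                   status: Optional[int],
                   looks_like_challenge: bool = False) -> str:
    """Return the first layer that explains a block (labels are hypothetical).

    status is the HTTP status the crawler received, or None when the
    connection failed before any response arrived.
    """
    # 1. robots.txt: does a Disallow match this token and path?
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    if not parser.can_fetch(token, url):
        return "robots.txt"
    # 2. Response: a 403 or a challenge page points at the server or WAF.
    if status == 403 or looks_like_challenge:
        return "403-or-waf"
    # 3. No response at all: look at IP-level filtering.
    if status is None:
        return "ip-filter"
    return "not-blocked" if status == 200 else "other"
```

A "403-or-waf" result only narrows the search; the WAF and bot-management logs are still needed to find the rule that fired.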

Before allowlisting anything, verify the crawler is genuinely who it claims to be using the operator's published verification method — do not trust the user-agent alone, since it can be spoofed.
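
One verification method that some operators publish is forward-confirmed reverse DNS: resolve the connecting IP to a hostname, check that the hostname falls under a domain the operator documents, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below assumes that method applies; the function name is_verified_crawler and the suffix crawler.example.com are hypothetical, and the real suffixes (or a published IP list, if the operator uses one instead) must come from the operator's own documentation.

```python
import socket
from typing import Callable, Iterable, List


def _reverse(ip: str) -> str:
    # PTR lookup: IP address -> primary hostname.
    return socket.gethostbyaddr(ip)[0]


def _forward(host: str) -> List[str]:
    # A-record lookup: hostname -> IPv4 addresses (this call is IPv4 only).
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(ip: str, allowed_suffixes: Iterable[str],
                        reverse: Callable[[str], str] = _reverse,
                        forward: Callable[[str], List[str]] = _forward) -> bool:
    """Forward-confirmed reverse DNS check (hypothetical helper)."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    # Require a full domain-label match, not just a trailing substring.
    suffixes = ["." + s.strip(".").lower() for s in allowed_suffixes]
    if not any(host.endswith(s) for s in suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

The resolver functions are passed in as parameters so the check can be exercised without network access; in normal use the defaults perform real DNS lookups.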

Operator checklist

Identify the token and path. Rule out robots.txt, then 403/WAF, then IP filtering, in order. Verify a legitimate crawler before allowlisting. Remember robots.txt does not protect content — use real access controls for anything that must stay private.

How it appears in analytics and logs

A blocked crawler shows up as missing crawl coverage, robots.txt Disallow matches, or 403 responses. The layer matters: robots.txt is a request to compliant crawlers, while a 403/WAF/IP block is an enforced server-side refusal.
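
As an illustration of spotting 403 responses in logs, the sketch below tallies 403s per crawler token and path. It assumes access logs in the common "combined" format; the regular expression, the function name count_403s, and the token ExampleBot are hypothetical and would need adapting to your own log layout. Only the request path, status code, and user-agent are read.

```python
import re
from collections import Counter
from typing import Iterable, Sequence, Tuple

# Assumed combined log format: "METHOD path proto" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_403s(lines: Iterable[str],
               tokens: Sequence[str]) -> "Counter[Tuple[str, str]]":
    """Count 403 responses keyed by (crawler token, path)."""
    counts: "Counter[Tuple[str, str]]" = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None or match.group("status") != "403":
            continue
        user_agent = match.group("ua").lower()
        for token in tokens:
            # Case-insensitive substring match on the user-agent token.
            if token.lower() in user_agent:
                counts[(token, match.group("path"))] += 1
    return counts
```

A cluster of 403s for one token on one path prefix suggests an edge or server rule. The user-agent can be spoofed, so these counts show where to look, not who the client really was.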

Diagnostic use case

Confirm why a legitimate crawler is not reaching pages, isolating whether robots.txt, a 403, a WAF, or an IP filter is responsible before making changes.

What WebmasterID can help detect

WebmasterID can show which crawlers reach which paths and where they receive 403s, helping you localise a block to robots.txt, an edge rule, or an IP filter.

Privacy and accuracy notes

Diagnosis uses request-level signals — user-agent tokens, status codes, robots rules — not visitor identity or raw IP addresses. WebmasterID reports crawler activity without exposing individual visitors.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.