Robots & crawl control

robots.txt vs a firewall/WAF

robots.txt and a firewall/WAF solve different problems: robots.txt asks crawlers to skip certain paths, and only compliant crawlers honor the request, while a firewall or WAF actually blocks requests at the network or edge layer. This page contrasts the two, explains when each is appropriate, and warns against using robots.txt for jobs only enforcement can do.

Verified against primary sources

Different layers, different jobs

robots.txt is an application-level convention: a public file that compliant crawlers read and voluntarily obey. It is ideal for steering search engines and well-behaved bots away from low-value paths, and for managing crawl budget.
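The voluntary nature of this convention can be illustrated with a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content, the paths, and the ExampleBot name are hypothetical. The point is that the crawler itself performs the check; nothing on the server side stops a client that skips it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the paths are illustrative only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /api/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text the way a compliant crawler would."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, path: str) -> bool:
    """Return True if the rules allow this user agent to fetch the path."""
    return parser.can_fetch(user_agent, path)


parser = build_parser(ROBOTS_TXT)
print(may_fetch(parser, "ExampleBot", "/search?q=shoes"))  # False
print(may_fetch(parser, "ExampleBot", "/products/42"))     # True
```

A compliant crawler runs a check like this before each request and skips disallowed paths. A non-compliant client simply never calls it.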

A firewall or WAF enforces decisions rather than requesting them. A network firewall filters traffic by attributes such as IP address and port, while a WAF inspects HTTP requests at the edge. Either can drop, challenge, or rate-limit requests before they reach your app, regardless of whether the client respects robots.txt. Use it for abusive scrapers, credential-stuffing bots, and anything that ignores polite controls.
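To make the contrast concrete, here is a rough sketch of the kind of decision an edge layer makes. It is not any vendor's rule syntax: the Request type, the edge_decision function, and both denylists are assumptions made up for illustration, and the IP range is a reserved documentation range.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical denylists for illustration; real deployments typically
# rely on managed rule sets rather than hand-written lists like these.
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range
BLOCKED_UA_SUBSTRINGS = ("badscraper",)
CHALLENGE_UA_SUBSTRINGS = ("python-requests", "curl")


@dataclass(frozen=True)
class Request:
    client_ip: str
    user_agent: str
    path: str


def edge_decision(request: Request) -> str:
    """Return 'block', 'challenge', or 'allow' for an incoming request.

    The decision never consults robots.txt: it applies whether or not
    the client chose to read or respect that file.
    """
    ip = ipaddress.ip_address(request.client_ip)
    if any(ip in network for network in BLOCKED_NETWORKS):
        return "block"
    user_agent = request.user_agent.lower()
    if any(s in user_agent for s in BLOCKED_UA_SUBSTRINGS):
        return "block"
    if any(s in user_agent for s in CHALLENGE_UA_SUBSTRINGS):
        return "challenge"
    return "allow"
```

For example, edge_decision(Request("203.0.113.7", "Mozilla/5.0", "/")) returns "block" because the address falls inside the denied range, whatever the user agent claims.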

robots.txt: voluntary, for compliant crawlers, manages crawl behavior
Firewall/WAF: enforced, blocks at the edge, stops non-compliant clients
robots.txt is public; a WAF rule is not advertised to clients

Use them together

These tools are complementary. Keep robots.txt for crawl management — disallowing search-result pages, faceted-navigation duplicates, or non-page endpoints from compliant crawlers. Reserve the WAF for enforcement: blocking known-bad bots, rate-limiting aggressive crawlers, and challenging suspicious automation.
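Rate-limiting aggressive crawlers, one of the enforcement jobs mentioned above, might look roughly like the sliding-window sketch below. The SlidingWindowLimiter class, its parameters, and the choice of client key (for example an IP address) are assumptions for illustration. Production WAFs implement this at the edge with their own configuration.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps a client key to the timestamps of its recent allowed requests.
        self._hits = defaultdict(deque)

    def allow(self, client_key: str, now: float) -> bool:
        """Record a request at time `now` and report whether it is allowed."""
        hits = self._hits[client_key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject without recording
        hits.append(now)
        return True
```

Passing the timestamp in explicitly keeps the sketch deterministic. A real caller would typically supply time.monotonic().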

A classic mistake is reaching for robots.txt to "block" a scraper. A scraper that ignores robots.txt is unaffected; only edge enforcement stops it. Conversely, do not WAF-block a search engine you actually want — use robots.txt to shape its crawl instead.

How it appears in analytics and logs

If a crawler keeps hitting paths you disallowed in robots.txt well after the rule was published, it is either non-compliant or an impostor spoofing a compliant crawler's user agent. Either way, it is a signal that enforcement (WAF/firewall), not robots.txt, is needed.
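One way to spot this pattern is to replay access-log lines against your robots.txt rules. The sketch below assumes logs in the common "combined" format and evaluates only the wildcard (User-agent: *) group. The disallowed_hits function and LOG_PATTERN are hypothetical names, and real log formats vary.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumes the "combined" access-log format; adjust for your server.
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"$'
)


def disallowed_hits(robots_txt: str, log_lines) -> Counter:
    """Count requests to disallowed paths, keyed by (ip, user agent)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = Counter()
    for line in log_lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        # Simplification: only the wildcard rule group is evaluated.
        if not parser.can_fetch("*", match["path"]):
            hits[(match["ip"], match["ua"])] += 1
    return hits
```

Clients with a high count are candidates for edge enforcement, since a Disallow rule has evidently not changed their behavior.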

Diagnostic use case

Choose the right tool for a bot problem: robots.txt to steer compliant crawlers, a firewall/WAF to stop abusive or non-compliant traffic that ignores robots.txt.

What WebmasterID can help detect

WebmasterID shows whether a crawler obeyed a robots.txt rule or kept coming, which helps you decide when to escalate from a polite Disallow to firewall-level enforcement.

Common mistakes

Using robots.txt to stop an abusive scraper that ignores it — only a WAF/firewall can.
Blocking a wanted search engine at the WAF instead of shaping its crawl with robots.txt.
Assuming a robots.txt Disallow provides any security guarantee.

Privacy and accuracy notes

Both robots.txt and WAF rules act on requests and user agents, not personal identities. Edge enforcement may use IPs operationally, but that is access control, not visitor profiling.


Sources and verification notes

Google: robots.txt is not an access-control mechanism. Confirms that robots.txt requests compliance and is not enforcement.
Cloudflare: what a web application firewall (WAF) is. Confirms that a WAF enforces request filtering at the edge.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.