Robots & crawl control

Using robots.txt to protect crawl budget

On large sites, crawlers spend a finite amount of effort — often called crawl budget — and can waste it on low-value or near-duplicate URLs. robots.txt can steer them away from those paths so they reach your important pages more often. This matters mostly for big sites; small sites rarely need it.

Verified against primary sources

When crawl budget matters

Google describes crawl budget as a concern primarily for large sites, or for sites that generate many URL variations. Its guidance is aimed at sites with roughly a million or more unique pages, or around ten thousand or more pages whose content changes daily. On a small site, Google generally crawls efficiently and you do not need to manage budget. The problem appears when low-value URLs (endless filter combinations, session parameters, near-duplicates) consume crawl effort that would be better spent on your real content.

Mainly relevant to large or URL-heavy sites
Faceted navigation and parameters generate low-value URLs
Small sites rarely need to manage crawl budget
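
To make "endless filter combinations" concrete, here is a minimal sketch that enumerates the URLs one category page can produce when each facet is optional. The facet names, values, and the /shoes path are hypothetical and chosen only for illustration; real faceted navigation often also varies parameter order, which multiplies the count further.

```python
from itertools import product

# Hypothetical facets for a single category page (illustrative values only).
FACETS = {
    "color": ["red", "blue", "green", "black", "white"],
    "size": ["xs", "s", "m", "l", "xl", "xxl"],
    "sort": ["price", "name", "newest", "rating"],
}


def facet_urls(base_path, facets):
    """Return every URL formed by choosing zero or one value per facet."""
    names = list(facets)
    # None stands for "this facet is not applied".
    options = [[None] + list(values) for values in facets.values()]
    urls = []
    for combo in product(*options):
        params = [f"{name}={value}" for name, value in zip(names, combo) if value is not None]
        urls.append(base_path + ("?" + "&".join(params) if params else ""))
    return urls


urls = facet_urls("/shoes", FACETS)
print(len(urls))  # (5 + 1) * (6 + 1) * (4 + 1) = 210
```

Three small facets already turn one page into 210 crawlable URLs, most of which show the same products in a different order or subset.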

What to disallow

Use robots.txt to disallow patterns that produce low-value crawling, for example parameter URLs that do not change content meaningfully:

User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
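
Before deploying wildcard rules like the ones above, it helps to check which URLs they actually match. The sketch below is a simplified matcher, assuming Google-style pattern semantics (a pattern matches from the start of the path, * matches any run of characters, a trailing $ anchors the end). The function names and sample paths are hypothetical, and the sketch ignores Allow rules and rule precedence. It is hand-rolled because Python's standard urllib.robotparser does not interpret wildcards.

```python
import re


def rule_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with ".*" where the pattern had "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(url_path, disallow_patterns):
    """Return True if any Disallow pattern matches the path plus query string."""
    return any(rule_to_regex(p).match(url_path) for p in disallow_patterns)


RULES = ["/*?sort=", "/*?sessionid="]

# Hypothetical sample paths.
for path in ["/shoes?sort=price", "/shoes?color=red&sort=price", "/shoes", "/cart?sessionid=abc"]:
    print(path, "blocked" if is_disallowed(path, RULES) else "allowed")
```

Under these assumptions the first and last paths are blocked, while /shoes?color=red&sort=price is not, because /*?sort= only matches when sort is the first parameter. Covering later positions would need an additional pattern such as /*&sort=.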

Be careful not to block resources crawlers need to render pages (CSS, JS) or pages you actually want indexed. Note that disallowing a URL prevents crawling but does not deindex an already-indexed URL: a crawler that cannot fetch the page never sees a noindex directive on it. To remove such a URL from the index, allow crawling and use noindex. robots.txt steers crawl effort; it is not a deindexing tool.

Disallow low-value parameter and filter patterns
Do not block CSS/JS or pages you want indexed
Disallow steers crawling, it does not deindex
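
Because noindex only works on pages a crawler is allowed to fetch, it can be useful to confirm that a page really carries the directive. The following is a minimal sketch, assuming you already have the response headers and HTML in hand. The function name has_noindex is hypothetical, and the parsing is deliberately simple: it does not handle crawler-specific forms such as a user-agent prefix in the X-Robots-Tag header or a meta tag named after a single bot.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(headers, html):
    """Return True if the X-Robots-Tag header or a robots meta tag says noindex."""
    header_directives = set()
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":
            header_directives.update(d.strip().lower() for d in value.split(","))
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in header_directives or "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex({}, page))
print(has_noindex({"X-Robots-Tag": "noindex"}, ""))
```

If the same URL is also disallowed in robots.txt, a compliant crawler never requests it, so neither signal is read.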

How it appears in analytics and logs

Heavy crawl activity on parameter or low-value URLs can crowd out crawling of pages you care about. Disallowing those paths redirects crawl effort toward higher-value content.
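
One way to see this in raw server logs is to split crawler requests into parameter URLs and clean URLs. The sketch below assumes access logs in the common "combined" format and identifies the crawler by a user-agent substring. The function name and the sample lines are made up for illustration, and a user-agent match alone does not prove a request came from the real crawler, since the string can be spoofed.

```python
import re
from collections import Counter

# Assumes combined log format: request line, status, size, referrer, user agent.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawl_split(lines, bot_token="Googlebot"):
    """Count crawler requests to parameter URLs versus clean URLs."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or bot_token not in match.group("agent"):
            continue  # Unparseable line, or not the crawler we are tracking.
        bucket = "parameter" if "?" in match.group("target") else "clean"
        counts[bucket] += 1
    return counts


# Made-up sample lines, for illustration only.
BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LINES = [
    f'203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /shoes?sort=price HTTP/1.1" 200 512 "-" "{BOT}"',
    f'203.0.113.5 - - [10/Jan/2025:10:00:01 +0000] "GET /shoes?sessionid=abc HTTP/1.1" 200 512 "-" "{BOT}"',
    f'203.0.113.5 - - [10/Jan/2025:10:00:02 +0000] "GET /shoes HTTP/1.1" 200 2048 "-" "{BOT}"',
    '198.51.100.7 - - [10/Jan/2025:10:00:03 +0000] "GET /shoes?sort=name HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = crawl_split(SAMPLE_LINES)
print(dict(counts))
```

Running the same count before and after a robots.txt change gives a rough check on whether the share of crawler requests to parameter URLs actually dropped.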

Diagnostic use case

On a large site, stop crawlers from spending effort on faceted-navigation, parameter, or other low-value URLs so important pages are crawled more reliably.

What WebmasterID can help detect

WebmasterID shows which paths crawlers spend requests on, so you can see whether crawl effort is going to low-value URLs and confirm a robots.txt change redirected it.

Common mistakes

Blocking CSS/JS while trying to save crawl budget, harming rendering.
Expecting a Disallow to deindex already-indexed parameter URLs.
Managing crawl budget on a small site that does not need it.

Privacy and accuracy notes

robots.txt is a publicly readable file, so crawl-budget rules are public configuration. They involve no visitor data. Do not list sensitive paths expecting them to be hidden: anyone can read the rules.

Sources and verification notes

Google: Crawl budget management for large sites. Documents crawl budget and that it matters mainly for large sites.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.