Using robots.txt to protect crawl budget
Crawlers spend a finite amount of effort on each site, often called crawl budget, and on large sites they can waste it on low-value or near-duplicate URLs. robots.txt can steer them away from those paths so they reach your important pages more often. This matters mostly for big sites; small sites rarely need it.
When crawl budget matters
Google describes crawl budget as a concern primarily for large sites — many thousands of URLs — or sites that generate many URL variations. On a small site, Google generally crawls efficiently and you do not need to manage budget. The problem appears when low-value URLs (endless filter combinations, session parameters, near-duplicates) consume crawl effort that would be better spent on your real content.
- Mainly relevant to large or URL-heavy sites
- Faceted navigation and parameters generate low-value URLs
- Small sites rarely need to manage crawl budget
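A rough count shows how quickly filter combinations multiply. The sketch below is only an illustration: the facet names, facet sizes, and number of sort orders are made-up assumptions, and it assumes each filter can be set or left unset independently of the others.

```python
from math import prod

# Hypothetical facet sizes for a single category page (illustrative numbers only).
facets = {"color": 12, "size": 8, "brand": 40}
sort_orders = 4  # hypothetical number of sort options

# Each facet is either unset or set to one of its values: (n + 1) states per facet.
filter_states = prod(n + 1 for n in facets.values())

# Each filter state can be requested with no sort or with any one sort order.
url_variants = filter_states * (sort_orders + 1)

print(f"{url_variants} crawlable URL variants for one category page")
```

Under these assumptions a single category page expands into tens of thousands of distinct URLs, nearly all of which show the same products in a different order or subset.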
What to disallow
Use robots.txt to disallow patterns that produce low-value crawling, for example parameter URLs that do not change content meaningfully:
User-agent: *
Disallow: /*?sort=
Disallow: /*?sessionid=
Be careful not to block resources crawlers need to render pages (CSS, JS) or pages you actually want indexed. Note that disallowing a URL prevents crawling but does not remove an already-indexed URL from the index. To deindex a URL, allow crawling and use noindex, because a crawler can only see a noindex directive on a page it is allowed to fetch. robots.txt steers crawl effort; it is not a deindexing tool.
- Disallow low-value parameter and filter patterns
- Do not block CSS/JS or pages you want indexed
- Disallow steers crawling, it does not deindex
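Before deploying a pattern, it can help to check which URLs it would actually match. The following is a minimal sketch of wildcard matching as major crawlers describe it (* matches any sequence of characters, a trailing $ anchors the end of the URL, and rules match from the start of the path). The function names are hypothetical, and the sketch deliberately ignores Allow rules, user-agent groups, and rule precedence, so treat it as a way to sanity-check a pattern rather than a full robots.txt parser.

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    """Translate one robots.txt path pattern into a compiled regex (hypothetical helper)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text and turn each * into "any sequence of characters".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_disallowed(path_and_query: str, disallow_patterns: list[str]) -> bool:
    """Return True if any Disallow pattern matches from the start of the URL path."""
    return any(rule_to_regex(p).match(path_and_query) for p in disallow_patterns)

rules = ["/*?sort=", "/*?sessionid="]

print(is_disallowed("/shoes?sort=price", rules))         # True
print(is_disallowed("/shoes", rules))                    # False
# The parameter is not first in the query string, so "?sort=" never appears:
print(is_disallowed("/shoes?page=2&sort=price", rules))  # False
# Adding an "&" variant covers the parameter in later positions:
print(is_disallowed("/shoes?page=2&sort=price", rules + ["/*&sort="]))  # True
```

The third case is worth noting: a pattern such as /*?sort= only matches when sort is the first query parameter, so a site that emits the parameter in other positions may need a second rule for the & form.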
How it appears in analytics and logs
In server logs, this shows up as a large share of crawler requests going to parameter or other low-value URLs, which can crowd out crawling of pages you care about. Disallowing those paths redirects crawl effort toward higher-value content.
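One way to quantify this from raw access logs is to measure what fraction of a crawler's requests carry low-value parameters. The sketch below makes several assumptions: logs are in the common "combined" format, the crawler is identified by a substring of its user-agent string (which can be spoofed, so this is a simplification), and the parameter names sort and sessionid are the low-value ones. The function name and sample lines are hypothetical.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Assumes combined log format: "METHOD target PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Assumed low-value query parameters; adjust to the site's own URL scheme.
LOW_VALUE_PARAMS = {"sort", "sessionid"}

def crawl_share(lines, bot_token="Googlebot"):
    """Return the fraction of the bot's requests that hit low-value vs other URLs."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or bot_token not in m.group("agent"):
            continue  # unparseable line, or not the crawler we are measuring
        params = parse_qs(urlsplit(m.group("target")).query, keep_blank_values=True)
        bucket = "other" if LOW_VALUE_PARAMS.isdisjoint(params) else "low_value"
        counts[bucket] += 1
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}

BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
BROWSER = "Mozilla/5.0 (X11; Linux x86_64)"
# Made-up sample lines using a documentation-range IP address.
SAMPLE_LINES = [
    f'192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'192.0.2.10 - - [01/Jan/2025:00:00:02 +0000] "GET /shoes?color=red&sessionid=abc123 HTTP/1.1" 200 5120 "-" "{BOT}"',
    f'192.0.2.10 - - [01/Jan/2025:00:00:03 +0000] "GET /shoes/trail-runner HTTP/1.1" 200 7300 "-" "{BOT}"',
    f'192.0.2.20 - - [01/Jan/2025:00:00:04 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{BROWSER}"',
]

shares = crawl_share(SAMPLE_LINES)
print(shares)
```

Running the same measurement on logs from before and after a robots.txt change gives a simple check on whether the low-value share actually dropped.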
Diagnostic use case
On a large site, stop crawlers from spending effort on faceted-navigation, parameter, or other low-value URLs so important pages are crawled more reliably.
What WebmasterID can help detect
WebmasterID shows which paths crawlers spend requests on, so you can see whether crawl effort is going to low-value URLs and confirm a robots.txt change redirected it.
Common mistakes
- Blocking CSS/JS while trying to save crawl budget, harming rendering.
- Expecting a Disallow to deindex already-indexed parameter URLs.
- Managing crawl budget on a small site that does not need it.
Privacy and accuracy notes
Crawl-budget rules are public configuration. They involve no visitor data; do not list sensitive paths expecting them to be hidden.
Related pages
- Wildcards and path matching in robots.txt
Although the original protocol used simple prefix matching, major crawlers support two wildcards in path rules: * matches any sequence of characters, and $ anchors the end of the URL. This page covers how they behave, useful patterns, and the mistakes that make a rule too broad.
- robots.txt common mistakes
Most robots.txt problems come from a handful of recurring mistakes. This page collects the big ones — blocking the CSS and JS crawlers need to render, trying to deindex with Disallow, advertising secret paths, and treating an advisory file as enforcement — with the correct approach for each.
- robots.txt basics: what it does and what it cannot do
robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.
- Website observability
See where crawlers spend requests and whether a change helped.
Sources and verification notes
- Google: Crawl budget management for large sites. Documents crawl budget and that it matters mainly for large sites.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.