robots.txt and URL query parameters
Query-string URLs (?sort=, ?utm_source=, ?sessionid=) can multiply crawlable URLs. This page explains how robots.txt wildcards match parameters, when blocking helps, and why canonical or noindex is often better than a Disallow for duplicates.
Matching parameters with wildcards
Google supports * (any sequence) and $ (end of URL) in robots.txt paths. To block crawling of any URL containing a specific parameter, match the parameter pattern:
User-agent: *
Disallow: /*?*sort=
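That blocks any URL whose query string contains the text sort=. Because the match is textual, it also catches parameters whose names merely end in sort, such as resort=. To block all query strings on a path, use Disallow: /search? (a trailing * is redundant, because rules already match by prefix). Be careful: over-broad parameter blocks can also hide useful pages. Test patterns before deploying them.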
- * matches any sequence of characters
- $ anchors the end of the URL
- Disallow: /*?*param= blocks URLs containing that parameter
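One way to test a pattern before deploying it is to emulate the two wildcard rules locally. The sketch below is a simplified approximation of the matching described above, not Google's actual parser: it ignores Allow rules, rule precedence, and percent-encoding, and the function names are hypothetical.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into an anchored regex.

    '*' matches any sequence of characters, a trailing '$' anchors the
    end of the URL, and everything else is matched literally from the
    start of the path (prefix matching).
    """
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + regex + ("$" if anchored else ""))


def is_blocked(pattern: str, url_path: str) -> bool:
    """Return True if url_path (path plus query string) matches the pattern."""
    return robots_pattern_to_regex(pattern).match(url_path) is not None


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    for path in ["/products?sort=price", "/products?page=2&sort=price",
                 "/products?resort=1", "/products"]:
        print(path, is_blocked("/*?*sort=", path))
```

Running candidate patterns against a list of URLs that must stay crawlable is a quick way to spot an over-broad rule. Note that /products?resort=1 matches here, which shows why a pattern keyed on sort= can catch more than intended.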
Block vs canonical vs noindex
Blocking parameter URLs in robots.txt stops crawling but, like any Disallow, prevents the crawler from seeing a canonical or noindex on those URLs. For duplicate content (sort/filter variants of the same content), a rel=canonical to the clean URL usually consolidates signals better than a block.
Use a Disallow when the parameter URLs are genuinely worthless to crawl (session IDs, infinite calendars). Use canonical/noindex when the variants should still pass signals or be discoverable. Google also no longer offers a URL Parameters tool, so robots.txt and on-page signals are the levers now.
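As a rough illustration of the canonical approach, the sketch below derives a clean URL by dropping parameters that are assumed not to change the page content, then builds the matching link tag. The IGNORED_PARAMS set and both function names are hypothetical; which parameters are safe to drop depends on the site.

```python
import html
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters never change what the page shows.
IGNORED_PARAMS = {"sort", "sessionid", "utm_source"}


def canonical_url(url: str) -> str:
    """Return the URL with ignored parameters removed and no fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in IGNORED_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


def canonical_link_tag(url: str) -> str:
    """Build a rel=canonical tag pointing at the clean URL."""
    href = html.escape(canonical_url(url), quote=True)
    return f'<link rel="canonical" href="{href}">'


if __name__ == "__main__":
    print(canonical_link_tag("https://example.com/shoes?page=2&sort=price"))
```

The tag only helps if crawlers can fetch the parameter URL, so this approach assumes those URLs are not also disallowed in robots.txt.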
How it appears in analytics and logs
Lots of crawler hits on ?-parameter variants of the same page mean crawlers are exploring parameter space — often a crawl-budget drain rather than valuable indexing.
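To make that pattern visible in raw logs, one option is to group crawler requests by path and count the distinct query strings seen for each. This is a minimal sketch that assumes the request paths have already been extracted from the log and filtered to crawler user agents; it assumes no particular log format, and the function names are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def parameter_variants(request_paths):
    """Map each path to the set of distinct query strings requested for it."""
    variants = defaultdict(set)
    for raw in request_paths:
        parts = urlsplit(raw)
        if parts.query:  # skip requests without a query string
            variants[parts.path].add(parts.query)
    return variants


def top_parameter_paths(request_paths, limit=10):
    """Return (path, distinct_variant_count) pairs, most variants first."""
    counts = {
        path: len(queries)
        for path, queries in parameter_variants(request_paths).items()
    }
    # Sort by count descending, then by path for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:limit]


if __name__ == "__main__":
    # Sample hits are made up for illustration.
    hits = ["/shoes?sort=price", "/shoes?sort=name", "/shoes?sort=price",
            "/shoes", "/calendar?month=1"]
    print(top_parameter_paths(hits))
```

Paths with a high variant count relative to their real content are the first candidates for a canonical tag or a Disallow rule.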
Diagnostic use case
Stop crawlers wasting crawl budget on infinite parameter combinations (faceted navigation, session IDs) while keeping canonical pages indexable.
What WebmasterID can help detect
WebmasterID shows which parameterized URLs crawlers hit, so you can tell whether parameter-handling rules are actually curbing wasteful crawling.
Common mistakes
- Blocking parameter URLs that carry a canonical you wanted crawlers to read.
- Writing an over-broad pattern that also blocks important pages.
- Putting sensitive values in query strings and trusting robots.txt to hide them.
Privacy and accuracy notes
Parameter rules concern URL patterns, not visitors. Avoid relying on robots.txt to hide parameters that carry sensitive values — keep secrets out of URLs entirely.
Related pages
- Wildcards and path matching in robots.txt
Although the original protocol used simple prefix matching, major crawlers support two wildcards in path rules: * matches any sequence of characters, and $ anchors the end of the URL. This page covers how they behave, useful patterns, and the mistakes that make a rule too broad.
- The Clean-param directive in robots.txt explained
Clean-param is a Yandex-specific robots.txt directive that lists URL query parameters Yandex should ignore when crawling, helping consolidate duplicate URLs. This page explains its syntax, what it does, and why Google relies on different mechanisms.
- Canonical vs noindex: which to use
rel=canonical and noindex are often confused. Canonical tells search engines which of several similar URLs to treat as the primary, consolidating signals onto it. noindex removes a page from the index entirely. This page explains when each is right and why combining them on one URL sends conflicting signals.
- Attribution analytics
Understand UTM and parameter URLs in your traffic data.
Sources and verification notes
- Google: How Google interprets robots.txt (path matching, wildcards). Documents * and $ wildcard support in robots.txt paths.
- Google: Consolidate duplicate URLs (canonicalization). Canonical signals for parameter duplicates.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.