robots.txt basics: what it does and what it cannot do

robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.

Verified against primary sources

What robots.txt is

robots.txt lives at https://yourdomain/robots.txt and follows the Robots Exclusion Protocol (standardised as RFC 9309). Compliant crawlers fetch it before crawling a site and apply the rules in the group that matches their user-agent token, falling back to the * group when no named group matches. It is the standard way to express crawl preferences for a whole site in one place.
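Because the file is looked up at the root, each scheme, host, and port combination has its own robots.txt. As a minimal sketch of how a crawler might derive that location from any page URL, using only the Python standard library (the helper name robots_url and the example.com URLs are illustrative assumptions, not part of any crawler's API):

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host[:port]; replace the path and drop query/fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=1"))
print(robots_url("https://shop.example.com:8443/cart"))
```

Note that the subdomain in the second call resolves to its own file, so rules published on example.com would not cover shop.example.com.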

The core directives

A group starts with one or more User-agent lines, followed by Allow and Disallow path rules. A Sitemap line points crawlers at your sitemap; it may appear anywhere in the file and does not belong to any group.

User-agent: <token> — which crawler the group applies to (* = default)
Disallow: /path — ask crawlers not to fetch matching paths
Allow: /path — carve out exceptions inside a disallowed area
Sitemap: https://… — advertise your sitemap location
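To see the four directives working together, the sketch below parses a small illustrative file with Python's standard urllib.robotparser module. The paths, the sitemap URL, and the ExampleBot token are made up for the example. One caveat: Python's parser applies the first matching rule in file order, whereas RFC 9309 says the most specific (longest) matching rule should win, so the Allow line is placed before the Disallow line to get the same answer under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Illustrative file; the paths and sitemap URL are made up for this example.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/press-kit/
Disallow: /private/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for url in (
    "https://example.com/index.html",                  # no rule matches
    "https://example.com/private/report.html",         # matches Disallow
    "https://example.com/private/press-kit/logo.png",  # matches the Allow carve-out
):
    print(url, parser.can_fetch("ExampleBot", url))

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())
```

Here the first and third URLs should be reported as fetchable and the second as blocked, with the sitemap URL returned separately from the group rules.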

What robots.txt cannot do

Three limits matter. First, it is advisory: compliant crawlers honour it, but nothing forces a client to. Second, Disallow blocks crawling, not indexing: a blocked URL linked from elsewhere can still surface in results. To keep a page out of search, use a noindex meta tag or header, and leave the page crawlable, because a crawler that is not allowed to fetch the page never sees the noindex. Third, it is not security: never rely on it to protect private content.
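The noindex signal lives in the response itself, either in an X-Robots-Tag HTTP header or in a robots meta tag in the HTML, which is why the page has to stay fetchable. The following is a rough sketch of how you might check a response for it. The has_noindex function and its inputs are hypothetical, and the check is deliberately simplified: it ignores crawler-specific forms such as "googlebot: noindex" and the "none" shorthand.

```python
from html.parser import HTMLParser


class _RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            content = attr.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]


def has_noindex(headers: dict[str, str], html: str) -> bool:
    """Return True if the header or a robots meta tag carries noindex."""
    header_value = ""
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            header_value = value
    header_directives = [d.strip().lower() for d in header_value.split(",")]

    meta = _RobotsMetaParser()
    meta.feed(html)
    return "noindex" in header_directives or "noindex" in meta.directives


print(has_noindex({"X-Robots-Tag": "noindex, nofollow"}, "<html></html>"))
print(has_noindex({}, '<meta name="robots" content="noindex">'))
print(has_noindex({}, "<html><head><title>Hi</title></head></html>"))
```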

How it appears in analytics and logs

After you disallow a path, requests for it from compliant crawlers should stop appearing in your logs once those crawlers have refetched robots.txt. The URL itself can still appear in search results if it is linked from elsewhere, because Disallow blocks crawling, not indexing.
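One way to check this against your own logs is to replay logged requests through the same rules. The sketch below assumes the log has already been parsed into (user-agent, path) pairs; the entries, the bot names, and the violations helper are all hypothetical. It only uses a * group, and since user-agent strings can be spoofed, a hit is a lead to investigate rather than proof of who sent the request.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical, already-parsed access-log entries: (user-agent, requested path).
LOG_ENTRIES = [
    ("PoliteBot/2.1", "/index.html"),
    ("PoliteBot/2.1", "/robots.txt"),
    ("RudeBot/0.9", "/admin/login"),
    ("RudeBot/0.9", "/index.html"),
]


def violations(robots_txt, entries):
    """Return the entries that requested a path the rules disallow for that agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(ua, path) for ua, path in entries if not parser.can_fetch(ua, path)]


print(violations(ROBOTS_TXT, LOG_ENTRIES))
```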

Diagnostic use case

Write a correct robots.txt, understand which crawlers a rule applies to, and avoid the classic mistake of using Disallow to try to hide a page.
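A common surprise when working out which crawlers a rule applies to is that groups are not merged: a crawler that matches a named group follows only that group and ignores the * group. The sketch below illustrates this with Python's standard urllib.robotparser; the ExampleBot and OtherBot tokens and the paths are invented for the example.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /drafts/

User-agent: *
Disallow: /drafts/
Disallow: /search/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.com/search/results"

# ExampleBot has its own group, which does not mention /search/.
print("ExampleBot:", parser.can_fetch("ExampleBot", url))

# OtherBot matches no named group, so it falls back to the * group.
print("OtherBot:", parser.can_fetch("OtherBot", url))
```

If you want ExampleBot to stay out of /search/ as well, the rule has to be repeated inside its own group, as /drafts/ is here.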

What WebmasterID can help detect

WebmasterID shows which crawlers reach which paths, so you can confirm a robots.txt change had the intended effect on compliant crawlers and spot non-compliant clients that ignore it.

Common mistakes

Using Disallow to try to keep a page out of search (use noindex instead).
Listing sensitive paths in robots.txt, advertising them to everyone.
Assuming robots.txt blocks all bots — non-compliant clients ignore it.
Blocking CSS/JS that crawlers need to render the page.
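The last mistake is easy to test before you deploy a change. As a rough sketch, list the resources a page needs in order to render and check each one against the draft rules. The asset URLs, the ExampleBot token, and the blocked_assets helper below are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Draft rules that accidentally cover the stylesheet and script directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /assets/
"""

# Hypothetical resources a page needs in order to render.
RENDER_ASSETS = [
    "https://example.com/assets/site.css",
    "https://example.com/assets/app.js",
    "https://example.com/images/hero.png",
]


def blocked_assets(robots_txt, user_agent, urls):
    """Return the URLs that user_agent would not be allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(user_agent, u)]


print(blocked_assets(ROBOTS_TXT, "ExampleBot", RENDER_ASSETS))
```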

Privacy and accuracy notes

robots.txt is public; anyone can read it. Listing a secret path there does not hide it; it advertises it. Use authentication for anything truly private.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.