robots.txt basics: what it does and what it cannot do
robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.
What robots.txt is
robots.txt lives at https://yourdomain/robots.txt and follows the Robots Exclusion Protocol (RFC 9309). Each scheme, host, and port combination has its own file, so a subdomain needs its own robots.txt. Compliant crawlers fetch it before crawling and apply only the group that best matches their user-agent token, falling back to the * group when no specific group matches.
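Because the file is tied to the site root, you can derive its location from any page URL. The following is a minimal sketch using Python's standard library; the function name robots_url and the example.com URLs are illustrative, not part of any standard API.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and netloc (host plus any port); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
print(robots_url("https://shop.example.com:8443/cart"))
# https://shop.example.com:8443/robots.txt
```

The second call shows why a subdomain or a non-default port is treated as a separate site: it resolves to a different robots.txt.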
The core directives
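A group starts with one or more User-agent lines, followed by Allow and Disallow path rules. When an Allow rule and a Disallow rule both match a URL, RFC 9309 says the most specific (longest) matching path wins. A Sitemap line is independent of any group, may appear anywhere in the file, and points crawlers at your sitemap.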
- User-agent: <token> — which crawler the group applies to (* = default)
- Disallow: /path — ask crawlers not to fetch matching paths
- Allow: /path — carve out exceptions inside a disallowed area
- Sitemap: https://… — advertise your sitemap location
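To see the four directives working together, the sketch below parses a small robots.txt with Python's standard urllib.robotparser module and asks what two crawlers may fetch. The domain example.com, the paths, and the tokens ExampleBot and SomeBot are made up for the example. One caveat: this parser applies the first matching rule in a group rather than the longest match, so the Allow line is deliberately placed before the Disallow line to give the same answer under either approach.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: a default group with one carve-out,
# a stricter group for one named crawler, and a sitemap line.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/help/
Disallow: /private/

User-agent: ExampleBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# SomeBot has no group of its own, so the * group applies.
print(parser.can_fetch("SomeBot", "https://example.com/private/help/faq"))  # True
print(parser.can_fetch("SomeBot", "https://example.com/private/data"))      # False

# ExampleBot matches its own group, which disallows everything.
print(parser.can_fetch("ExampleBot", "https://example.com/blog/"))          # False

# site_maps() requires Python 3.8 or later.
print(parser.site_maps())  # ['https://example.com/sitemap.xml']
```

Note that ExampleBot is judged only by its own group: the Allow carve-out in the * group does not apply to it.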
What robots.txt cannot do
Three limits matter. First, it is advisory: compliant crawlers honour it, but nothing forces a client to. Second, Disallow blocks crawling, not indexing: a blocked URL linked from elsewhere can still surface in results. To keep a page out of search, use a noindex meta tag or X-Robots-Tag header and leave the page crawlable, because a crawler that cannot fetch the page never sees the noindex. Third, it is not security: never rely on it to protect private content.
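For the second limit, the header form of noindex can be set by the application itself. The following is a minimal sketch of a WSGI app that adds X-Robots-Tag: noindex to responses under one path prefix. The /drafts/ prefix, the NOINDEX_PREFIXES name, and the response body are assumptions for illustration; a real site would more likely set this header in its web server or framework configuration.

```python
# Hypothetical path prefixes that should stay out of search results.
NOINDEX_PREFIXES = ("/drafts/",)


def app(environ, start_response):
    """Tiny WSGI app that marks some paths as noindex via a response header."""
    path = environ.get("PATH_INFO", "/")
    headers = [("Content-Type", "text/html; charset=utf-8")]
    if path.startswith(NOINDEX_PREFIXES):
        # The page stays fetchable, so a crawler can actually see this header.
        headers.append(("X-Robots-Tag", "noindex"))
    start_response("200 OK", headers)
    return [b"<!doctype html><title>Example</title>"]
```

To try it locally, the standard library's wsgiref.simple_server.make_server can serve app on a port of your choice. These paths must not also be disallowed in robots.txt, or the header will never be read.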
How it appears in analytics and logs
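In server logs, a path blocked in robots.txt should stop receiving requests from compliant crawlers once they have re-fetched the file, while requests for /robots.txt itself continue. In search analytics, a disallowed URL can still appear if it is linked elsewhere, because Disallow blocks crawling, not indexing.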
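One way to check this against your own logs is to replay logged requests through a robots.txt parser and list the ones that should not have happened. The sketch below assumes the user-agent token and request path have already been extracted from each log line; the GoodBot and RudeBot tokens, the paths, and the function name disallowed_hits are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

# Hypothetical (user_agent_token, requested_path) pairs taken from an access log.
REQUESTS = [
    ("GoodBot", "/blog/post"),
    ("GoodBot", "/robots.txt"),
    ("RudeBot", "/private/report"),
]


def disallowed_hits(robots_txt, requests):
    """Return the logged requests that the given robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [(ua, path) for ua, path in requests if not parser.can_fetch(ua, path)]


print(disallowed_hits(ROBOTS_TXT, REQUESTS))
# [('RudeBot', '/private/report')]
```

Treat a hit as a lead rather than proof: user-agent strings are self-reported, so the request came from a client claiming that token, and a crawler may also have been working from a cached copy of an older robots.txt.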
Diagnostic use case
Use this page to write a correct robots.txt, understand which crawlers a rule applies to, and avoid the classic mistake of using Disallow to try to hide a page.
What WebmasterID can help detect
WebmasterID shows which crawlers reach which paths, so you can confirm a robots.txt change had the intended effect on compliant crawlers and spot non-compliant clients that ignore it.
Common mistakes
- Using Disallow to try to keep a page out of search (use noindex instead, and leave the page crawlable so the noindex can be seen).
- Listing sensitive paths in robots.txt, advertising them to everyone.
- Assuming robots.txt blocks all bots — non-compliant clients ignore it.
- Blocking CSS/JS that crawlers need to render the page.
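The first mistake often shows up as a conflict: a page carries noindex but is also disallowed, so crawlers never fetch it and never see the noindex. The sketch below flags such pages using Python's standard urllib.robotparser module. It assumes you already have a list of paths that serve noindex; the function name hidden_noindex_pages and the sample paths are hypothetical, and passing "*" as the user agent evaluates the default group only.

```python
from urllib.robotparser import RobotFileParser


def hidden_noindex_pages(robots_txt, noindex_paths, user_agent="*"):
    """Return noindex paths that robots.txt prevents user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # A disallowed path cannot be crawled, so its noindex is never read.
    return [p for p in noindex_paths if not parser.can_fetch(user_agent, p)]


robots_txt = """\
User-agent: *
Disallow: /old/
"""

print(hidden_noindex_pages(robots_txt, ["/old/promo", "/thanks"]))
# ['/old/promo']
```

For each flagged path, the usual fix is to remove the Disallow rule so the noindex can take effect.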
Privacy and accuracy notes
robots.txt is public; anyone can read it. Do not list secret paths there as if hiding them — you would simply advertise them. Use authentication for anything truly private.
Related pages
- How to block GPTBot in robots.txt
If you do not want OpenAI's training crawler fetching your site, you can disallow GPTBot in robots.txt. This page gives the exact rule, clarifies that it does not affect ChatGPT-User or OAI-SearchBot, and is honest about the limits of robots-based blocking.
- GPTBot — OpenAI's web crawler
GPTBot is the crawler OpenAI uses to fetch publicly available web content that may be used to help train its foundation models. It is a declared, well-documented crawler with a stable robots.txt token, and OpenAI publishes both documentation and an IP range list so operators can identify and control it.
- HTTP 404 Not Found: what it means for crawlers
404 Not Found means the server has no resource at that URL. It is the correct, healthy response for genuinely missing pages — crawlers expect some 404s. Problems arise when important pages 404 by accident, when removed pages should signal 410, or when 'not found' pages wrongly return 200.
- Bot intelligence
Confirm which crawlers honour your robots rules.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.