robots.txt comments and encoding
robots.txt supports comments with the hash character and is parsed as a UTF-8 plain-text file. Getting the encoding wrong — a stray byte order mark, a non-UTF-8 charset, or comments placed where a directive is expected — can cause crawlers to misread or ignore rules. This page covers comment syntax and the encoding requirements that keep a file valid.
Comment syntax
A comment starts with a hash character and runs to the end of the line. Crawlers ignore everything from the hash onward, so you can document rules inline or on their own line.
# Block our staging crawler
User-agent: *
Disallow: /staging/ # not for indexing
Place comments on their own lines or after a complete directive. Do not put a hash in the middle of a directive, because everything after it is discarded and the rule is truncated. A comment is purely for humans: it never changes how a complete rule is applied.
- Comments begin with # and run to end of line
- They can be on their own line or trail a directive
- Comments never affect parsing of the rules themselves
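The comment rule above can be sketched in a few lines of Python. This is a minimal illustration of the "hash to end of line" behaviour, not any crawler's actual parser; the function name `strip_comment` is hypothetical.

```python
def strip_comment(line: str) -> str:
    """Return the part of a robots.txt line that precedes any '#' comment."""
    # Everything from the first hash to the end of the line is a comment.
    before_hash, _, _ = line.partition("#")
    return before_hash.strip()


sample = """# Block our staging crawler
User-agent: *
Disallow: /staging/ # not for indexing
"""

# Keep only the lines that still hold a directive once comments are removed.
directives = [d for d in map(strip_comment, sample.splitlines()) if d]
print(directives)
```

The full-line comment disappears entirely and the trailing comment is cut off, leaving `User-agent: *` and `Disallow: /staging/`.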
Encoding requirements
Google's specification states that robots.txt must be a UTF-8 encoded text file (which includes ASCII), and that Google may ignore characters outside the UTF-8 range, which can render rules invalid. A common trap is a UTF-8 byte order mark (BOM) that some editors save at the start of the file: a parser unaware of it can treat the BOM as part of the first line, breaking the first directive.
Google's parser specifically tolerates a leading BOM, but not every crawler does, so the safe practice is to save the file as UTF-8 without a BOM. Use Unix line endings, avoid smart-quote substitution from word processors, and serve the file with a text/plain content type.
- robots.txt must be UTF-8
- Save without a byte order mark for broad compatibility
- Serve as text/plain with plain straight characters, not smart quotes
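The checks above can be automated before deployment. The following is a rough sketch, assuming you have the file's raw bytes in hand; the function name `check_robots_bytes` and the warning strings are hypothetical, and the checks cover only the pitfalls named in this section.

```python
import codecs

# Typographic quotes that word processors often substitute for ' and ".
SMART_QUOTES = "\u2018\u2019\u201c\u201d"


def check_robots_bytes(raw: bytes) -> list[str]:
    """Return human-readable encoding warnings for raw robots.txt bytes."""
    warnings = []
    if raw.startswith(codecs.BOM_UTF8):
        warnings.append("file starts with a UTF-8 byte order mark")
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        warnings.append("file is not valid UTF-8")
        return warnings
    if "\r" in text:
        warnings.append("file contains carriage returns (non-Unix line endings)")
    if any(ch in SMART_QUOTES for ch in text):
        warnings.append("file contains smart quotes")
    return warnings


# A file saved with a BOM and Windows line endings yields two warnings.
print(check_robots_bytes(codecs.BOM_UTF8 + b"User-agent: *\r\nDisallow: /staging/\r\n"))
# A clean file yields an empty list.
print(check_robots_bytes(b"User-agent: *\nDisallow: /staging/\n"))
```

To check a file on disk, read it in binary mode, for example `open("robots.txt", "rb").read()`, so that Python does not silently decode or strip anything first.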
How it appears in analytics and logs
If rules that look correct are being ignored, the cause is often an encoding or comment-placement problem rather than the rule logic. Treat this as a parsing problem in the file, not as a sign of crawler misbehaviour.
Diagnostic use case
Annotate a robots.txt file safely and avoid encoding pitfalls — like a UTF-8 BOM — that can make a crawler skip the first directive.
What WebmasterID can help detect
WebmasterID records crawler fetches of your robots.txt, so if a malformed or wrongly encoded file is causing crawlers to behave unexpectedly, you can see the fetch pattern alongside the bot activity it governs.
Common mistakes
- Saving robots.txt with a UTF-8 BOM that breaks the first directive for strict parsers.
- Pasting from a word processor and introducing smart quotes or non-UTF-8 characters.
- Assuming a comment can disable part of a directive. A hash only discards the rest of its own line.
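The first two mistakes can often be repaired mechanically. Here is a minimal sketch, assuming the input is already valid UTF-8 (with or without a BOM); the name `normalize_robots` is hypothetical, and the function raises `UnicodeDecodeError` rather than guessing if the bytes are in some other charset.

```python
import codecs

# Map typographic quotes to their plain ASCII equivalents.
QUOTE_FIXES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize_robots(raw: bytes) -> bytes:
    """Return cleaned UTF-8 bytes: no BOM, LF line endings, straight quotes."""
    # The "utf-8-sig" codec decodes UTF-8 and drops a leading BOM if present.
    text = raw.decode("utf-8-sig")
    # Convert CRLF and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.translate(QUOTE_FIXES)
    # Encoding with plain "utf-8" never writes a BOM.
    return text.encode("utf-8")


dirty = codecs.BOM_UTF8 + b"User-agent: *\r\nDisallow: /staging/\r\n"
print(normalize_robots(dirty))
```

Review the cleaned output before publishing it: a smart quote inside a path usually means the rule was pasted from a document and may be wrong in other ways too.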
Privacy and accuracy notes
Comments and encoding concern the file's text only and never involve visitor data. WebmasterID treats robots.txt fetches by crawlers as bot events, separate from human analytics.
Related pages
- robots.txt basics: what it does and what it cannot do
robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.
- robots.txt common mistakes
Most robots.txt problems come from a handful of recurring mistakes. This page collects the big ones — blocking the CSS and JS crawlers need to render, trying to deindex with Disallow, advertising secret paths, and treating an advisory file as enforcement — with the correct approach for each.
- How to test your robots.txt
A robots.txt rule is only useful if it does what you think. This page covers how to test it — checking the live file, using Google Search Console's robots.txt report and URL Inspection, and confirming in your own logs that the intended crawlers are or are not fetching the affected URLs.
- WebmasterID docs
Reference guides for crawl control and bot intelligence.
Sources and verification notes
- Google: robots.txt specification (file format and encoding). States UTF-8 requirement, BOM handling, and comment syntax.
- Robots Exclusion Protocol (RFC 9309), file format. Defines comments and UTF-8 file encoding.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.