robots.txt for PDFs and non-HTML files

PDFs and other non-HTML files can rank in search. This page explains why X-Robots-Tag noindex (not robots.txt Disallow) is the right way to keep a PDF out of the index, and when blocking the file directory is appropriate.

Why Disallow is the wrong tool

PDFs have no <head>, so you cannot add a meta robots tag inside them. The instinct is to Disallow the PDF in robots.txt, but that only blocks crawling. Google can still show a URL-only listing for a known PDF it is not allowed to fetch, and because it never fetches the file, it never sees any noindex you send with it.

To remove a PDF from search, you must let crawlers fetch it and return a noindex instruction, which means not blocking it in robots.txt.

PDFs cannot carry a meta robots tag
Disallow blocks crawling, not indexing of a known URL
A blocked file's noindex header is never read
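
The conflict described above can be checked mechanically. The following is a minimal Python sketch, not a definitive implementation: it uses the standard-library urllib.robotparser to test whether a given robots.txt would stop a crawler from fetching a PDF, in which case any noindex header on that file would go unread. The helper name is_crawl_blocked, the example.com URLs, and the sample rules are hypothetical. This parser uses simple prefix matching, so it may not reproduce how Google evaluates wildcard patterns.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_blocked(robots_txt: str, url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if robots_txt forbids user_agent from fetching url.

    A blocked URL means a crawler never requests the file, so it never
    sees an X-Robots-Tag header sent with the response.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical rules: the /downloads/ directory is disallowed for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /downloads/
"""

# This PDF sits under the disallowed directory, so its noindex would never be read.
blocked_pdf = is_crawl_blocked(ROBOTS_TXT, "https://example.com/downloads/report.pdf")

# This PDF is crawlable, so a noindex header on its response can be seen.
open_pdf = is_crawl_blocked(ROBOTS_TXT, "https://example.com/docs/guide.pdf")

print(blocked_pdf, open_pdf)
```

If a PDF you want removed from search comes back as blocked, the Disallow rule is what needs to change first.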

Use X-Robots-Tag noindex

Send a noindex directive in the HTTP response header for the file:

X-Robots-Tag: noindex

Apply it to the PDF response (for example via your server or CDN config matching *.pdf). Because the crawler must fetch the file to read the header, do not also Disallow the path in robots.txt. Once the PDF is reprocessed and dropped from the index, you can decide whether to block crawling for resource reasons.
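
Most sites would set this header in their web server or CDN configuration. Where the files are served by a Python application instead, the same idea can be sketched as WSGI middleware. This is an illustrative sketch under stated assumptions: the names noindex_documents, NOINDEX_EXTENSIONS, demo_app, and response_headers are hypothetical, and matching on the .pdf extension alone is an assumption about how your documents are named.

```python
# Assumption: documents to keep out of the index are identified by extension.
NOINDEX_EXTENSIONS = (".pdf",)


def noindex_documents(app):
    """Wrap a WSGI app so responses for matching paths carry X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "").lower()

        def patched_start_response(status, headers, exc_info=None):
            if path.endswith(NOINDEX_EXTENSIONS):
                # Drop any existing X-Robots-Tag so the response carries one clear value.
                headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application that returns a small body for every path."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"demo"]


def response_headers(app, path):
    """Call a WSGI app for path and return its response headers as a dict."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["headers"] = dict(headers)

    list(app({"PATH_INFO": path}, start_response))
    return captured["headers"]


protected = noindex_documents(demo_app)
pdf_headers = response_headers(protected, "/files/report.pdf")
html_headers = response_headers(protected, "/index.html")
print(pdf_headers.get("X-Robots-Tag"), html_headers.get("X-Robots-Tag"))
```

Whichever layer sets the header, it is worth confirming on the live response that the PDF URL actually returns it.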

How it appears in analytics and logs

Crawler hits on .pdf URLs show that crawlers are fetching your documents, which is how their content gets indexed. A PDF can appear in results as a standalone search listing, separate from any HTML page that links to it.
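
As a rough illustration of spotting these hits in your own logs, the Python sketch below counts crawler requests for document URLs. It assumes the common "combined" access log format; the sample lines are invented for the example, and the names crawler_document_hits, DOCUMENT_EXTENSIONS, and CRAWLER_TOKENS are hypothetical. User-agent strings can be spoofed, so treat the counts as indicative only.

```python
import re
from collections import Counter

# Assumption: logs use the combined format, ending in "referer" "user-agent".
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)
DOCUMENT_EXTENSIONS = (".pdf", ".xlsx", ".docx")  # hypothetical list of non-HTML types
CRAWLER_TOKENS = ("googlebot", "bingbot")  # substrings looked for in the user agent


def crawler_document_hits(lines):
    """Count crawler requests per document path (query strings dropped, path lowercased)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        path = match.group("path").split("?", 1)[0].lower()
        agent = match.group("agent").lower()
        is_document = path.endswith(DOCUMENT_EXTENSIONS)
        is_crawler = any(token in agent for token in CRAWLER_TOKENS)
        if is_document and is_crawler:
            hits[path] += 1
    return hits


# Invented sample entries: a crawler fetching a PDF, a crawler fetching HTML,
# and an ordinary browser fetching the same PDF.
SAMPLE_LOG = [
    '203.0.113.5 - - [01/Jan/2025:10:00:00 +0000] "GET /files/report.pdf HTTP/1.1" '
    '200 48213 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.9 - - [01/Jan/2025:10:00:05 +0000] "GET /index.html HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [01/Jan/2025:10:01:00 +0000] "GET /files/report.pdf HTTP/1.1" '
    '200 48213 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

hits = crawler_document_hits(SAMPLE_LOG)
print(dict(hits))
```

Only the first sample line is counted: the second is an HTML page and the third comes from a regular browser.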

Diagnostic use case

Keep specific PDFs, spreadsheets, or other documents out of search results without the common mistake of blocking them in robots.txt and leaving a URL-only listing.

What WebmasterID can help detect

WebmasterID logs crawler hits on non-HTML URLs too, so you can see which documents crawlers fetch and whether your indexing controls match intent.

Common mistakes

Disallowing a PDF in robots.txt and still seeing a URL-only listing.
Blocking the file so its X-Robots-Tag noindex is never fetched.
Assuming robots.txt makes a confidential PDF private.

Privacy and accuracy notes

These rules concern your own files, not visitors. Neither robots.txt nor indexing directives such as X-Robots-Tag secure confidential documents; use authentication for genuinely private files.

Sources and verification notes

Google: robots meta tag, data-nosnippet, and X-Robots-Tag. X-Robots-Tag applies noindex to non-HTML files like PDFs.
Google: Block search indexing with noindex. Explains that a URL must be crawlable for its noindex to be seen.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.