robots.txt for PDFs and non-HTML files
PDFs and other non-HTML files can rank in search. This page explains why X-Robots-Tag noindex (not robots.txt Disallow) is the right way to keep a PDF out of the index, and when blocking the file directory is appropriate.
Why Disallow is the wrong tool
PDFs have no <head>, so you cannot add a meta robots tag inside them. The instinct is to Disallow the PDF in robots.txt, but that only blocks crawling. Google can still show a URL-only listing for a known PDF it is not allowed to fetch, and, crucially, it can never see a noindex directive you send with the file.
To remove a PDF from search, you must let crawlers fetch it and return a noindex instruction, which means not blocking it in robots.txt.
- PDFs cannot carry a meta robots tag
- Disallow blocks crawling, not indexing of a known URL
- A blocked file's noindex header is never read
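The conflict described above can be checked offline. The following is a minimal sketch, assuming Python's standard urllib.robotparser module, a made-up robots.txt body, and hypothetical example.com URLs. It lists the files you intend to noindex that robots.txt would stop a crawler from fetching. Note that this parser uses plain path-prefix matching, so it may not agree with Google's handling of wildcard rules.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt: str, urls: list[str], user_agent: str = "Googlebot") -> list[str]:
    """Return the URLs that robots.txt forbids user_agent from fetching."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-docs/
"""

# Hypothetical files that are meant to be served with X-Robots-Tag: noindex.
NOINDEX_TARGETS = [
    "https://example.com/private-docs/pricing.pdf",
    "https://example.com/downloads/guide.pdf",
]

# Any URL listed here is blocked, so its noindex header would never be read.
conflicts = blocked_urls(ROBOTS_TXT, NOINDEX_TARGETS)
print(conflicts)
```

In this sketch only the file under /private-docs/ is reported, because the Disallow rule prevents the fetch that would reveal its header.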
Use X-Robots-Tag noindex
Send a noindex directive in the HTTP response header for the file:
X-Robots-Tag: noindex
Apply it to the PDF response (for example via your server or CDN config matching *.pdf). Because the crawler must fetch the file to read the header, do not also Disallow the path in robots.txt. Once the PDF is reprocessed and dropped from the index, you can decide whether to block crawling for resource reasons.
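If the files are served by a Python application rather than by server or CDN config, the same header can be attached in code. This is a rough sketch under stated assumptions: the app speaks WSGI, and the names add_noindex_header and NOINDEX_SUFFIXES, along with the suffix list itself, are hypothetical choices for illustration, not part of any library.

```python
# Hypothetical list of file extensions that should stay out of the index.
NOINDEX_SUFFIXES = (".pdf", ".xlsx", ".docx")


def add_noindex_header(app):
    """Wrap a WSGI app so matching document responses carry X-Robots-Tag: noindex."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "").lower()

        def patched_start_response(status, headers, exc_info=None):
            if path.endswith(NOINDEX_SUFFIXES):
                # Drop any existing X-Robots-Tag so the header is sent once.
                headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
                headers.append(("X-Robots-Tag", "noindex"))
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return middleware
```

Usage would look like application = add_noindex_header(application). Matching on the request path keeps the rule in one place, which mirrors a server config rule matching *.pdf.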
How it appears in analytics and logs
Crawler hits on .pdf URLs show that documents are being fetched, and a fetched PDF is a candidate for indexing. A PDF can appear in results as a standalone search listing, separate from any HTML page that links to it.
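To see such hits in your own access logs, a short script is often enough. The sketch below assumes logs in the common "combined" format; the sample lines, the crawler tokens, and the suffix list are invented for illustration. User-agent strings can be spoofed, so treat the counts as an indication rather than proof.

```python
import re
from collections import Counter

# Matches the request, status, size, referrer, and user-agent fields
# of a combined-format access log line.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
CRAWLER_TOKENS = ("googlebot", "bingbot")    # assumed tokens, extend as needed
DOC_SUFFIXES = (".pdf", ".xlsx", ".docx")    # assumed non-HTML file types


def crawler_document_hits(lines):
    """Count crawler requests per non-HTML document path."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        path = match.group("path").split("?", 1)[0].lower()  # ignore query strings
        agent = match.group("agent").lower()
        if path.endswith(DOC_SUFFIXES) and any(t in agent for t in CRAWLER_TOKENS):
            hits[path] += 1
    return hits


# Made-up sample lines, not real traffic.
SAMPLE_LINES = [
    '203.0.113.5 - - [01/Jan/2026:10:00:00 +0000] "GET /docs/report.pdf HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.5 - - [01/Jan/2026:10:05:00 +0000] "GET /docs/report.pdf?ref=x HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '198.51.100.7 - - [01/Jan/2026:10:06:00 +0000] "GET /docs/report.pdf HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '203.0.113.9 - - [01/Jan/2026:10:07:00 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"',
]

hits = crawler_document_hits(SAMPLE_LINES)
print(hits.most_common())
```

With these sample lines, only the two Googlebot requests for /docs/report.pdf are counted; the browser visit and the HTML page are skipped.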
Diagnostic use case
Keep specific PDFs, spreadsheets, or other documents out of search results without the common mistake of blocking them in robots.txt and leaving a URL-only listing.
What WebmasterID can help detect
WebmasterID logs crawler hits on non-HTML URLs too, so you can see which documents crawlers fetch and whether your indexing controls match intent.
Common mistakes
- Disallowing a PDF in robots.txt and still seeing a URL-only listing.
- Blocking the file so its X-Robots-Tag noindex is never fetched.
- Assuming robots.txt makes a confidential PDF private.
Privacy and accuracy notes
These rules concern your own files, not visitors. robots.txt and indexing directives such as X-Robots-Tag do not secure confidential documents. Use authentication for genuinely private files.
Related pages
- robots.txt for images
robots.txt can control how image crawlers like Googlebot-Image fetch your images. This page explains how to allow or disallow image crawling, the trade-off with Google Images visibility, and why blocking images for search is different from blocking pages.
- X-Robots-Tag header examples
X-Robots-Tag carries indexing directives in the HTTP response header instead of the HTML body, which makes it the way to apply noindex or nofollow to PDFs, images, and other non-HTML files. This page gives concrete header examples and notes how server config applies them in bulk.
- The noindex meta tag
The noindex value of the meta robots tag tells search engines to keep a page out of their index. The catch trips people up constantly: for noindex to work, the crawler must be able to fetch the page — so you must not block the same URL in robots.txt.
- WebmasterID docs
See which non-HTML files crawlers fetch on your site.
Sources and verification notes
- Google: robots meta tag, data-nosnippet, and X-Robots-Tag. X-Robots-Tag applies noindex to non-HTML files like PDFs.
- Google: Block search indexing with noindex. Explains that a URL must be crawlable for its noindex to be seen.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.