How to block the Internet Archive crawler
The Internet Archive operates crawlers (historically using the ia_archiver token, and more recently archive.org_bot) that capture public pages for the Wayback Machine. This page explains how the crawler identifies itself, the robots.txt rule to disallow it, and the important caveat that the Archive's robots.txt handling has changed over time.
What the Internet Archive crawler is
The Internet Archive runs crawlers that capture public web pages for the Wayback Machine, a public archive of historical snapshots. Older crawls used the ia_archiver token; the Archive has also crawled under archive.org_bot. Match on the documented token, and be aware the Archive may use more than one.
Archiving is the Archive's purpose, so a Disallow rule is a request that its crawler not capture your pages, not a guarantee. The Archive sets its own policy on how robots.txt affects existing and future snapshots.
- robots.txt tokens seen: ia_archiver, archive.org_bot
- Purpose: capturing public-page snapshots for the Wayback Machine
- Archiving policy and robots.txt handling are set by the Archive
robots.txt rule
To request that the Internet Archive's crawler not capture your site:
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
Note that the Internet Archive has, at times, changed how it interprets robots.txt for archiving — historically honouring ia_archiver exclusions, but later stating it may not always use robots.txt to govern what is archived. For removal of existing snapshots, contact the Archive directly rather than relying on robots.txt alone.
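One way to sanity-check the rule before deploying it is to parse it locally. The sketch below uses Python's standard-library urllib.robotparser; the example.com URL and the OtherBot token are placeholders, and the result only shows how a compliant parser reads the file, not how the Archive will actually treat it.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups from this page, one directive per line.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
"""


def allowed(token: str, url: str = "https://example.com/page") -> bool:
    """Return True if a compliant parser would let `token` fetch `url`.

    The default URL is a placeholder, not a real endpoint.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    # "OtherBot" is a hypothetical token used to confirm other crawlers are unaffected.
    for token in ("ia_archiver", "archive.org_bot", "OtherBot"):
        print(token, "allowed" if allowed(token) else "disallowed")
```

If either Archive token comes back as allowed, the most likely cause is a formatting problem, such as User-agent and Disallow sharing one line.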
How it appears in analytics and logs
Requests carrying an Internet Archive token are archival crawl events, not human visits. They indicate the Wayback Machine is capturing snapshots of your public pages; treat them as bot traffic.
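A minimal sketch of separating these requests from human traffic, assuming you already have the user-agent string for each request. The function names and the sample strings are illustrative, and the token list covers only the two tokens named on this page; the Archive may use others.

```python
# Tokens named on this page; treat the list as incomplete.
ARCHIVE_TOKENS = ("ia_archiver", "archive.org_bot")


def is_archive_crawler(user_agent: str) -> bool:
    """Case-insensitive substring match against the known Archive tokens."""
    ua = user_agent.lower()
    return any(token in ua for token in ARCHIVE_TOKENS)


def split_traffic(user_agents):
    """Partition user-agent strings into (archive_bot, other) lists."""
    bots, other = [], []
    for ua in user_agents:
        (bots if is_archive_crawler(ua) else other).append(ua)
    return bots, other


if __name__ == "__main__":
    # Hypothetical sample values, not copied from real logs.
    sample = [
        "Mozilla/5.0 (compatible; archive.org_bot)",
        "ia_archiver",
        "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    ]
    bots, other = split_traffic(sample)
    print(len(bots), "archive crawler requests;", len(other), "other requests")
```

Matching on the user-agent string identifies declared crawlers only; a client can send any string it likes, so this is a classification aid rather than proof of origin.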
Diagnostic use case
Ask the Internet Archive's crawler not to capture your public pages for the Wayback Machine, while understanding that archiving policy and robots.txt handling are set by the Archive.
What WebmasterID can help detect
WebmasterID classifies the Internet Archive crawler server-side and shows whether it still reaches your pages after a robots.txt change, so you can see whether archival crawling continues.
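The same before-and-after question can be roughly approximated by hand from your own access logs. The sketch below is not WebmasterID's implementation, which this page does not describe; it assumes you have already parsed each request into a (timestamp, user_agent) pair, and the function name and sample data are hypothetical.

```python
from datetime import datetime

# Tokens named on this page; the Archive may use others.
ARCHIVE_TOKENS = ("ia_archiver", "archive.org_bot")


def archive_hits_around(records, changed_at):
    """Count Archive-token requests before and after `changed_at`.

    `records` is an iterable of (timestamp, user_agent) pairs, where
    timestamp is a datetime comparable with `changed_at`.
    Returns a (before, after) tuple of counts.
    """
    before = after = 0
    for timestamp, user_agent in records:
        ua = user_agent.lower()
        if not any(token in ua for token in ARCHIVE_TOKENS):
            continue  # not an Archive-token request
        if timestamp < changed_at:
            before += 1
        else:
            after += 1
    return before, after


if __name__ == "__main__":
    # Hypothetical data: the robots.txt change is assumed to land on 1 March.
    changed_at = datetime(2026, 3, 1)
    records = [
        (datetime(2026, 2, 10), "ia_archiver"),
        (datetime(2026, 2, 20), "Mozilla/5.0 (compatible; archive.org_bot)"),
        (datetime(2026, 3, 5), "ExampleBrowser/1.0"),
        (datetime(2026, 3, 9), "ia_archiver"),
    ]
    before, after = archive_hits_around(records, changed_at)
    print("before:", before, "after:", after)
```

A non-zero count after the change does not by itself show the rule is being ignored: a crawler may still be working from a cached copy of robots.txt for some time, and it will continue to request robots.txt itself.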
Common mistakes
- Assuming robots.txt removes pages already captured in the Wayback Machine — it does not; contact the Archive.
- Targeting only one Archive token when more than one has been used.
- Expecting robots.txt to enforce archiving policy the Archive controls.
Privacy and accuracy notes
Blocking the Archive crawler relies only on the request user-agent token. No visitor identity is involved, and WebmasterID records the crawl as a bot event kept out of human analytics.
Related pages
- robots.txt basics: what it does and what it cannot do
robots.txt is a plain-text file at your site root that tells compliant crawlers which paths they may request. This page covers the directives, how user-agent groups are matched, and the limits that trip people up: robots.txt is advisory, it does not hide pages from search, and it is not a security boundary.
- The noarchive robots directive explained
noarchive is a robots directive that asks search engines not to offer a cached copy of a page. This page explains where to set it, which engines historically honoured it, and why its practical relevance changed after Google retired its cache link.
- How to block the Censys scanner
Censys runs internet-wide scanning that catalogs hosts and services for security research. Because it operates at the host/port level rather than fetching pages as a polite web crawler, robots.txt is largely ineffective. This page explains what Censys does and why firewall-level controls, not robots.txt, are the right response.
- Web crawler reference
How declared crawlers identify themselves in your logs.
Sources and verification notes
- Internet Archive, Wayback Machine help: Wayback Machine archiving and exclusion guidance; tokens ia_archiver and archive.org_bot observed in crawl logs.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.