
How to block the Internet Archive crawler

The Internet Archive operates crawlers (historically using the ia_archiver token, and more recently archive.org_bot) that capture public pages for the Wayback Machine. This page explains how the crawler identifies itself, the robots.txt rule to disallow it, and the important caveat that the Archive's robots.txt handling has changed over time.

Partially verified

What the Internet Archive crawler is

The Internet Archive runs crawlers that capture public web pages for the Wayback Machine, a public archive of historical snapshots. Older crawls used the ia_archiver token; the Archive has also crawled under archive.org_bot. Write rules against both documented tokens, because the Archive may crawl under more than one.

Archiving is the Archive's purpose, so a Disallow rule is a request not to capture your pages, not an enforcement mechanism. The Archive sets its own policy on how robots.txt affects existing and future snapshots.

robots.txt tokens seen: ia_archiver, archive.org_bot
Purpose: capturing public-page snapshots for the Wayback Machine
Archiving policy and robots.txt handling are set by the Archive

robots.txt rule

To request that the Internet Archive's crawler not capture your site:

User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /

The Internet Archive has changed how it interprets robots.txt for archiving over time: it historically honoured ia_archiver exclusions, but later stated that it may not always use robots.txt to govern what is archived. To have existing snapshots removed, contact the Archive directly rather than relying on robots.txt alone.
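Before deploying, it can help to confirm that the rules parse the way you expect. The following is a minimal sketch using Python's standard urllib.robotparser; the is_blocked helper, the example.com URL, and the SomeOtherBot token are assumptions made for illustration. It only shows how a standards-following parser reads the file, not how the Archive itself will act on it.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups from this page, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: archive.org_bot
Disallow: /
"""


def is_blocked(robots_txt: str, token: str, url: str = "https://example.com/") -> bool:
    """Return True if the parsed rules disallow `url` for the given token.

    `is_blocked` and the example.com URL are hypothetical, for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(token, url)


# SomeOtherBot is a made-up token used to show that other crawlers are unaffected.
for token in ("ia_archiver", "archive.org_bot", "SomeOtherBot"):
    status = "blocked" if is_blocked(ROBOTS_TXT, token) else "allowed"
    print(f"{token}: {status}")
```

Pass the bare token rather than a full user-agent string, since the parser matches on the product token.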

How it appears in analytics and logs

Requests carrying an Internet Archive token are archival crawl events, not human visits. They indicate the Wayback Machine is capturing snapshots of your public pages; treat them as bot traffic.
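One way to separate these requests in your own logs is to match the user-agent field against both tokens. The sketch below assumes access logs in the common "combined" format, where the user agent is the last quoted field. The helper names and the sample lines are made up for illustration; the sample user-agent strings are placeholders, not verified Archive strings.

```python
import re
from collections import Counter

# Tokens named on this page; matched case-insensitively as substrings.
ARCHIVE_TOKENS = ("ia_archiver", "archive.org_bot")

# Assumption: combined log format, so the user agent is the last quoted field.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')


def archive_token(log_line: str):
    """Return the Archive token found in the line's user agent, or None."""
    match = UA_PATTERN.search(log_line)
    if not match:
        return None
    user_agent = match.group(1).lower()
    for token in ARCHIVE_TOKENS:
        if token in user_agent:
            return token
    return None


def count_archive_hits(lines):
    """Count requests per Archive token across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        token = archive_token(line)
        if token:
            counts[token] += 1
    return counts


# Hypothetical sample data; the user-agent strings are illustrative placeholders.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ia_archiver"',
    '192.0.2.11 - - [01/Jan/2025:00:00:02 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; archive.org_bot)"',
    '192.0.2.12 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

print(dict(count_archive_hits(SAMPLE_LINES)))
```

Matching on the token alone trusts whatever the client sends, so a request that spoofs the token would be counted as well.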

Diagnostic use case

Ask the Internet Archive's crawler not to capture your public pages for the Wayback Machine, while understanding that archiving policy and robots.txt handling are set by the Archive.

What WebmasterID can help detect

WebmasterID classifies the Internet Archive crawler server-side and shows whether it still reaches your pages after a robots.txt change, so you can see whether archival crawling continues.
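The same before-and-after comparison can be approximated by hand from classified crawl events. This is a rough sketch, not WebmasterID's implementation: the function name, the event tuples, and the dates are all hypothetical.

```python
from datetime import datetime, timezone


def hits_before_and_after(events, changed_at):
    """Split Archive crawl events into counts before and after a robots.txt change.

    `events` is an iterable of (timestamp, token) pairs; the name and shape
    are assumptions made for this sketch.
    """
    before = sum(1 for timestamp, _ in events if timestamp < changed_at)
    after = sum(1 for timestamp, _ in events if timestamp >= changed_at)
    return before, after


# Hypothetical deployment time of the new robots.txt.
CHANGED_AT = datetime(2025, 1, 10, tzinfo=timezone.utc)

# Hypothetical crawl events, e.g. produced by classifying access-log lines.
EVENTS = [
    (datetime(2025, 1, 5, tzinfo=timezone.utc), "ia_archiver"),
    (datetime(2025, 1, 9, tzinfo=timezone.utc), "archive.org_bot"),
    (datetime(2025, 1, 12, tzinfo=timezone.utc), "archive.org_bot"),
]

before, after = hits_before_and_after(EVENTS, CHANGED_AT)
print(f"before change: {before}, after change: {after}")
```

A non-zero count after the change is not conclusive on its own, because a crawler may not re-read robots.txt immediately; a count that persists over a longer window is the more useful signal.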

Common mistakes

Assuming robots.txt removes pages already captured in the Wayback Machine — it does not; contact the Archive.
Targeting only one Archive token when more than one has been used.
Expecting robots.txt to enforce archiving policy the Archive controls.

Privacy and accuracy notes

Blocking the Archive crawler relies only on the request user-agent token. No visitor identity is involved, and WebmasterID records the crawl as a bot event kept out of human analytics.


Sources and verification notes

Internet Archive, Wayback Machine help: Wayback Machine archiving and exclusion guidance; tokens ia_archiver and archive.org_bot observed in crawl logs.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.