How to block Archive-It
Archive-It is the Internet Archive's subscription web-archiving service, used by libraries and institutions to capture and preserve websites. Its crawls are performed by the Internet Archive's crawler infrastructure, which uses a documented robots.txt token. This page explains how to ask Archive-It crawls to stay out and the caveats around archival capture.
What this means
Archive-It lets libraries, universities and other institutions build curated web-archive collections, captured using the Internet Archive's crawler. Blocking it asks those crawls to skip your site so it is not preserved into a subscriber's collection.
Archive-It crawls are distinct from the public Wayback Machine's own broad crawling, though both run on Internet Archive infrastructure. A robots.txt rule asks the crawler not to fetch; it does not retroactively remove pages already captured. Removal of existing captures is handled through the Internet Archive's own removal processes.
How to block it
Target the Internet Archive crawler token in its own user-agent group. Match on the stable token rather than a full version string.
User-agent: archive.org_bot
Disallow: /
Because robots.txt is advisory, verify in your logs that archival fetches stop. Note that an individual Archive-It collection's crawl configuration can be set by the curating institution, so behaviour around robots.txt may depend on how that crawl was scoped.
- robots.txt token to target: archive.org_bot
- Archive-It runs on Internet Archive crawler infrastructure
- A block does not remove already-captured pages
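Before deploying the rule, it can help to confirm that it parses the way you expect. The sketch below feeds the rule to Python's standard-library robots.txt parser and checks that the archive.org_bot token is disallowed while an unrelated agent is not. The helper name is_allowed, the URL https://example.com/page and the agent name SomeOtherBot are illustrative assumptions. Python's parser is only one interpretation of robots.txt, so this checks the file's syntax and grouping, not how the Internet Archive's crawler will actually behave.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, in its own user-agent group.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent_token: str, url: str) -> bool:
    """Return True if Python's stdlib parser would let this token fetch the URL.

    Pass the bare token (for example "archive.org_bot"), not a full
    user-agent header: the stdlib parser matches on the part before
    the first "/" only.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent_token, url)


if __name__ == "__main__":
    # Hypothetical URL and second agent name, used only for illustration.
    for token in ("archive.org_bot", "SomeOtherBot"):
        print(token, is_allowed(ROBOTS_TXT, token, "https://example.com/page"))
```

A passing check only shows that the file is well-formed; whether archival fetches actually stop still has to be confirmed in your logs.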
How it appears in analytics and logs
A request from the Internet Archive crawler token associated with Archive-It is an archival fetch on behalf of a subscribing institution, not a human visit. It is bot traffic. The user-agent string is self-reported and can be spoofed, so treat repeated hits as probable archival crawl coverage rather than verified Internet Archive activity.
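As a rough way to see this traffic in raw logs, the sketch below counts requests per day whose user-agent contains the archive.org_bot token. It assumes the common "combined" access-log layout; the regular expression, the function name count_archival_hits, the sample lines and the full user-agent string in them are illustrative assumptions, not values taken from Internet Archive documentation. Because the match relies on the self-reported user agent, the counts are indicative only.

```python
import re
from collections import Counter

TOKEN = "archive.org_bot"

# Assumed "combined" log layout:
# ip - - [day/Mon/year:time zone] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]*\] '
    r'"(?P<request>[^"]*)" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_archival_hits(lines):
    """Count requests per day whose user-agent contains the crawler token."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and TOKEN in match.group("ua").lower():
            hits[match.group("day")] += 1
    return hits


# Hypothetical sample lines; the user-agent strings are illustrative only.
SAMPLE_LINES = [
    '192.0.2.10 - - [01/Jun/2026:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (compatible; archive.org_bot)"',
    '192.0.2.10 - - [01/Jun/2026:10:00:05 +0000] "GET /about HTTP/1.1" 200 300 "-" '
    '"Mozilla/5.0 (compatible; archive.org_bot)"',
    '203.0.113.5 - - [02/Jun/2026:09:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for day, count in sorted(count_archival_hits(SAMPLE_LINES).items()):
        print(day, count)
```

Comparing the daily counts before and after a robots.txt change gives a simple indication of whether archival fetches have stopped.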
Diagnostic use case
Ask Archive-It crawls to skip your site so pages are not captured into a subscriber's web-archive collection, and confirm the crawler's behaviour in your logs.
What WebmasterID can help detect
WebmasterID classifies Internet Archive crawler activity server-side, so you can see whether Archive-It collections are still fetching your pages after a robots.txt change, without reading raw logs.
Common mistakes
- Assuming a robots.txt rule deletes pages already archived — that needs the Internet Archive's removal process.
- Confusing Archive-It (subscription collections) with the general Wayback Machine crawl.
- Counting archival crawler hits as human traffic.
Privacy and accuracy notes
Blocking Archive-It relies only on the request user-agent token. No human identity is involved. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.
Related pages
- How to block the Internet Archive crawler
The Internet Archive operates crawlers (historically using the ia_archiver token, and more recently archive.org_bot) that capture public pages for the Wayback Machine. This page explains how the crawler identifies itself, the robots.txt rule to disallow it, and the important caveat that the Archive's robots.txt handling has changed over time.
- The noarchive robots directive explained
noarchive is a robots directive that asks search engines not to offer a cached copy of a page. This page explains where to set it, which engines historically honoured it, and why its practical relevance changed after Google retired its cache link.
- Archive-It crawler (Internet Archive)
Archive-It is a subscription web-archiving service run by the Internet Archive, used by libraries, universities, and institutions to capture and preserve selected websites on a schedule. Its crawler fetches the public pages an institution has chosen to archive, building curated collections rather than indexing the whole web for search. It appears in logs as archival fetches associated with the Internet Archive.
- Web crawler reference
How archival and other crawlers identify themselves.
Sources and verification notes
- Internet Archive, Archive-It help: Archive-It crawling and robots.txt handling documentation.
- Internet Archive, crawler and robots.txt FAQ: archive.org_bot token and robots.txt behaviour.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.