Robots & crawl control

How to block Archive-It

Archive-It is the Internet Archive's subscription web-archiving service, used by libraries and institutions to capture and preserve websites. Its crawls are performed by the Internet Archive's crawler infrastructure, which uses a documented robots.txt token. This page explains how to ask Archive-It crawls to stay out and the caveats around archival capture.

Partially verified

What this means

Archive-It lets libraries, universities and other institutions build curated web-archive collections, captured using the Internet Archive's crawler. Blocking it asks those crawls to skip your site so it is not preserved into a subscriber's collection.

Archive-It crawls are distinct from the public Wayback Machine's own broad crawling, though both run on Internet Archive infrastructure. A robots.txt rule asks the crawler not to fetch; it does not retroactively remove pages already captured, which is handled through the Internet Archive's own removal processes.

How to block it

Target the Internet Archive crawler token in its own user-agent group. Match on the stable token rather than a full version string.

User-agent: archive.org_bot
Disallow: /
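
Before deploying, it can help to confirm that the group parses the way you intend. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `is_allowed` helper, the `example.com` URL and the second crawler name are illustrative assumptions, and the result only shows how this particular parser reads the rule, not how the Internet Archive's crawler will behave.

```python
from urllib.robotparser import RobotFileParser

# The rule from this page, written as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser would let `user_agent` fetch `url`.

    Hypothetical helper for illustration. Pass the bare token, not a full
    user-agent string.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder site, not taken from the article.
archive_allowed = is_allowed(ROBOTS_TXT, "archive.org_bot", "https://example.com/page")
other_allowed = is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/page")

print("archive.org_bot allowed:", archive_allowed)
print("other crawler allowed:", other_allowed)
```

Because the group names only `archive.org_bot`, other crawlers are unaffected by it in this check.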

Because robots.txt is advisory, verify in your logs that archival fetches stop. Note that an individual Archive-It collection's crawl configuration can be set by the curating institution, so behaviour around robots.txt may depend on how that crawl was scoped.
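
One way to run that log check is to count requests per day whose user agent contains the token, then compare the days before and after the robots.txt change. This is a rough sketch that assumes an access log in the common "combined" format; the sample lines, IP addresses, dates and user-agent strings are made up for illustration, and `archival_hits_per_day` is a hypothetical helper name.

```python
import re
from collections import Counter

TOKEN = "archive.org_bot"

# Assumed combined log format:
# ip - - [day/month/year:time zone] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'\[(?P<day>[^:]+):[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def archival_hits_per_day(lines, token=TOKEN):
    """Count requests per day whose user agent contains `token`.

    Matching is a case-insensitive substring test on the stable token,
    so version details in the full user-agent string do not matter.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and token.lower() in match.group("ua").lower():
            counts[match.group("day")] += 1
    return counts


# Fabricated sample lines (documentation IP ranges, invented user agents).
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jun/2026:08:00:01 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; archive.org_bot)"',
    '203.0.113.5 - - [10/Jun/2026:08:00:04 +0000] "GET /b HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; archive.org_bot)"',
    '198.51.100.7 - - [10/Jun/2026:09:12:30 +0000] "GET /a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
    '203.0.113.5 - - [11/Jun/2026:08:00:02 +0000] "GET /robots.txt HTTP/1.1" 200 48 "-" "Mozilla/5.0 (compatible; archive.org_bot)"',
    '198.51.100.7 - - [12/Jun/2026:10:00:00 +0000] "GET /b HTTP/1.1" 200 734 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

hits = archival_hits_per_day(SAMPLE_LOG)
for day, count in hits.items():
    print(day, count)
```

In practice you would pass the lines of your real log file instead of `SAMPLE_LOG`. A count that falls to zero, or to robots.txt fetches only, after the change suggests the rule is being honoured.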

robots.txt token to target: archive.org_bot
Archive-It runs on Internet Archive crawler infrastructure
A block does not remove already-captured pages

How it appears in analytics and logs

A request that carries the Internet Archive crawler token associated with Archive-It is an archival fetch on behalf of a subscribing institution, not a human visit. Count it as bot traffic. The user agent is self-reported and can be spoofed, so treat repeated hits as a sign of archival crawl coverage rather than proof of who sent them.

Diagnostic use case

Ask Archive-It crawls to skip your site so pages are not captured into a subscriber's web-archive collection, and confirm the crawler's behaviour in your logs.

What WebmasterID can help detect

WebmasterID classifies Internet Archive crawler activity server-side, so you can see whether Archive-It collections are still fetching your pages after a robots.txt change, without reading raw logs.

Common mistakes

Assuming a robots.txt rule deletes pages already archived. Removing existing captures requires the Internet Archive's removal process.
Confusing Archive-It (subscription collections) with the general Wayback Machine crawl.
Counting archival crawler hits as human traffic.

Privacy and accuracy notes

Blocking Archive-It relies only on the request user-agent token. No human identity is involved. WebmasterID records the crawl as a bot event, separate from human analytics, and never attaches it to a visitor profile.


Sources and verification notes

Internet Archive: Archive-It help. Archive-It crawling and robots.txt handling documentation.
Internet Archive: crawler and robots.txt FAQ. archive.org_bot token and robots.txt behaviour.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.