Search bots

Archive-It crawler (Internet Archive)

Archive-It is a subscription web-archiving service run by the Internet Archive, used by libraries, universities, and institutions to capture and preserve selected websites on a schedule. Its crawler fetches the public pages an institution has chosen to archive, building curated collections rather than indexing the whole web for search. It appears in logs as archival fetches associated with the Internet Archive.

Partially verified

What this means

Archive-It lets institutions build and manage their own web archives through the Internet Archive. A curator selects sites and seeds, sets crawl frequency, and Archive-It captures those pages into a collection that is preserved and made accessible.

Unlike the broad Wayback crawl, Archive-It is curated and scoped: it captures what a subscribing institution has chosen to preserve, on that institution's schedule, rather than crawling the entire web.

How it identifies itself

Archive-It crawls are run by the Internet Archive and identify themselves through Internet Archive and Archive-It user-agent strings and infrastructure. Match on the documented archival identity rather than an exact version string. As with any user-agent, the string is a claim and can be copied by any other client.

Because exact tokens and IP ranges are not exhaustively published and the service is curated per institution, this entry is marked partially verified; the archival purpose and Internet Archive association are the reliable signals.

Operator: Internet Archive (Archive-It service)
Curated, institution-selected collections
Scheduled capture, not whole-web indexing
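
A minimal sketch of matching on the archival identity rather than an exact version. The tokens `archive.org_bot` and `archive-it`, the sample user-agent, and the function name `claims_archive_it` are illustrative assumptions, not an official or exhaustive list; a match only shows what the request claims to be.

```python
import re

# Assumed, illustrative tokens; exact tokens are not exhaustively published.
ARCHIVAL_UA_PATTERN = re.compile(r"archive\.org_bot|archive-it", re.IGNORECASE)


def claims_archive_it(user_agent: str) -> bool:
    """Return True if the user-agent claims an Internet Archive / Archive-It identity.

    This checks a claim only: any client can copy the string.
    """
    return bool(ARCHIVAL_UA_PATTERN.search(user_agent or ""))


if __name__ == "__main__":
    # Hypothetical sample string, used only to exercise the matcher.
    sample = "Mozilla/5.0 (compatible; archive.org_bot; Archive-It)"
    print(claims_archive_it(sample))
    print(claims_archive_it("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))
```

Matching on a loose, case-insensitive token keeps the check stable across crawler version changes, at the cost of also matching anything that imitates the string.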

robots.txt considerations

Archive-It crawls can be configured to honour robots.txt or, for some institutional collections, to ignore it, depending on the curator's settings and policy. To express a preference, target the relevant archival token in robots.txt, but understand that curated archiving policies vary and the rule may not be followed.

robots.txt is a request honoured by compliant crawlers, not an access control, and does not retroactively remove already-captured snapshots.
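
To check how a compliant crawler would read such a preference, the rules can be evaluated with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: the token `archive.org_bot`, the `/private/` path, and the `example.org` URLs are hypothetical, and the result says nothing about whether a given curated crawl is configured to honour the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the archival token is an assumption for illustration.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Allow: /
"""


def allowed_for(token: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return how a robots.txt-compliant crawler using `token` would treat `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


if __name__ == "__main__":
    print(allowed_for("archive.org_bot", "https://example.org/private/report"))
    print(allowed_for("archive.org_bot", "https://example.org/about"))
```

The check models only the request expressed in robots.txt; it is not an access control and has no effect on snapshots that were already captured.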

How it appears in analytics and logs

An Archive-It fetch means an institution has scheduled your site for preservation in a curated collection. It is archival bot traffic and preservation coverage, not search indexing or human audience.
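
As a rough illustration of keeping archival fetches apart from other traffic, the sketch below tallies access-log lines by the claimed user-agent. It assumes a combined-style log format in which the user-agent is the last quoted field, and it reuses hypothetical tokens (`archive.org_bot`, `archive-it`); the sample lines are made up, and this is not a description of how WebmasterID classifies traffic.

```python
import re
from collections import Counter

# Assumes the user-agent is the final quoted field on each log line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')
# Assumed, illustrative tokens; not an official list.
ARCHIVAL_HINT = re.compile(r"archive\.org_bot|archive-it", re.IGNORECASE)


def tally(log_lines):
    """Count lines whose user-agent claims an archival identity versus all others."""
    counts = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        user_agent = match.group(1) if match else ""
        bucket = "archival_bot" if ARCHIVAL_HINT.search(user_agent) else "other"
        counts[bucket] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines using documentation-reserved IP addresses.
    sample = [
        '203.0.113.5 - - [24/Jun/2026:10:00:00 +0000] "GET /about HTTP/1.1" 200 512 "-" '
        '"Mozilla/5.0 (compatible; archive.org_bot; Archive-It)"',
        '198.51.100.7 - - [24/Jun/2026:10:00:02 +0000] "GET / HTTP/1.1" 200 1024 "-" '
        '"Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"',
    ]
    print(dict(tally(sample)))
```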

Diagnostic use case

Recognise curated archival crawling from Archive-It in logs, separate it from broad search indexing and from the general Wayback crawl, and read it as institutional preservation.

What WebmasterID can help detect

WebmasterID classifies Archive-It fetches server-side as archival bot traffic and shows which pages were captured, keeping curated archiving out of human analytics.

Common mistakes

Confusing curated Archive-It capture with whole-web search indexing.
Assuming robots.txt always stops curated institutional archiving.
Counting archival crawl hits as human visits.

Privacy and accuracy notes

Identification uses the request user-agent and archival context only. No visitor identity is involved. WebmasterID records the fetch as a bot event, separate from human analytics.

Sources and verification notes

Internet Archive, Archive-It: Curated web-archiving service; exact crawler tokens and ranges not exhaustively published.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.