Archive-It crawler (Internet Archive)
Archive-It is a subscription web-archiving service run by the Internet Archive, used by libraries, universities, and institutions to capture and preserve selected websites on a schedule. Its crawler fetches the public pages an institution has chosen to archive, building curated collections rather than indexing the whole web for search. It appears in logs as archival fetches associated with the Internet Archive.
What this means
Archive-It lets institutions build and manage their own web archives through the Internet Archive. A curator selects sites and seeds, sets crawl frequency, and Archive-It captures those pages into a collection that is preserved and made accessible.
Unlike the broad Wayback crawl, Archive-It is curated and scoped: it captures what a subscribing institution has chosen to preserve, on that institution's schedule, rather than crawling the entire web.
How it identifies itself
Archive-It crawling is operated by the Internet Archive and identifies itself through Internet Archive and Archive-It infrastructure and user-agent strings. Match on the documented archival identity rather than on an exact version string. As with any user-agent, the string is a claim that anyone can copy.
Because exact tokens and IP ranges are not exhaustively published and the service is curated per institution, this entry is marked partially verified; the archival purpose and Internet Archive association are the reliable signals.
- Operator: Internet Archive (Archive-It service)
- Curated, institution-selected collections
- Scheduled capture, not whole-web indexing
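As a minimal sketch of matching on identity rather than on an exact version, the check below looks for archival tokens anywhere in the user-agent string. The token list is an assumption for illustration, drawn from the related agents named on this page; it is not an official or exhaustive registry, and `looks_like_archival_agent` is a hypothetical helper name.

```python
# Hypothetical token list, assumed for illustration only.
# Confirm the tokens your own logs show before relying on it.
ARCHIVAL_TOKENS = ("archive-it", "archive.org_bot", "ia_archiver")


def looks_like_archival_agent(user_agent: str) -> bool:
    """Return True if the user-agent claims an Internet Archive related identity.

    This is a substring match on the claimed identity, so it ignores
    version numbers. It cannot prove the request really came from the
    Internet Archive, because a user-agent string can be copied.
    """
    ua = user_agent.lower()
    return any(token in ua for token in ARCHIVAL_TOKENS)
```

A match only says the request claims an archival identity. Treat it as a classification hint, not as verification.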
robots.txt considerations
Archive-It crawls can be configured to honour or, for some institutional collections, override robots.txt depending on the curator's settings and policy. To express a preference, target the relevant archival token in robots.txt, but understand that curated archiving policies vary.
robots.txt is a request honoured by compliant crawlers, not an access control, and does not retroactively remove already-captured snapshots.
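To see how a compliant crawler would read a preference, the sketch below evaluates a sample robots.txt with Python's standard `urllib.robotparser`. The `archive.org_bot` token, the `/private/` path, and the `example.com` URLs are assumptions for illustration; the token that applies to a given Archive-It collection is not confirmed here, and a curator's settings may mean the rule is not honoured at all.

```python
from urllib.robotparser import RobotFileParser

# Sample rules. The token below is an assumption, not a confirmed
# Archive-It token; check your logs for the agent you actually see.
ROBOTS_TXT = """\
User-agent: archive.org_bot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return how a compliant crawler would interpret robots_txt for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# The archival token is asked to stay out of /private/ only.
print(is_allowed(ROBOTS_TXT, "archive.org_bot", "https://example.com/private/report.html"))
print(is_allowed(ROBOTS_TXT, "archive.org_bot", "https://example.com/index.html"))
```

This only models what a compliant parser concludes from the file. It does not block a request and has no effect on snapshots that were already captured.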
How it appears in analytics and logs
An Archive-It fetch means an institution has scheduled your site for preservation in a curated collection. It is archival bot traffic and preservation coverage, not search indexing or a human audience.
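One way to read this from raw logs is to tally which paths were fetched by agents claiming an archival identity, so they can be kept out of human visit counts. The sketch below assumes access logs in the common "combined" format and uses an assumed token list; `archival_paths` is a hypothetical helper and does not describe how any particular analytics product works.

```python
import re
from collections import Counter

# Assumed tokens for illustration; not an official or exhaustive list.
ARCHIVAL_TOKENS = ("archive-it", "archive.org_bot", "ia_archiver")

# Matches the request, status, size, referer and user-agent fields of a
# combined-format log line (assumed log format).
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def archival_paths(lines):
    """Count fetches per path for requests whose user-agent claims an archival identity."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # Skip lines that are not in the assumed format.
        ua = match.group("ua").lower()
        if any(token in ua for token in ARCHIVAL_TOKENS):
            counts[match.group("path")] += 1
    return counts
```

Paths in the returned counter indicate preservation coverage; requests that do not match remain available for search-crawler or human analysis.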
Diagnostic use case
Recognise curated archival crawling from Archive-It in logs, separate it from broad search indexing and from the general Wayback crawl, and read it as institutional preservation.
What WebmasterID can help detect
WebmasterID classifies Archive-It fetches server-side as archival bot traffic and shows which pages were captured, keeping curated archiving out of human analytics.
Common mistakes
- Confusing curated Archive-It capture with whole-web search indexing.
- Assuming robots.txt always stops curated institutional archiving.
- Counting archival crawl hits as human visits.
Privacy and accuracy notes
Identification uses the request user-agent and archival context only. No visitor identity is involved. WebmasterID records the fetch as a bot event, separate from human analytics.
Related pages
- Archival crawlers overview
Archival crawlers, led by the Internet Archive's Wayback Machine crawling, fetch public pages to preserve point-in-time snapshots for research, journalism, and the historical record. They are not search crawlers: they capture how a page looked rather than rank it. Understanding the difference keeps robots.txt and analytics decisions sensible, since archiving and indexing serve different goals.
- archive.org_bot — Internet Archive web crawler
archive.org_bot is a user-agent associated with Internet Archive crawling that fetches public web pages for preservation in collections such as the Wayback Machine. It is an archival agent, distinct from search-engine indexing crawlers, and identifies via an archive.org URL in its user-agent. Operators see it when their public pages are captured for long-term snapshots.
- ia_archiver and the Internet Archive crawler
ia_archiver is a long-standing user-agent token associated with crawling for the Internet Archive's Wayback Machine and related collections. The Internet Archive operates archival crawlers that fetch public pages to preserve snapshots over time. The token has historic ties to the Alexa crawler that fed early Archive collections, so log entries may show ia_archiver or archive.org-related agents depending on the crawl source.
- Web crawlers
How archival and search crawlers are detected and categorised.
Sources and verification notes
- Internet Archive (Archive-It): curated web-archiving service; exact crawler tokens and ranges not exhaustively published.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.