Archival crawlers overview
Archival crawlers, of which the Internet Archive's Wayback Machine crawler is the best known, fetch public pages to preserve point-in-time snapshots for research, journalism, and the historical record. They are not search crawlers: they capture how a page looked rather than rank it. Understanding the difference keeps robots.txt and analytics decisions sensible, since archiving and indexing serve different goals.
What this means
Archival crawlers exist to preserve the web. The Internet Archive's Wayback Machine is the best-known: it captures snapshots of public pages so they can be viewed as they appeared at a moment in time. Other libraries and projects run similar preservation crawling.
This is different from search indexing. A search crawler wants to rank your current content; an archival crawler wants a faithful record of your past content. One affects discoverability; the other affects the historical record.
How to read them and what to do
Archival crawling appears in logs as fetches from archive infrastructure, often identifying itself with an archive.org URL in the user-agent or with historic tokens such as ia_archiver. Treat it as preservation coverage, not audience.
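The identification described above can be sketched as a simple user-agent check. This is a minimal illustration, not a complete detector: the function name is hypothetical, the substrings are taken only from the tokens named in this overview, and the sample user-agent strings in the usage lines are illustrative rather than verified captures.

```python
# Substrings are assumptions based on the tokens named in this overview.
# Real archives may use other tokens, and user-agents can be spoofed.
ARCHIVAL_TOKENS = ("archive.org", "ia_archiver")


def is_archival_crawler(user_agent: str) -> bool:
    """Return True if the user-agent contains a known archival token."""
    ua = user_agent.lower()
    return any(token in ua for token in ARCHIVAL_TOKENS)


# Illustrative user-agent strings (assumed, not verified captures).
archive_ua = "Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)"
browser_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"

print(is_archival_crawler(archive_ua))
print(is_archival_crawler(browser_ua))
```

Substring matching is deliberately loose here because the exact token set varies across archives; a stricter check would need a verified token list.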
If you do not want pages archived, target the relevant token in robots.txt. Remember that robots.txt is a request rather than an enforcement mechanism: archives handle it differently, their handling has changed over time, and it does not retroactively remove existing snapshots. Because exact tokens and IP ranges vary, this overview is marked partially verified.
- Purpose: snapshot preservation, not ranking
- Identifies via archive.org URLs or historic tokens
- robots.txt does not retroactively remove existing snapshots
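To illustrate targeting a token in robots.txt, the sketch below uses Python's standard urllib.robotparser to check what a sample file would request. The robots.txt content, the /private/ path, the example.com URLs, and the ExampleSearchBot name are all hypothetical. The result shows only how Python's parser reads the rules; it says nothing about whether a given archive honours them, and it has no effect on snapshots that already exist.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ask the ia_archiver token to skip /private/,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/private/page"))
print(is_allowed(ROBOTS_TXT, "ia_archiver", "https://example.com/blog/post"))
print(is_allowed(ROBOTS_TXT, "ExampleSearchBot", "https://example.com/private/page"))
```

A check like this is useful for confirming that a rule is scoped to the archival token alone and does not accidentally block search crawlers.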
How it appears in analytics and logs
Archival crawler hits mean your public pages are being captured as snapshots. They represent preservation coverage and bot traffic, not search indexing or human audience.
Diagnostic use case
Classify archival crawl traffic in logs as preservation activity, distinct from search indexing and SEO tools, and make robots.txt decisions accordingly.
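The classification step can be sketched as a small log tally that counts archival hits per path. This assumes access logs in the common "combined" format; the regular expression, token list, function name, and sample lines are illustrative assumptions, not a description of how any particular analytics product works.

```python
import re
from collections import Counter

# Matches the quoted request line, status, size, referrer, and user-agent
# of a combined-format log line (format assumed for this sketch).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Tokens taken from this overview; the real token set varies by archive.
ARCHIVAL_TOKENS = ("archive.org", "ia_archiver")


def archival_hits_by_path(lines):
    """Count requests per path whose user-agent contains an archival token."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that do not fit the assumed format
        ua = match["ua"].lower()
        if any(token in ua for token in ARCHIVAL_TOKENS):
            hits[match["path"]] += 1
    return hits


# Hypothetical log lines using documentation-reserved IP addresses.
SAMPLE_LINES = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)"',
    '203.0.113.5 - - [10/Jan/2025:10:05:00 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; archive.org_bot +http://www.archive.org/details/archive.org_bot)"',
    '198.51.100.7 - - [10/Jan/2025:10:06:00 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.9 - - [10/Jan/2025:10:07:00 +0000] "GET /contact HTTP/1.1" 200 256 "-" "ia_archiver"',
]

print(archival_hits_by_path(SAMPLE_LINES))
```

Keeping the archival tally separate from human and search-crawler counts makes it easier to see which pages were captured without inflating audience figures.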
What WebmasterID can help detect
WebmasterID groups archival crawler hits server-side as bot traffic and shows which pages were captured, so preservation crawling stays out of human analytics and search-crawl coverage.
Common mistakes
- Confusing archival capture with search indexing.
- Expecting robots.txt to delete pages already archived.
- Counting archival crawl hits as human visits.
Privacy and accuracy notes
Archival crawlers are identified by user-agent and behaviour only. No visitor identity is involved; WebmasterID records their fetches as bot events, never as human profiles.
Related pages
- ia_archiver and the Internet Archive crawler
ia_archiver is a long-standing user-agent token associated with crawling for the Internet Archive's Wayback Machine and related collections. The Internet Archive operates archival crawlers that fetch public pages to preserve snapshots over time. The token has historic ties to the Alexa crawler that fed early Archive collections, so log entries may show ia_archiver or archive.org-related agents depending on the crawl source.
- archive.org_bot — Internet Archive web crawler
archive.org_bot is a user-agent associated with Internet Archive crawling that fetches public web pages for preservation in collections such as the Wayback Machine. It is an archival agent, distinct from search-engine indexing crawlers, and identifies via an archive.org URL in its user-agent. Operators see it when their public pages are captured for long-term snapshots.
- Wayback Machine Save Page Now fetcher
Save Page Now is the Internet Archive feature that captures a specific URL on demand when a person requests a snapshot through the Wayback Machine. Unlike background archival crawling, this fetch happens because someone asked for it right now, making it a user-triggered archival fetch. It appears in logs as an archive.org-identifying request tied to a save request rather than a scheduled crawl.
- Web crawlers
How archival and search crawlers are detected and categorised.
Sources and verification notes
- Internet Archive (Wayback Machine): primary archival crawling project; exact token set across archives varies.
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.