AI crawlers and CDN bandwidth costs

AI crawlers consume real bandwidth: every fetched page, image, and asset is billable egress on most CDNs. A broad or repeated crawl can move serving costs without moving audience, because none of it is a human visit. Caching, conditional requests, and rate limits keep the bill proportional to the value of being crawled.

Partially verified

Crawls are billable egress

Most CDNs bill on egress — the bytes served to clients — and a crawler is a client. Every HTML page, image, script, and font an AI crawler fetches counts toward that egress the same as a human page view would, except no person saw it.

For a large site, a thorough crawl multiplies the per-page byte cost across thousands of URLs and their assets. If several AI crawlers each crawl the full site on their own schedules, the aggregate can be a meaningful line on the CDN bill that is easy to misread as traffic growth.
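As a rough illustration of how that multiplication works, the sketch below estimates monthly crawler egress. Every input (page count, average bytes per page including assets, number of crawlers, passes per month, price per GB) is a hypothetical placeholder, not a measurement or any provider's actual pricing.

```python
def crawl_egress_gb(pages: int, avg_bytes_per_page: int,
                    crawlers: int, passes_per_month: int) -> float:
    """Estimated monthly egress, in decimal gigabytes, for full-site crawls."""
    total_bytes = pages * avg_bytes_per_page * crawlers * passes_per_month
    return total_bytes / 1e9


def egress_cost(egress_gb: float, price_per_gb: float) -> float:
    """Cost of that egress at a flat per-GB price (an assumed pricing model)."""
    return egress_gb * price_per_gb


if __name__ == "__main__":
    # Hypothetical site: 50,000 URLs, ~2 MB per page with assets,
    # 3 crawlers, each doing 4 full passes a month, at an assumed $0.08/GB.
    gb = crawl_egress_gb(50_000, 2_000_000, 3, 4)
    print(f"{gb:.0f} GB, about ${egress_cost(gb, 0.08):.2f} per month")
```

With these placeholder inputs the estimate comes to 1,200 GB a month. Real CDN pricing is often tiered or regional, so treat the flat rate as a simplification and substitute figures from your own logs and invoice.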

Why repeated fetches add up

Crawlers re-fetch. A training crawler may revisit to pick up changes, a search-oriented crawler refreshes its index, and a real-time agent fetches on demand. Each pass re-downloads pages unless caching and conditional requests let the crawler or the CDN avoid resending unchanged bytes.

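Without cache headers and a validator (an ETag or Last-Modified header), a crawler has no way to ask whether a page is unchanged and will re-download it in full. With strong caching and support for conditional requests, a crawler that sends If-None-Match or If-Modified-Since receives a 304 Not Modified response with no body when the content is unchanged, cutting repeat egress substantially. The saving applies only to crawlers that actually send conditional requests.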
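The conditional-request exchange can be sketched as a small, framework-free function. This is a minimal illustration, not how any particular server or CDN implements it: the helper names `make_etag` and `respond`, the hash-based ETag, and the one-hour `max-age` are all assumptions chosen for the example.

```python
import hashlib
from typing import Dict, Optional, Tuple


def make_etag(body: bytes) -> str:
    """Derive a quoted ETag from the content (one common approach)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def respond(body: bytes,
            if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for a GET with optional If-None-Match."""
    etag = make_etag(body)
    headers = {
        "ETag": etag,
        "Cache-Control": "public, max-age=3600",  # assumed lifetime
    }
    if if_none_match is not None:
        # The header may carry several comma-separated validators.
        candidates = [c.strip() for c in if_none_match.split(",")]
        # If-None-Match uses weak comparison, so ignore a W/ prefix.
        candidates = [c[2:] if c.startswith("W/") else c for c in candidates]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # unchanged: send headers only
    return 200, headers, body


if __name__ == "__main__":
    page = b"<html><body>Hello</body></html>"
    status, headers, payload = respond(page, None)
    print(status, len(payload))   # first fetch: full body
    status, _, payload = respond(page, headers["ETag"])
    print(status, len(payload))   # revisit with the ETag: empty body
```

The second call returns an empty payload, so only the response headers count toward egress for an unchanged page.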

Every crawled page and asset is billable CDN egress
Multiple crawlers each crawling the whole site multiply the cost
Cache headers and 304 responses cut repeat downloads of unchanged content

Keeping cost proportional to value

Not every crawl is worth its bandwidth. For each crawler token (the user-agent name a crawler identifies itself with), decide whether being crawled brings visibility you value, then size the cost against it. A search-oriented crawler that drives discovery may justify its egress; a training crawler you do not benefit from may not.

The levers are caching (so the CDN serves repeat fetches from its cache rather than pulling them from origin), conditional requests (so unchanged pages return 304 with no body), and rate limits (so a single token cannot fetch faster than you want to pay for). The exact savings depend on your site and provider, so measure before and after rather than assuming a figure.
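The rate-limit lever can be sketched as one token bucket per crawler. This is an illustrative sketch under stated assumptions: `ExampleBot` is a hypothetical crawler name, the class names are invented for this example, and matching on a user-agent substring is a simplification (user agents can be spoofed). Note that "tokens" inside the bucket are request credits, unrelated to the crawler's user-agent token.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Allows `rate_per_sec` requests on average, with bursts up to `burst`."""

    def __init__(self, rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = float(burst)
        self.credits = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.credits = min(self.capacity,
                           self.credits + (now - self.last) * self.rate)
        self.last = now
        if self.credits >= 1.0:
            self.credits -= 1.0
            return True
        return False


class CrawlerLimiter:
    """Maps a lowercase user-agent substring to its own bucket."""

    def __init__(self, limits: Dict[str, Tuple[float, int]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.buckets = {name.lower(): TokenBucket(rate, burst, clock)
                        for name, (rate, burst) in limits.items()}

    def status_for(self, user_agent: str) -> int:
        ua = user_agent.lower()
        for name, bucket in self.buckets.items():
            if name in ua:
                return 200 if bucket.allow() else 429  # 429 Too Many Requests
        return 200  # unlisted clients are not limited here


if __name__ == "__main__":
    # Hypothetical policy: ExampleBot gets 1 request/second, burst of 2.
    limiter = CrawlerLimiter({"ExampleBot": (1.0, 2)})
    ua = "Mozilla/5.0 (compatible; ExampleBot/1.0)"
    print([limiter.status_for(ua) for _ in range(3)])
```

In practice this logic usually lives in the CDN or reverse proxy rather than in application code; the sketch only shows the accounting behind such a rule.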

How it appears in analytics and logs

A rise in CDN egress with no matching rise in human sessions usually means crawler volume, not audience. If a single AI token accounts for a large share of bytes served, it is a cost driver worth caching or throttling.
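One way to check this is to sum bytes served per crawler token and compare each share with the total. The sketch below assumes the log has already been parsed into `(user_agent, bytes_sent)` pairs; the function names and the crawler names `ExampleBot` and `OtherBot` are hypothetical, and anything unmatched is lumped under "other" rather than assumed to be human.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple


def egress_by_source(records: Iterable[Tuple[str, int]],
                     crawler_tokens: Sequence[str]) -> Dict[str, int]:
    """Sum bytes served per crawler token; unmatched traffic goes to 'other'."""
    totals: Counter = Counter()
    for user_agent, bytes_sent in records:
        ua = user_agent.lower()
        # First token found in the user agent wins (case-insensitive).
        source = next((t for t in crawler_tokens if t.lower() in ua), "other")
        totals[source] += bytes_sent
    return dict(totals)


def egress_shares(totals: Dict[str, int]) -> Dict[str, float]:
    """Convert byte totals into fractions of all bytes served."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {source: n / grand_total for source, n in totals.items()}


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        ("Mozilla/5.0 (compatible; ExampleBot/1.0)", 700),
        ("Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0", 200),
        ("OtherBot/2.1", 100),
    ]
    totals = egress_by_source(sample, ["ExampleBot", "OtherBot"])
    for source, share in egress_shares(totals).items():
        print(f"{source}: {share:.0%} of bytes served")
```

A token with a large share of bytes and little visibility in return is the first candidate for caching or throttling.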

Diagnostic use case

Estimate and contain the CDN bandwidth that AI crawlers consume: separate crawler egress from human egress, cache aggressively, and rate-limit tokens whose fetch volume outweighs the visibility they bring.

What WebmasterID can help detect

WebmasterID separates AI-crawler requests from human traffic server-side. On the bot-intelligence surface, you can see which tokens fetched how many pages and attribute crawl volume to the right source rather than reading it as audience.

Common mistakes

Reading rising CDN egress as audience growth when it is crawler volume.
Serving every crawler request from origin instead of a cache.
Omitting cache headers and ETags, forcing crawlers to re-download unchanged pages.
Letting one high-volume token consume bandwidth without measuring the visibility it returns.

Privacy and accuracy notes

Bandwidth attribution here keys on the crawler token and request volume, not on visitor identity. Country at the CDN edge is a coarse estimate used for cost analysis, never for tracking people.

Sources and verification notes

MDN, HTTP conditional requests: Explains If-Modified-Since / ETag and 304 responses that avoid re-downloads.
MDN, HTTP caching: Cache headers that let a CDN serve repeat fetches without re-fetching origin.

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.