AI crawler reference: every AI and LLM bot, explained
A source-grounded reference to the crawlers that fetch web pages on behalf of AI assistants, LLM training pipelines, and AI search products. Each page explains what the crawler is, how it appears in logs where documented, robots.txt considerations, and what WebmasterID can help you detect, with no invented user-agent strings or behaviour.
103 AI crawlers documented · part of the Web Crawler & Traffic Intelligence Encyclopedia.
- GPTBot — OpenAI's web crawler
GPTBot is the crawler OpenAI uses to fetch publicly available web content that may be used to help train its foundation models. It is a declared, well-documented crawler with a stable robots.txt token, and OpenAI publishes both documentation and an IP range list so operators can identify and control it.
- ClaudeBot — Anthropic's web crawler
ClaudeBot is the web crawler operated by Anthropic to fetch publicly available content. It is a declared crawler with a documented robots.txt token, and Anthropic publishes guidance for operators who want to identify or restrict it. It is separate from Claude-User, the agent that fetches pages when a person asks Claude to browse.
- PerplexityBot — Perplexity's web crawler
PerplexityBot is the crawler operated by Perplexity to index publicly available web pages for its AI answer engine. Perplexity documents the crawler and its robots.txt token. It is separate from Perplexity-User, which fetches a page in real time in response to a user's question.
- ChatGPT-User — OpenAI real-time fetcher
ChatGPT-User is the token OpenAI uses for real-time fetches made when a person in ChatGPT browses or asks it to read a URL. It is distinct from GPTBot, which crawls for model training, and OpenAI documents both. It honours robots.txt and identifies itself with the ChatGPT-User token plus a self-identifying URL.
- OAI-SearchBot — OpenAI search crawler
OAI-SearchBot is the token OpenAI uses for crawling that supports its search features. OpenAI documents it as distinct from GPTBot, which crawls for model training, and from ChatGPT-User, the real-time browsing fetcher. It identifies itself with the OAI-SearchBot token plus a self-identifying URL.
- Claude-User — Anthropic real-time fetcher
Claude-User is the token Anthropic uses for real-time fetches made when a person asks Claude to read a specific URL. It is distinct from ClaudeBot, the background crawler, and Anthropic documents both. It identifies itself with the Claude-User token plus a self-identifying URL.
- Google-Extended — Google AI training control
Google-Extended is not a crawler or a user-agent string. It is a robots.txt token that lets site owners control whether their content is used to improve Google's generative AI models, such as the Gemini models behind the Gemini apps and Vertex AI. Googlebot continues to crawl for Search normally regardless of the Google-Extended setting.
- Applebot-Extended — Apple AI training control
Applebot-Extended is a robots.txt token Apple provides so site owners can opt out of having their content used to train Apple's generative AI models. It is a control, not a separate crawler: Applebot remains the user agent that powers Apple search features and Siri, and it keeps crawling regardless of the Applebot-Extended setting.
- CCBot — Common Crawl crawler
CCBot is the crawler operated by Common Crawl to build its open, freely available web dataset. That dataset is widely reused as a training source by many AI projects. Common Crawl documents the crawler and its robots.txt token, and CCBot honours robots.txt.
- Bytespider — ByteDance crawler
Bytespider is a web crawler affiliated with ByteDance. Its robots.txt token is Bytespider, and it appears in server logs as an automated fetcher. Public documentation is limited, so some specifics about its purpose and behaviour are marked partially verified rather than guessed.
- Amazonbot — Amazon crawler
Amazonbot is the web crawler operated by Amazon. Amazon documents the crawler, its robots.txt token, and how site owners can control it. Amazonbot honours robots.txt and identifies itself with the Amazonbot token plus a self-identifying URL.
- Meta-ExternalAgent — Meta AI crawler
Meta-ExternalAgent is the token Meta uses for its crawler supporting AI products. Meta documents it alongside the related Meta-ExternalFetcher token. It identifies itself with the Meta-ExternalAgent token plus a self-identifying URL and honours robots.txt.
- YouBot — You.com crawler
YouBot is the crawler operated by You.com to support its search and AI assistant. Its robots.txt token is YouBot. Public documentation is limited in places, so specifics that cannot be confidently sourced are marked partially verified rather than guessed.
- DuckAssistBot — DuckDuckGo assist crawler
DuckAssistBot is the token DuckDuckGo uses for crawling that supports its AI assist features. DuckDuckGo documents its crawlers and robots.txt handling. Where a specific detail is not clearly documented, it is marked partially verified rather than guessed.
- cohere-ai — Cohere crawler
cohere-ai is a crawler token associated with Cohere. It appears in server logs as an automated fetcher. Public documentation is limited, so specifics about its purpose and behaviour are marked partially verified rather than guessed; no behaviour is invented.
- AI2Bot — Allen Institute for AI crawler
AI2Bot is the crawler operated by the Allen Institute for AI (AI2) to gather web data for its datasets and research. AI2 documents the crawler and its robots.txt token. Where a specific is not clearly covered it is marked partially verified rather than guessed.
- ImagesiftBot — image dataset crawler
ImagesiftBot is an image-focused web crawler associated with ImageSift (linked to Hive). Its robots.txt token is ImagesiftBot. Public documentation is limited in places, so specifics that cannot be confidently sourced are marked partially verified rather than guessed.
- Omgilibot — Webz.io data crawler
Omgilibot is a web data crawler operated by Webz.io, also seen under the omgili name. Its robots.txt token is omgilibot. Public documentation is limited in places, so specifics that cannot be confidently sourced are marked partially verified rather than guessed.
- TimpiBot — Timpi crawler
TimpiBot is a crawler associated with Timpi, a decentralized-search project. It appears in server logs as an automated fetcher carrying the TimpiBot token. Public documentation is sparse, so its specifics are marked as not yet verified and only the identification pattern is described.
- Perplexity-User — Perplexity real-time fetch
Perplexity-User is the token Perplexity uses for real-time fetches triggered by a user's question, as opposed to the PerplexityBot indexing crawler. Perplexity documents both. It identifies itself with the Perplexity-User token plus a self-identifying URL, and is a user-triggered fetch rather than a bulk crawl.
- Claude-SearchBot — Anthropic search crawler
Claude-SearchBot is a token Anthropic uses for crawling that supports Claude's search features. It is distinct from ClaudeBot (background crawler) and Claude-User (real-time user fetch). Anthropic documents its crawlers; where a specific is thin, it is marked partially verified rather than guessed.
- MistralAI-User — Mistral fetch agent
MistralAI-User is the token Mistral uses for real-time fetches that support Le Chat, its assistant. Mistral documents its agents; where a specific is not clearly covered, it is marked partially verified rather than guessed. It identifies itself with the MistralAI-User token plus a self-identifying URL.
- Meta-ExternalFetcher — Meta on-demand fetch
Meta-ExternalFetcher is the token Meta uses for on-demand fetches, as opposed to Meta-ExternalAgent, its bulk AI crawler. Meta documents both. It identifies itself with the Meta-ExternalFetcher token plus a self-identifying URL and is controlled separately in robots.txt.
- Diffbot — knowledge-graph crawler
Diffbot is a crawler operated by Diffbot that extracts structured data from web pages to build and maintain a knowledge graph. Diffbot documents its crawler and robots.txt token. It identifies itself with the Diffbot token plus a self-identifying URL.
- Webzio-Extended — Webz.io AI data opt-out
Webzio-Extended is a robots.txt token Webz.io provides so site owners can control whether their content is used for AI-related data products. Webz.io operates web-data crawlers; where a specific is thin, it is marked partially verified rather than guessed.
- anthropic-ai — Anthropic legacy token
anthropic-ai is a historical robots.txt token associated with Anthropic's earlier crawling, now superseded by ClaudeBot. Anthropic's current guidance centres on ClaudeBot and its user-triggered fetcher. Keeping a legacy rule is harmless, but the active control is ClaudeBot.
- Claude-Web — legacy Anthropic token
Claude-Web is a historical robots.txt token associated with Anthropic's earlier crawling. Anthropic's current documented tokens are ClaudeBot and the user-triggered Claude-User. Where the legacy token's exact scope is unclear, it is marked partially verified rather than guessed.
- VelenPublicWebcrawler — Velen.io crawler
VelenPublicWebcrawler is a web-data crawler associated with Velen.io. Its robots.txt token is VelenPublicWebcrawler. Public documentation is limited, so specifics about its exact purpose and behaviour are marked as not yet verified rather than guessed.
- How AI crawlers differ from search crawlers
AI crawlers, traditional search crawlers, and real-time fetchers overlap in mechanics but differ in purpose: training a model, indexing for a search engine, or fetching a page live for a user. Understanding the distinction lets you set robots.txt policy and read your logs accurately.
- AI training crawlers vs AI search crawlers
Within a single AI vendor, training and search are usually handled by separate crawlers with separate robots.txt tokens. OpenAI's GPTBot crawls for training while OAI-SearchBot supports search features. Treating them as one control leads to policy mistakes.
- Real-time AI fetcher agents
Real-time AI fetcher agents — such as ChatGPT-User, Claude-User, and Perplexity-User — retrieve a specific page live when a person asks an assistant about it. They are user-triggered, not bulk crawls, and each has its own robots.txt token controlled separately from the vendor's background crawler.
- Should you block AI crawlers?
Whether to block AI crawlers is a trade-off between visibility in AI products and control over how your content is used. There is no universally correct answer. This entry lays out the considerations honestly, without legal overclaims, and points to the robots.txt mechanics.
- Verifying AI crawlers
Any client can copy a user-agent string, so a token alone is a claim, not proof. Some vendors, such as OpenAI for GPTBot, publish IP ranges or verification guidance; many do not. Verify before trusting, and never invent IP ranges to fill the gap.
- AI crawler traffic patterns
AI crawler activity often shows up as crawl waves — bursts as a vendor refreshes coverage — or as steadier background streams. Reading these patterns helps you interpret spikes correctly and, crucially, keep bot traffic separate from human analytics.
- How to opt out of AI training
Opting your content out of AI training is done through robots.txt: per-crawler tokens such as GPTBot and CCBot, plus dedicated control tokens like Google-Extended and Applebot-Extended. There is no single switch — you assemble the policy token by token, and it is a request to compliant systems.
- AI bot allowlist vs blocklist strategy
Two strategies for AI bots: a blocklist that allows everything except named bots (default-open), or an allowlist that blocks everything except named bots (default-closed). Each has a different maintenance cost and failure mode as new crawlers appear.
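The two strategies can be compared side by side with a small sketch. It builds one robots.txt for each strategy and checks them with Python's standard-library `urllib.robotparser`. The helper names, the `example.com` URL, and the `ExampleNewBot` token are hypothetical and used only for illustration, and the parser only models what a compliant crawler would do.

```python
from urllib.robotparser import RobotFileParser


def blocklist_robots(blocked_tokens):
    """Default-open: every crawler is allowed except the named tokens."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in blocked_tokens]
    return "\n\n".join(groups) + "\n"


def allowlist_robots(allowed_tokens):
    """Default-closed: every crawler is refused except the named tokens."""
    groups = [f"User-agent: {token}\nAllow: /" for token in allowed_tokens]
    groups.append("User-agent: *\nDisallow: /")
    return "\n\n".join(groups) + "\n"


def allows(robots_txt, token, url="https://example.com/page"):
    """Ask the standard-library parser whether `token` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


blocklist = blocklist_robots(["GPTBot", "CCBot"])
allowlist = allowlist_robots(["Googlebot"])

# A token that neither policy names, standing in for a newly launched crawler.
NEW_TOKEN = "ExampleNewBot"  # hypothetical token, not a real crawler

print("blocklist admits new token:", allows(blocklist, NEW_TOKEN))
print("allowlist admits new token:", allows(allowlist, NEW_TOKEN))
```

The unnamed token shows the failure mode of each strategy: the blocklist admits a new crawler until someone adds it, while the allowlist refuses it until someone reviews it.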
- Common Crawl and AI training data
Common Crawl publishes a large open web dataset gathered by its CCBot crawler. Because the dataset is freely redistributed, it has become a common training source across many AI projects. Allowing CCBot therefore has reach well beyond any single product.
- ByteDance crawlers overview
ByteDance, the company behind TikTok, operates web crawlers including Bytespider. Operators have reported relatively heavy crawling from ByteDance-affiliated tokens, but public documentation is limited, so volume and behaviour specifics are marked partially verified rather than asserted.
- Apple Intelligence and Applebot-Extended
Apple's AI features, branded Apple Intelligence, can draw on web content that Applebot crawls for Apple's services. Applebot-Extended is the robots.txt token that lets site owners opt out of that AI-training use while Applebot keeps crawling for Apple's search features and Siri.
- Undeclared AI scrapers and how they appear
Some AI scrapers do not declare a recognisable token. They appear with generic user agents, browser-like strings, or forged identities. They cannot be identified by a clean token, so the honest approach is to describe the pattern, verify what you can, and categorise conservatively.
- AI crawler vs AI referral traffic
An AI crawler hit is a bot fetching your page; an AI referral is a human who clicked through to your site from an AI assistant or answer engine. They are different events with different value, and merging them corrupts both your bot metrics and your human analytics.
- Measuring AI crawl coverage
AI crawl coverage is the share of your important URLs that declared AI crawlers have actually fetched in a window. Measuring it means joining a list of crawl-worthy pages to observed bot requests by token, then looking at which URLs were reached, how recently, and which were missed. It is a server-side measurement built from request logs, not from human analytics.
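The join described above might look like the following minimal sketch. It assumes log records have already been parsed into `(timestamp, url, user_agent)` tuples, which is a hypothetical shape. The user-agent values are simplified placeholders that merely contain the token, not any vendor's real string, and the dates are illustrative.

```python
from datetime import datetime, timedelta, timezone


def crawl_coverage(important_urls, requests, token, window_start):
    """Summarise which important URLs a given token fetched since window_start.

    `requests` is an iterable of (timestamp, url, user_agent) tuples,
    an assumed shape for already-parsed server log records.
    """
    wanted = set(important_urls)
    needle = token.lower()
    last_seen = {}
    for timestamp, url, user_agent in requests:
        if timestamp < window_start or url not in wanted:
            continue  # outside the window, or not a URL we care about
        if needle not in user_agent.lower():
            continue  # a different client
        if url not in last_seen or timestamp > last_seen[url]:
            last_seen[url] = timestamp  # keep the most recent fetch
    missed = sorted(wanted - set(last_seen))
    share = len(last_seen) / len(wanted) if wanted else 0.0
    return {"share": share, "last_seen": last_seen, "missed": missed}


now = datetime(2025, 1, 31, tzinfo=timezone.utc)  # illustrative date
log = [
    (now - timedelta(days=2), "/pricing", "placeholder-ua GPTBot"),
    (now - timedelta(days=40), "/docs", "placeholder-ua GPTBot"),  # too old
    (now - timedelta(days=1), "/docs", "placeholder-ua browser"),  # not the bot
]
report = crawl_coverage(
    ["/pricing", "/docs", "/about"], log, "GPTBot", now - timedelta(days=30)
)
print(report["share"], report["missed"])
```

A token match here is still only a claim. In practice the records would be filtered to verified requests before coverage is computed.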
- Rate-limiting AI crawlers
Rate-limiting AI crawlers throttles how fast they fetch without fully blocking them. Options range from robots.txt crawl-delay (honoured by some crawlers, ignored by others) to server-side or CDN request limits that return 429 Too Many Requests. The goal is to protect origin capacity while still allowing AI crawlers to read your content over time.
- AI crawlers, CDN and WAF
Most AI-crawler traffic hits your CDN and WAF before it ever reaches the origin. That edge layer is where allow, throttle, challenge, and block decisions are most effective. Some CDNs ship managed rules and verified-bot lists for AI crawlers; the trade-off is that a JavaScript challenge can break a legitimate crawler that does not execute scripts.
- AI crawlers and paywalled content
AI crawlers can only ingest what your server returns to them. For paywalled or metered content, that depends on whether the page is gated by hard access control or by a soft, client-side wall. robots.txt asks compliant crawlers to stay out; only real authentication or server-side gating actually prevents an AI crawler from reading the full text.
- Do AI crawlers obey robots.txt?
Major declared AI crawlers such as GPTBot and ClaudeBot document that they honour robots.txt, and control tokens such as Google-Extended are applied through it, but compliance is voluntary and varies across operators. robots.txt is a crawl request defined by a shared standard, not an access-control mechanism, so a non-compliant or undeclared scraper can ignore it. Enforcement requires server-side controls.
- Tracking GPTBot activity in logs
Tracking GPTBot means isolating requests whose user-agent carries the GPTBot token, verifying them against OpenAI's published IP ranges, then reporting which URLs were fetched, how often, and how recently. It is a server-side log exercise that should keep GPTBot out of human analytics and distinguish it from OpenAI's other tokens, ChatGPT-User and OAI-SearchBot.
- AI crawl budget and server load
Each AI crawler spends a finite budget on your site and consumes real origin resources per request. Inefficient URL structures, parameter explosions, and uncacheable dynamic pages waste that budget and amplify load. Reducing wasted fetches lets the budget reach your important content while keeping CPU, database, and bandwidth use sustainable.
- AI crawler user-agent spoofing
Any client can put GPTBot or ClaudeBot in its User-Agent header, because that header is supplied by the client and never validated by HTTP. Spoofers do this to borrow a trusted crawler's reputation or to get around rules. The defence is verifying the request's network source against the operator's published ranges, not trusting the string.
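A minimal sketch of that check, using Python's standard `ipaddress` module, is shown here. The CIDR range is a placeholder from the RFC 5737 documentation blocks and is not any operator's real range. A real deployment would load the list the operator publishes. The function name and status labels are hypothetical.

```python
import ipaddress

# PLACEHOLDER range from the RFC 5737 documentation blocks. It is NOT a real
# operator range; load the operator's published list instead.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
}


def classify_claim(user_agent, remote_ip, published_ranges=PUBLISHED_RANGES):
    """Return (token, status): 'verified' when the claimed token arrives from
    a published range, 'unverified' when it does not, and 'no-claim' when the
    user agent names none of the known tokens."""
    ua = user_agent.lower()
    address = ipaddress.ip_address(remote_ip)
    for token, cidrs in published_ranges.items():
        if token.lower() not in ua:
            continue
        networks = (ipaddress.ip_network(cidr) for cidr in cidrs)
        if any(address in network for network in networks):
            return token, "verified"
        return token, "unverified"
    return None, "no-claim"


# Placeholder user-agent strings that only contain the token.
print(classify_claim("placeholder-ua GPTBot", "192.0.2.10"))
print(classify_claim("placeholder-ua GPTBot", "203.0.113.5"))
```

Treating "unverified" as its own category, rather than silently counting it as the named crawler, keeps spoofed requests out of that crawler's metrics.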
- llms.txt and AI crawlers
llms.txt is a proposed convention: a Markdown file at your site root that points AI systems to your most important, LLM-friendly content. It is not robots.txt and not an access control — it is a curation hint. Adoption by AI crawlers is voluntary and uneven, so treat it as a complement to, not a replacement for, robots.txt and server-side controls.
- Does blocking AI crawlers hurt SEO?
Blocking AI-training crawlers such as GPTBot or CCBot does not remove your site from Google Search, because Googlebot is a separate crawler with its own token. The genuine trade-off is AI visibility: blocking AI crawlers can keep your content out of those AI systems. Search ranking and AI ingestion are governed by different tokens and different controls.
- HTTP response codes and AI crawlers
AI crawlers act on the HTTP status you return. A 200 invites ingestion; a 301 or 308 moves them to a new URL; a 403 or 401 signals refusal; a 404 or 410 says the page is gone; a 429 asks them to slow down; a 5xx says try again later. Returning the right code is how you steer a compliant AI crawler without blunt blocking, and the wrong code can mislead it for a long time.
- AI crawlers and JavaScript rendering
Many AI crawlers fetch raw HTML and do not execute JavaScript, so content injected client-side may be invisible to them. Rendering behaviour varies by operator and is often undocumented, so the safe assumption is that important content should be present in the server-rendered HTML. Server-side rendering or pre-rendering keeps content reachable regardless of a crawler's JS support.
- Separating AI crawler and search-bot traffic
AI crawlers and classic search bots arrive together but serve different purposes, honour different controls, and deserve different policies. Separating them in logs — by token, not by a generic bot flag — lets you allow Googlebot for Search while setting independent rules for GPTBot, ClaudeBot, and others. Mixing them produces misleading totals and the wrong policy decisions.
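Separation by token can be sketched as a small lookup. The tokens come from the entries in this reference, while the purpose labels, the function names, and the placeholder user-agent strings are assumptions made for illustration.

```python
from collections import Counter

# Token -> purpose label. Labels are illustrative shorthand, not vendor terms.
TOKEN_CLASSES = [
    ("GPTBot", "ai-training"),
    ("OAI-SearchBot", "ai-search"),
    ("ChatGPT-User", "ai-realtime-fetch"),
    ("ClaudeBot", "ai-crawler"),
    ("Claude-User", "ai-realtime-fetch"),
    ("Googlebot", "search"),
]


def classify(user_agent):
    """Return (token, label) for the first known token found in the string."""
    ua = user_agent.lower()
    for token, label in TOKEN_CLASSES:
        if token.lower() in ua:
            return token, label
    return None, "unclassified"


def tally(user_agents):
    """Count requests per purpose label instead of one generic bot total."""
    return Counter(classify(ua)[1] for ua in user_agents)


sample = [
    "placeholder-ua GPTBot",
    "placeholder-ua Googlebot",
    "placeholder-ua ChatGPT-User",
    "placeholder-ua browser",
]
print(tally(sample))
```

The match is on the declared token only, so the counts describe what clients claim to be until the requests are also verified by network source.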
- Monitoring for new AI crawlers
New AI crawlers appear regularly, often with tokens you have never seen. Monitoring for them means surfacing unfamiliar bot-like user agents, checking each against the operator's documentation before deciding policy, and resisting both reflexive blocking and reflexive trust. The aim is a deliberate, sourced decision for each new token rather than a static, stale allow/block list.
- Documenting your AI crawler policy
An AI crawler policy is a written record of which AI tokens you allow, throttle, or block, and why. Documenting it — alongside your robots.txt and edge rules — keeps decisions consistent as the crawler landscape changes, makes intent reviewable, and prevents the silent drift that happens when rules accrete without rationale. It is governance, not enforcement.
- How often AI crawlers revisit pages
AI crawlers revisit pages on their own schedules, influenced by perceived importance, update frequency, and each operator's budget. There is no fixed interval, and it differs per crawler. Reading recrawl recency from logs tells you how current each AI system's view of a page is — and stale recency on important pages is a coverage signal worth acting on.
- Geographic patterns in AI crawl traffic
AI crawl traffic often originates from a small set of cloud regions where the operator runs infrastructure. The coarse edge region of a request is not the operator's headquarters and not a person's location; it reflects where the crawl is hosted. Reading crawl geography in a privacy-respecting way means treating region as a coarse infrastructure estimate, never a precise or personal one.
- Bedrockbot — Amazon Bedrock crawler
Bedrockbot is a crawler Amazon documents in association with Amazon Bedrock, used to retrieve web content for Bedrock features. Amazon lists it among its crawlers with a robots.txt token. It is distinct from Amazonbot; identify each by its own token and set policy separately.
- GoogleOther — Google non-Search crawler
GoogleOther is a generic crawler Google uses for fetches not tied to building the Search index — for example internal research and development crawls. Google documents it as separate from Googlebot, with its own robots.txt token. Controlling GoogleOther does not affect Googlebot's Search crawling, and vice versa.
- AI agent browsers and operator agents
AI agent browsers — sometimes called operator agents — drive a real or headless browser to complete tasks a user asked for, such as filling a form or reading a page. Unlike training crawlers, they act per-session on a person's behalf, so they can render JavaScript, follow links interactively, and may or may not declare a stable token. This entry explains the pattern without inventing any specific product's user-agent string.
- Bot-management vendors and AI crawlers
CDN and bot-management vendors such as Cloudflare and Akamai now ship managed rules and toggles aimed specifically at AI crawlers, letting operators allow, challenge, or block known AI bots at the edge. This entry explains what those managed controls do, their limits, and why first-party measurement stays necessary even when an edge vendor handles enforcement.
- AI crawler honeypots and traps
An AI crawler honeypot is a deliberately planted resource — a hidden link, a disallowed path, or an endlessly generated 'tar-pit' page — used to detect or slow crawlers that ignore robots.txt. Tools such as Nepenthes popularised the tar-pit approach. This entry explains the techniques, what they can prove, and why they are a detection aid rather than enforcement.
- Measuring AI referral vs AI crawl
AI crawl and AI referral measure different things: a crawl is an AI system fetching your page; a referral is a human clicking through to your site from an AI answer or assistant. They use different signals — user-agent tokens versus referrer/landing context — and can move independently. This entry explains how to measure each without conflating them.
- AI crawler content licensing
Beyond allow-or-block, a third path is emerging: licensing content to AI vendors, or charging for crawl access. Publishers have signed content deals, and platforms have piloted pay-per-crawl mechanisms. This entry explains how licensing and monetization relate to crawler controls, factually and without revenue promises.
- Google CloudVertexBot
CloudVertexBot is a Google crawler that fetches website content for Vertex AI Agents when a site owner sets up that integration. Google documents it as a separate token, distinct from Googlebot's Search crawling and from Google-Extended's training-control token. It is owner-directed: it crawls sites at the request of the party building the agent.
- PanguBot — Huawei's AI crawler
PanguBot is a crawler reported in third-party crawler directories and operator logs as associated with Huawei's Pangu large-model effort. The robots.txt token PanguBot is observed and catalogued, but Huawei publishes limited official operator documentation, so this entry identifies it by token while marking unverifiable specifics as such rather than inventing them.
- Andibot — Andi Search's crawler
Andibot is the crawler token associated with Andi, an AI-based answer search engine. The token is catalogued in independent crawler directories and observed in logs. Andi publishes limited official operator documentation, so this entry identifies Andibot by token and flags unverifiable specifics rather than inventing IP ranges or behaviour.
- AwarioBot — Awario's listening crawler
AwarioBot is the crawler token associated with Awario, a social-listening and brand-monitoring service. It fetches web content to support mention tracking rather than to train a foundation model. The token is catalogued in crawler directories; this entry identifies it by token and marks unverifiable specifics as such rather than inventing them.
- AI crawler impact on analytics
When AI-crawler requests leak into human analytics, they inflate page views, skew bounce and engagement rates, and make traffic look healthier than it is. Because many crawlers do not run client-side JavaScript, client-only analytics often undercounts them while server logs see them. This entry explains the distortion in both directions and how to keep human metrics clean.
- AI crawlers and HTTP 429 rate limits
HTTP 429 Too Many Requests is the standard way to tell a crawler it is sending too many requests. A compliant AI crawler should back off, ideally honouring a Retry-After header. This entry explains how 429 interacts with AI crawlers, the Retry-After mechanism, and why 429 is a cooperative signal rather than a hard block.
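One way a server might produce that signal is a fixed-window counter, sketched below. This is an assumed design, not a description of any particular CDN or framework. The class name, the per-token key, and the limit values are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
import math


class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self._counts = {}  # key -> (window index, requests seen in that window)

    def check(self, key, now):
        """Return (status, headers) for a request from `key` at `now` seconds."""
        index = int(now // self.window)
        seen_index, count = self._counts.get(key, (index, 0))
        if seen_index != index:
            count = 0  # a new window has started, so the counter resets
        if count >= self.limit:
            # Tell the client how many seconds remain until the window resets.
            retry_after = math.ceil((index + 1) * self.window - now)
            return 429, {"Retry-After": str(max(retry_after, 1))}
        self._counts[key] = (index, count + 1)
        return 200, {}


limiter = FixedWindowLimiter(limit=2, window=60)
for second in (0, 1, 10, 61):
    print(second, limiter.check("GPTBot", now=second))
```

The limiter only emits the status and header. Whether the crawler actually waits is up to the crawler, which is why 429 remains a cooperative signal.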
- Detecting AI crawlers without a user agent
Not every AI crawler declares a clean token — some send a blank, generic, or browser-like user agent. You cannot identify those by token alone. This entry describes the behavioural and network signals that flag likely automated AI fetching, while being explicit that behaviour suggests a class, not a named vendor, and that you must never invent identity.
- Operator agent traffic patterns
Operator agents — AI systems completing a task for one user — leave a different log signature than indexing crawlers. Instead of a steady, breadth-first sweep, they produce short, bursty, goal-directed sessions that may render pages and interact with forms. This entry describes those patterns so you can recognise agent runs without inventing a vendor identity.
- AI data partnerships vs scraping
An AI model can ingest your content two ways: by crawling your live site, or through a licensed data partnership or third-party dataset such as Common Crawl. These leave very different footprints — crawling shows in your logs, licensed ingestion may not. This entry explains the distinction so you do not misread a quiet crawl log as proof your content is absent from AI.
- AI crawler consent and opt-out signals
Several signals beyond a plain robots.txt block exist to express AI-use preferences: per-token robots.txt rules, the TDM Reservation Protocol (TDMRep) published by a W3C Community Group, and proposed meta directives such as noai/noimageai. They differ in scope and in how widely they are honoured. This entry maps the consent-signal landscape factually, without overstating which crawlers obey which.
- AI crawlers, caching, and snapshots
An AI assistant can present content from a stored snapshot taken during an earlier crawl rather than fetching your page in real time. That means an AI may reference a version of your page that no longer matches the live one, and your logs may show no recent crawl despite active AI usage. This entry explains snapshot behaviour and its measurement consequences.
- AI crawlers and edge/firewall rules
Edge and firewall rules are the most direct place to set AI-crawler policy: they evaluate every request before it reaches your application, so you can allow a declared crawler, rate-limit a noisy one, or block an undeclared scraper without writing application code. The reliable rule keys on the crawler's declared user-agent token plus a verified network source, because a user-agent string alone is spoofable.
- AI crawlers and structured data
Structured data — schema.org markup in JSON-LD, Microdata, or RDFa — gives crawlers an explicit, machine-readable description of a page's entities. AI crawlers can ingest it the same way they ingest the rest of the HTML, and clean markup can make extraction more reliable. It is a supplement to clear content, not a substitute, and it never overrides the visible text a model actually reads.
- AI crawlers and CDN bandwidth costs
AI crawlers consume real bandwidth: every fetched page, image, and asset is billable egress on most CDNs. A broad or repeated crawl can move serving costs without moving audience, because none of it is a human visit. Caching, conditional requests, and rate limits keep the bill proportional to the value of being crawled.
- AI crawler disclosure and transparency
A transparent AI crawler is one you can identify and reason about: it declares a stable robots.txt token, carries a self-identifying user-agent pointing at operator documentation, publishes its network source so you can verify it, and states what it fetches content for. Disclosure is what separates a crawler you can set policy on from an undeclared scraper you can only detect by behaviour.
- AI crawlers and bot-challenge pages
Bot-challenge pages — JavaScript challenges, interactive puzzles, and managed challenge interstitials — are designed to separate human browsers from automated clients. Most legitimate AI crawlers do not execute JavaScript or solve interactive challenges, so a challenge usually blocks them even when you only meant to filter abuse. Allowing a crawler means exempting its verified token from the challenge.
- AI crawlers and sitemap priority
An XML sitemap lists the URLs you want discovered and carries optional hints like lastmod, changefreq, and priority. For AI crawlers a sitemap is a discovery aid, not a command: it helps them find and re-check pages, but crawlers decide for themselves what to fetch. Accurate lastmod is the most useful signal; priority is advisory and widely ignored.
- AI crawler traffic in analytics dashboards
AI crawler activity often lands in the same dashboards as human traffic, where it can look like an audience that is not there. Whether a crawler shows up depends on how you count: server-side logging records every request including crawlers, while client-side JavaScript analytics usually miss crawlers that do not run scripts. Reading crawl separately keeps human metrics honest.
- AI crawlers and canonical tags
A rel=canonical link tells crawlers which URL is the preferred version of duplicate or near-duplicate content. For AI crawlers it consolidates signals onto one URL and reduces wasted fetches across query-string and parameter variants. Like robots and sitemap hints, canonical is a strong suggestion that crawlers usually respect but are free to override.
- AI crawlers and HTTP redirects
When a URL an AI crawler requests redirects, the crawler generally follows it the way a browser would, fetching the redirect target. Clean single-hop redirects pass content along efficiently; long chains and loops waste crawl budget and can cause a crawler to give up. The status code matters: 301 signals a permanent move, 302 a temporary one.
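Chains and loops can be found offline from your own redirect configuration. The sketch below assumes the redirects are available as a mapping from URL to `(status, target)`, which is a hypothetical shape. The `max_hops` default is an arbitrary illustration, since the limits individual crawlers apply are generally not documented.

```python
def trace_redirects(start, redirects, max_hops=5):
    """Follow `redirects` (url -> (status, target)) starting at `start`.

    Returns (chain, verdict), where verdict is 'ok', 'loop', or 'too-long'.
    """
    chain = [start]
    while chain[-1] in redirects:
        _status, target = redirects[chain[-1]]
        if target in chain:
            return chain + [target], "loop"  # we have been here before
        chain.append(target)
        if len(chain) - 1 > max_hops:
            return chain, "too-long"  # more hops than we are willing to accept
    return chain, "ok"


# Illustrative redirect table.
redirects = {
    "/a": (301, "/b"),
    "/b": (302, "/c"),
    "/x": (301, "/y"),
    "/y": (301, "/x"),
}
for url in ("/a", "/x", "/c"):
    print(url, trace_redirects(url, redirects))
```

Any chain longer than one hop is a candidate for collapsing into a single redirect that points straight at the final URL.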
- AI crawlers and server-side rendering
Server-side rendering (SSR) returns a fully built HTML document from the server, so the content is present in the initial response without needing a browser to run JavaScript. For AI crawlers — many of which fetch HTML but do not reliably execute client-side scripts — SSR makes your text dependably available, whereas client-side rendering risks delivering an empty shell.
- Attributing AI crawler costs
AI crawlers consume real resources: bandwidth, origin CPU, cache misses, and CDN egress. Cost attribution means assigning those costs to the crawler that caused them, using the request token and response size recorded in logs. Done well, it turns a vague 'bots are expensive' worry into a per-crawler figure you can act on.
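A per-crawler figure can be approximated with a short aggregation. The sketch assumes log records reduced to `(user_agent, response_bytes)` pairs and a flat price per gigabyte. Both the record shape and the price are hypothetical, real billing is usually tiered, and the user-agent strings are placeholders that only contain the token.

```python
from collections import defaultdict


def attribute_bytes(records, tokens):
    """Sum requests and response bytes per token; everything else is 'other'."""
    totals = defaultdict(lambda: {"requests": 0, "bytes": 0})
    lowered = [(token, token.lower()) for token in tokens]
    for user_agent, response_bytes in records:
        ua = user_agent.lower()
        label = next((token for token, low in lowered if low in ua), "other")
        totals[label]["requests"] += 1
        totals[label]["bytes"] += response_bytes
    return dict(totals)


def egress_cost(total_bytes, price_per_gb):
    """Convert bytes to a cost at a flat, assumed price per decimal gigabyte."""
    return total_bytes / 1e9 * price_per_gb


records = [
    ("placeholder-ua GPTBot", 1_500_000),
    ("placeholder-ua GPTBot", 500_000),
    ("placeholder-ua ClaudeBot", 250_000),
    ("placeholder-ua browser", 90_000),
]
usage = attribute_bytes(records, ["GPTBot", "ClaudeBot"])
PRICE_PER_GB = 0.05  # hypothetical price, for illustration only
for token, stats in usage.items():
    print(token, stats, egress_cost(stats["bytes"], PRICE_PER_GB))
```

Response size alone covers bandwidth. Origin CPU and cache misses would need extra fields in the log, such as request duration and cache status.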
- AI crawlers: API vs HTML access
AI systems can reach your content two ways: by crawling your public HTML pages, or by calling a structured API or feed you expose. HTML crawling is uncontrolled discovery of whatever is public; API access is an explicit, shaped channel you can authenticate, rate-limit, and version. The choice shapes how much control and visibility you keep.
- AI crawler incident response
An AI crawler incident is a crawl event that threatens stability or trust: a sudden request surge, a crawl that loads the origin near failure, or a request claiming a crawler identity it cannot prove. Good incident response is staged — verify, contain, then decide — so you protect the site without permanently blocking a crawler over a transient spike.
- AI crawlers and content syndication
Content syndication republishes your work on other domains — partners, aggregators, or licensees. AI crawlers may encounter the syndicated copy before or instead of your original, so without clear canonical signals the copy can become the version that is ingested and attributed. Managing syndication for AI access is mostly about pointing crawlers back to the source.
- AI crawlers and first-party data
First-party data here means crawl records your own server captures directly — request token, URL, status, timing — rather than data gathered by client-side scripts. Because most AI crawlers do not execute JavaScript, client analytics miss them almost entirely. First-party server-side records are the dependable way to see what AI crawlers actually did on your site.
- AI crawlers and metered paywalls
A metered paywall lets visitors read a set number of articles before requiring payment, usually tracked with cookies or counters in the browser. AI crawlers rarely carry cookies or session state, so browser-style metering does not constrain them the way it constrains people. Metering crawler access takes server-side rules keyed on the request, not client counters.
- AI crawlers and log retention
Log retention is how long you keep request records. For AI crawler analysis, longer retention reveals trends — which crawlers grew, when a new one appeared, how coverage changed — that short windows hide. The balance is keeping enough crawl history to be useful while not retaining personal data beyond what its purpose and law require.
- AI crawlers and Search Console
Google Search Console reports how Googlebot crawls and how your site performs in Google Search. It is scoped to Google's own crawlers and search, so it does not show GPTBot, ClaudeBot, PerplexityBot, or any non-Google AI crawler. To see AI crawler activity you need server-side records, not Search Console.
- Reading AI crawler benchmarks skeptically
Published benchmarks of AI crawler volume and share circulate widely, but they disagree because each measures a different sample — one network's customers, one site type, one window — and labels crawlers differently. Treat any single ranking as a sample-specific estimate, not a universal fact, and trust your own server-side data over a vendor's aggregate for your site.
- AI crawler CDN rule examples
CDN edge rules let you act on AI crawler requests before they reach your origin: rate-limit a token, serve it from cache, or challenge it. This page walks through example rule shapes and the principle behind them — match on the documented token for routing, but verify the source for anything security-sensitive, because user agents are spoofable.
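The principle can be sketched in plain Python rather than in any particular CDN's rule language. The action names and the `decide` helper are hypothetical, and the IP range is a reserved documentation block (192.0.2.0/24) standing in for an operator's published list, not a real crawler range.

```python
import ipaddress

# Placeholder ranges only: 192.0.2.0/24 is reserved for documentation.
# In practice these would be loaded from the operator's published IP list.
VERIFIED_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def decide(user_agent: str, source_ip: str) -> str:
    """Return a hypothetical edge action for one request."""
    lowered = user_agent.lower()
    token = next((t for t in VERIFIED_RANGES if t.lower() in lowered), None)
    if token is None:
        return "pass"  # no crawler token claimed; normal handling
    address = ipaddress.ip_address(source_ip)
    if any(address in network for network in VERIFIED_RANGES[token]):
        return "serve_from_cache"  # token claimed and source verified
    return "challenge"  # token claimed but source not in the known ranges
```

The token match alone only routes the request; the range check is what separates a verified crawler from a request that merely claims the name.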
- AI crawlers and RSS and Atom feeds
An RSS or Atom feed is a structured XML list of your recent content, designed for machine consumption. For AI crawlers it offers a clean discovery and ingestion channel: titles, links, dates, and often full or summary content in a predictable format, so a crawler can find new items without re-parsing your HTML. Feeds complement, rather than replace, page crawling.
- AI crawlers: prerendering vs server-side rendering
Prerendering and server-side rendering both aim to hand a crawler complete HTML, but they get there differently. SSR builds the page on the server for every request; prerendering generates a static snapshot ahead of time (or via a separate render service) and serves that. Both reliably feed non-JS AI crawlers; the trade-offs are freshness, complexity, and consistency.
- AI crawlers and content negotiation
Content negotiation lets a server return different representations of a URL based on request headers like Accept and Accept-Encoding. AI crawlers send these headers too, so the variant they receive depends on what they advertise and what you serve. Mishandled negotiation — wrong Vary header, or serving crawlers a different representation than humans — can distort what is ingested.
- AI crawlers and conditional requests
Conditional requests let a crawler ask 'send this only if it changed' using validators it stored from a prior fetch — an ETag or a Last-Modified date. If the page is unchanged, the server replies 304 Not Modified with no body, saving bandwidth and origin work. Supporting conditional requests well makes re-crawling by AI crawlers efficient for both sides.
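The exchange can be sketched as a small handler, assuming the ETag is derived from a hash of the body. The `make_etag` and `respond` helpers are hypothetical, header lookup is simplified to an exact key match, and Last-Modified handling is left out for brevity.

```python
import hashlib
from typing import Dict, Tuple


def make_etag(body: bytes) -> str:
    """Build a quoted validator from the body (one common approach, assumed here)."""
    return '"' + hashlib.sha256(body).hexdigest()[:16] + '"'


def _strip_weak(tag: str) -> str:
    """If-None-Match uses weak comparison, so drop any W/ prefix before comparing."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def respond(body: bytes, request_headers: Dict[str, str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, body), answering 304 when the validator matches."""
    etag = make_etag(body)
    headers = {"ETag": etag}
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [_strip_weak(c) for c in if_none_match.split(",")]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # unchanged: no body sent
    return 200, headers, body
```

A crawler that stored the ETag from its first fetch and sends it back in If-None-Match receives the empty 304 until the body changes, at which point the hash differs and a full 200 is returned.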
- Budgeting AI crawler load by token
Where cost attribution measures what a crawler costs, budgeting by token sets what it is allowed to cost. You assign each documented crawler token a request-rate and bandwidth allowance sized to its value and your capacity, then enforce it at the edge. Budgeting turns reactive incident response into a standing policy that keeps any one crawler from dominating resources.
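One way to express a per-token request allowance is a token bucket, shown here as a sketch rather than a recommended policy. The rates and burst sizes in `BUDGETS` are invented for illustration, only the request-rate half of the budget is modeled, and time is passed in explicitly so the behavior is easy to reason about.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Bucket:
    rate: float      # requests refilled per second
    capacity: float  # maximum burst size
    tokens: float    # requests currently available
    updated: float   # timestamp of the last refill, in seconds

    def allow(self, now: float) -> bool:
        """Refill for elapsed time, then spend one request if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical allowances: crawler token -> (requests per second, burst size).
BUDGETS = {"GPTBot": (2.0, 4.0), "ClaudeBot": (1.0, 2.0)}


def make_buckets(start: float = 0.0) -> Dict[str, Bucket]:
    """Create one full bucket per budgeted crawler token."""
    return {
        token: Bucket(rate=rate, capacity=burst, tokens=burst, updated=start)
        for token, (rate, burst) in BUDGETS.items()
    }
```

A request whose bucket returns `False` would typically be answered with a rate-limit response at the edge; a bandwidth allowance could be modeled the same way by spending response bytes instead of one unit per request.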
- AI crawler traffic and log sampling
Log sampling keeps only a fraction of requests to save storage and cost. It is fine for high-level trends but distorts AI crawler analysis: a newly appearing or low-volume crawler can vanish entirely from a sampled view, and per-token counts become estimates. Knowing whether your logs are sampled — and at what rate — is essential to trusting AI crawl numbers.
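The effect can be put in numbers under a simplifying assumption: each request is kept independently with the same probability. Real sampling schemes may differ, and both helper names below are hypothetical.

```python
def estimate_total(observed: int, sample_rate: float) -> float:
    """Scale a sampled count back up to an estimated true count."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return observed / sample_rate


def miss_probability(true_requests: int, sample_rate: float) -> float:
    """Chance that none of a crawler's requests survive sampling."""
    if not 0 < sample_rate <= 1:
        raise ValueError("sample_rate must be in (0, 1]")
    return (1 - sample_rate) ** true_requests
```

Under this model, a crawler that made 20 requests has roughly an 82% chance of not appearing at all in a 1% sample (`miss_probability(20, 0.01)`), which is why a new or low-volume crawler can be invisible in sampled logs.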
- Kagibot — Kagi's web crawler
Kagibot is the crawler run by Kagi, an independent paid search engine. Kagi documents the crawler, its robots.txt token, a self-identifying URL in the user agent, and published IP addresses with hostnames for verification. Notably, if no Kagibot-specific robots.txt rule exists, Kagibot follows the Googlebot directives instead.