How to track AI crawlers on your website
A practical, step-by-step walkthrough. We cover what to detect, how to detect it, and how to turn crawl signals into operational decisions.
Decide what to detect
A short watch-list of known AI crawler user-agents covers the bulk of practical visibility decisions.
- GPTBot — OpenAI's general-purpose crawler.
- OAI-SearchBot — OpenAI's real-time search crawler.
- ClaudeBot, anthropic-ai, Claude-Web — Anthropic surfaces.
- PerplexityBot — Perplexity's indexing crawler.
- Google-Extended — Google's robots.txt token controlling whether content feeds Gemini and AI training (the crawling itself still happens under Googlebot's user-agent).
- CCBot — Common Crawl, used by many downstream LLM datasets.
- Bytespider, cohere-ai, Diffbot, and similar known crawlers.
Pick a detection model
There are three practical options. Pick the one that matches your team.
- Server logs. Grep your access logs against a maintained allow-list of crawler signatures. Cheap but brittle.
- CDN / WAF reports. Cloudflare, Fastly, and similar platforms categorise some AI bots natively. Limited cross-source visibility.
- First-party tracker (what WebmasterID does). A small script + an ingest layer that classifies at ingest time and correlates crawl events with AI referrals on the same dashboard.
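To make the server-log option concrete, here is a sketch of what "grep your access logs" amounts to in practice: scan an Nginx/Apache combined-format log and tally hits per crawler signature. The signature list is illustrative and, as noted, keeping it current is the recurring cost of this approach.

```python
import re
from collections import Counter

# Illustrative signature list; real lists need ongoing maintenance.
SIGNATURES = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "PerplexityBot", "CCBot", "Bytespider"]

# In the combined log format, the User-Agent is the last double-quoted field.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def tally_ai_hits(log_lines):
    """Count hits per AI crawler signature across an iterable of log lines."""
    counts = Counter()
    for line in log_lines:
        m = UA_RE.search(line)
        if not m:
            continue
        ua = m.group(1).lower()
        for sig in SIGNATURES:
            if sig.lower() in ua:
                counts[sig] += 1
                break
    return counts
```

This works until it doesn't: a custom log format, rotated-away files, or a renamed bot all break it quietly, which is the "cheap but brittle" trade-off in a nutshell.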
Decide what counts as 'AI' vs 'search' vs 'automation'
Closed categories beat free-text labels. The dashboard groups bots into AI / search / automation / other. Uncategorised bots stay uncategorised — never relabelled.
See Bot Intelligence for the full taxonomy and AI Referrals for how detected crawls correlate with referral traffic.
Connect llms.txt
The llms.txt convention is to AI crawlers what robots.txt is to search-engine bots. Maintaining it (and watching whether crawlers respect it) is part of the workflow.
We track llms.txt fetches separately so you can confirm crawlers actually read the file before applying its rules.
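Confirming that a crawler read the file before trusting it to follow the rules can be sketched as a filter over classified crawl events. This assumes events are dicts with `path` and `bot` keys, as an ingest layer might produce; the shape is illustrative, not WebmasterID's actual schema.

```python
from collections import Counter

def llms_txt_readers(events):
    """Count /llms.txt fetches per detected bot.

    `events` is assumed to be an iterable of dicts with "path" and
    "bot" keys; unclassified events (bot is None/missing) are skipped,
    and query strings on the path are ignored.
    """
    return Counter(
        e["bot"]
        for e in events
        if e["path"].split("?")[0] == "/llms.txt" and e.get("bot")
    )
```

A crawler that appears in your page-crawl tallies but never in this counter is reading your content without ever consulting its rules.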
Frequently asked
- Why does tracking AI crawlers matter?
- Because they decide whether your content shows up in AI answers. ChatGPT, Claude, and Perplexity each crawl differently, respect robots.txt and llms.txt differently, and visit at different cadences. Without tracking, you cannot tell which AI surfaces are actually reading your site.
- Can I track AI crawlers from server logs?
- Yes, but it is brittle. Log formats differ, log rotation drops data, and matching User-Agent strings against a maintained list is a recurring task. A first-party tracker that classifies at ingest time is more reliable and lets you correlate with referrals + page views.
- Do I need a cookie banner?
- No. AI crawler detection runs server-side from request signatures only. No cookies, no personal data, no consent UI required for the detection layer.