Auditing crawls with server log files
A server log file crawl audit reads raw access logs to see exactly how crawlers interact with your site: which URLs each bot fetched, what status codes it received, how often it came back, and how much of your crawl budget is spent on low-value paths. Because logs record every request server-side, they reveal crawl behaviour that JavaScript analytics and sampled reports cannot. They are the ground truth of who fetched what.
What a log audit reveals
Every request a server handles is recorded in its access log: the requesting user agent, the URL path, the HTTP status returned, the timestamp, and usually the client IP. Filtering those records to crawler user agents shows the real crawl: which pages bots fetched, how frequently, and what they received.
Unlike client-side analytics — which most bots never trigger because they do not run JavaScript — logs capture every server request. That makes them the authoritative source for crawl coverage, crawl frequency, and status-code distribution per bot.
- Identify which crawlers fetched which URLs, and how often
- See the exact status codes each crawler received per URL
- Spot crawl budget spent on parameters, facets, or dead URLs
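As a rough illustration of the fields described above, the sketch below parses one access-log line and labels the crawler its user agent claims to be. It assumes the common "combined" log format; your server's format may differ. `COMBINED`, `CRAWLER_TOKENS`, `parse_line` and `claimed_crawler` are hypothetical names introduced here, and the token list is only an example to be replaced with the tokens each operator documents.

```python
import re
from typing import Optional

# Assumption: logs use the common "combined" format. Adjust the pattern
# if your server writes a custom format.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

# Hypothetical example list; use the tokens each operator documents.
CRAWLER_TOKENS = ("Googlebot", "bingbot")


def parse_line(line: str) -> Optional[dict]:
    """Return the request fields as a dict, or None if the line does not match."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def claimed_crawler(record: dict) -> Optional[str]:
    """Return the crawler token found in the user agent, if any.

    This is only what the request claims to be; it is not verification.
    """
    user_agent = record["ua"].lower()
    for token in CRAWLER_TOKENS:
        if token.lower() in user_agent:
            return token
    return None
```

A match from `claimed_crawler` is a starting point for segmentation only, since any client can send the same user-agent string.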
How to run the audit
Start by segmenting requests by crawler. Match on the user-agent tokens each operator documents (the same tokens used in robots.txt) and on any self-identifying URL in the user-agent string, but treat a match as a claim rather than proof, since user-agent strings can be copied. Where authenticity matters, for example to confirm Googlebot or Bingbot, verify the source IP using each operator's published verification method (reverse DNS with forward confirmation, or their published IP ranges), never an invented range.
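The reverse-DNS check can be sketched as follows. This is a minimal example, not an operator's official tool: `verify_crawler_ip` is a hypothetical helper, the allowed hostname suffixes must come from the operator's own documentation, and the default forward lookup uses `socket.gethostbyname_ex`, which resolves IPv4 only. The resolvers are passed in as parameters so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Iterable, List


def _reverse_lookup(ip: str) -> str:
    # PTR lookup: returns the primary hostname for the IP.
    return socket.gethostbyaddr(ip)[0]


def _forward_lookup(host: str) -> List[str]:
    # A-record lookup (IPv4 only); IPv6 would need socket.getaddrinfo.
    return socket.gethostbyname_ex(host)[2]


def verify_crawler_ip(
    ip: str,
    allowed_suffixes: Iterable[str],
    reverse: Callable[[str], str] = _reverse_lookup,
    forward: Callable[[str], List[str]] = _forward_lookup,
) -> bool:
    """Reverse-resolve the IP, check the hostname suffix, then forward-confirm."""
    try:
        host = reverse(ip).rstrip(".").lower()
    except OSError:  # no PTR record or lookup failure
        return False
    # Require a dot before the suffix so "evilgooglebot.com" does not pass.
    suffixes = ["." + s.strip(".").lower() for s in allowed_suffixes]
    if not any(host.endswith(s) for s in suffixes):
        return False
    try:
        return ip in forward(host)
    except OSError:
        return False
```

A call might look like `verify_crawler_ip(ip, ["googlebot.com", "google.com"])`, with the suffix list taken from the operator's verification page. Because DNS lookups are slow, it is usually worth caching results per IP rather than resolving on every log line.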
Then profile what they fetched: count status codes per bot (a wall of 404s, 301 chains, or 5xx points to specific problems), rank the most-crawled paths (are bots spending budget on faceted or parameter URLs?), and compare crawled URLs against your sitemap to find orphan or excluded pages being crawled. Finally, watch crawl frequency over time against server load.
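The profiling step might look like the sketch below. It assumes each request has already been parsed into a dict with `bot`, `path` and `status` keys, and that the sitemap has been reduced to a collection of URL paths; both shapes, and the name `profile_crawl`, are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def profile_crawl(records, sitemap_paths, top_n=10):
    """Summarise crawler requests: status codes per bot, top paths, sitemap gaps."""
    status_by_bot = defaultdict(Counter)
    path_hits = Counter()
    parameter_hits = 0
    total = 0

    for rec in records:
        parts = urlsplit(rec["path"])
        status_by_bot[rec["bot"]][rec["status"]] += 1
        path_hits[parts.path] += 1      # count by path, ignoring the query string
        if parts.query:                 # request carried URL parameters
            parameter_hits += 1
        total += 1

    crawled = set(path_hits)
    in_sitemap = set(sitemap_paths)
    return {
        "status_by_bot": {bot: dict(codes) for bot, codes in status_by_bot.items()},
        "top_paths": path_hits.most_common(top_n),
        "parameter_share": parameter_hits / total if total else 0.0,
        "crawled_not_in_sitemap": sorted(crawled - in_sitemap),
        "in_sitemap_not_crawled": sorted(in_sitemap - crawled),
    }
```

A high `parameter_share` or a long `crawled_not_in_sitemap` list is a prompt to look closer, not a verdict: some parameter URLs are legitimate, and not every crawled URL belongs in a sitemap.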
What to do with the findings
Translate patterns into fixes: collapse redirect chains to single hops, return 410 for permanently removed URLs, block or canonicalise crawl traps, and ensure important pages are actually being fetched. If a verified crawler is hammering low-value URLs, that is crawl budget you can steer back to content that matters.
Logs also expose impostors: a request claiming to be a major search bot from an IP outside that operator's published ranges is not that bot. Treat such traffic as unverified automation, not a trusted crawler, and corroborate against the operator's official verification guidance.
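Where an operator publishes its IP ranges, the range check can be sketched with Python's standard `ipaddress` module. `ip_in_published_ranges` is a hypothetical helper, and the CIDR blocks in the usage lines are documentation-reserved placeholders (RFC 5737 and RFC 3849), not real crawler ranges; load the actual list from the operator's published source.

```python
import ipaddress


def ip_in_published_ranges(ip, cidrs):
    """Return True if the IP falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    for cidr in cidrs:
        network = ipaddress.ip_network(cidr, strict=False)
        # Compare only like with like: IPv4 against IPv4, IPv6 against IPv6.
        if addr.version == network.version and addr in network:
            return True
    return False


# Placeholder ranges for illustration only; these are NOT real crawler ranges.
placeholder_ranges = ["192.0.2.0/24", "2001:db8::/32"]
claimed_bot_is_in_range = ip_in_published_ranges("192.0.2.55", placeholder_ranges)
```

A request that claims to be a major crawler but fails this check belongs in the unverified-automation bucket described above.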
How it appears in analytics and logs
A log audit shows the actual requests crawlers made — user agent, URL, status, timestamp. It is the authoritative view of crawl coverage and waste, free of sampling, and the basis for verifying claimed crawlers against published ranges.
Diagnostic use case
Audit a site's crawl by analysing access logs: confirm which crawlers reach which pages, surface status-code patterns, and find where crawl budget is wasted.
What WebmasterID can help detect
WebmasterID classifies requests server-side and surfaces crawler activity without you parsing raw logs, complementing a manual log audit by attributing fetches to known bots and flagging unknown ones.
Common mistakes
- Trusting the raw user-agent string instead of verifying major crawlers by IP/reverse DNS.
- Publishing or storing raw client IPs from logs without treating them as sensitive.
- Relying on JavaScript analytics for crawl data, which most bots never trigger.
- Inventing IP ranges to verify a crawler instead of using the operator's published method.
Privacy and accuracy notes
A log audit focuses on crawler requests, which are not people. Access logs can contain IPs; treat them as sensitive, never publish raw IPs, and use coarse, aggregate views rather than per-visitor tracking.
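One way to get a coarse, aggregate view is to collapse each address to its containing network before counting. The sketch below assumes /24 for IPv4 and /48 for IPv6 as the coarsening levels; those prefixes, and the names `coarsen_ip` and `aggregate_by_network`, are illustrative choices rather than a standard. Truncation reduces the sensitivity of the data but should not be treated as full anonymisation.

```python
import ipaddress
from collections import Counter


def coarsen_ip(ip, v4_prefix=24, v6_prefix=48):
    """Replace an IP with its containing network, dropping the host bits."""
    addr = ipaddress.ip_address(ip)
    prefix = v4_prefix if addr.version == 4 else v6_prefix
    network = ipaddress.ip_network(f"{addr}/{prefix}", strict=False)
    return str(network)


def aggregate_by_network(ips):
    """Count requests per coarsened network instead of per individual IP."""
    return Counter(coarsen_ip(ip) for ip in ips)
```

Reports built from `aggregate_by_network` show where crawler traffic comes from without listing individual addresses.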
Frequently asked questions
- Why use server logs instead of analytics for crawl data?
- Most crawlers do not execute JavaScript, so they never appear in client-side analytics. Server logs record every request regardless, making them the authoritative source for what crawlers actually fetched.
- How do I verify a request really came from Googlebot?
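- Use Google's published verification method: run a reverse DNS lookup on the source IP, check that the hostname ends in googlebot.com or google.com, then run a forward DNS lookup on that hostname and confirm it returns the same IP. Alternatively, check the IP against Google's published ranges. Never trust the user-agent string alone, and never invent an IP range.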
Related pages
- Server log analysis for crawlers
Server logs record every request, making them the most reliable record of what crawlers actually fetched, when, and with what status. Analysing them reveals crawl coverage, errors, and waste that analytics tools miss. Doing it well means verifying claimed bots rather than trusting user-agents, and handling log data in a privacy-safe way.
- Crawl budget waste: causes and fixes
Crawl budget is the finite attention a search engine spends on your site. It is wasted when crawlers spend it on low-value URLs — endless faceted combinations, parameter variants, soft 404s, and redirect chains — instead of your important pages. Reducing that waste helps key content get crawled.
- Analysing the Search Console Crawl Stats report
The Crawl Stats report in Google Search Console (under Settings) shows how Googlebot crawled your site over the last 90 days: total crawl requests, total download size, average response time, and breakdowns by response code, file type, crawl purpose (discovery vs refresh), and Googlebot type. Reading it well tells you whether crawling is healthy and where it is being wasted.
- Bot intelligence
Server-side classification of crawlers and automation without manual log parsing.
Sources and verification notes
- Google Search Central — Verifying Googlebot and other crawlers
- Google Search Central — Large site owner's guide to managing crawl budget
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.