Auditing crawls with server log files

A server log file crawl audit reads raw access logs to see exactly how crawlers interact with your site: which URLs each bot fetched, what status codes they received, how often, and how much of your crawl budget is spent on low-value paths. Because logs record every request server-side, they reveal crawl behaviour that JavaScript analytics and sampled reports cannot: the ground truth of who fetched what.

Verified against primary sources

What a log audit reveals

Every request a server handles is recorded in its access log: the requesting user agent, the URL path, the HTTP status returned, the timestamp, and usually the client IP. Filtering those records to crawler user agents shows the real crawl: which pages bots fetched, how frequently, and what they received.
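As a minimal sketch of pulling those fields out of a log, the snippet below parses one line in the widely used "combined" access log layout. The layout is an assumption (your server may log a different format), the field names are illustrative, and the sample line uses a documentation-range IP rather than real traffic.

```python
import re
from datetime import datetime

# Assumption: Apache/nginx "combined" log format. Adjust the pattern if your
# server is configured with a different layout.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)(?: \S+)?" '
    r'(?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    return record

# Made-up sample line (192.0.2.10 is a documentation address, not a crawler IP).
sample = (
    '192.0.2.10 - - [24/Jun/2026:10:15:32 +0000] '
    '"GET /products/widget HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)
record = parse_line(sample)
```

Lines that do not match return None, so malformed entries can be counted and inspected rather than silently skewing the audit.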

Unlike client-side analytics — which most bots never trigger because they do not run JavaScript — logs capture every server request. That makes them the authoritative source for crawl coverage, crawl frequency, and status-code distribution per bot.

How to run the audit

Start by segmenting requests by crawler. Group them by the product tokens each operator documents (the same tokens used in robots.txt) and by the self-identifying URL many bots include in their user-agent string, but treat the result as a claim rather than proof, since UA strings can be copied. Where authenticity matters, for example to confirm Googlebot or Bingbot, verify the source IP using each operator's published verification method (reverse DNS with forward confirmation, or their published IP ranges), never an invented range.
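The reverse-DNS check can be sketched roughly as follows. This is an illustration under stated assumptions, not any operator's official tooling: the hostname suffixes are the two named in the FAQ for Googlebot, the function names are hypothetical, and the resolvers are injectable so the logic can be exercised offline with made-up DNS answers.

```python
import ipaddress
import socket

# Suffixes as described for Googlebot in the FAQ; other operators publish their own.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def reverse_lookup(ip):
    """Return the PTR hostname for an IP, or None if the lookup fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None

def forward_lookup(hostname):
    """Return the addresses a hostname resolves to (empty list on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except OSError:
        return []
    return [info[4][0] for info in infos]

def is_verified(ip, suffixes, reverse=reverse_lookup, forward=forward_lookup):
    """Reverse DNS must land on an allowed suffix AND resolve back to the same IP."""
    hostname = reverse(ip)
    if not hostname:
        return False
    hostname = hostname.rstrip(".").lower()
    if not hostname.endswith(tuple(suffixes)):
        return False
    claimed = ipaddress.ip_address(ip)
    return any(ipaddress.ip_address(addr) == claimed for addr in forward(hostname))

# Offline demo with made-up DNS answers (documentation-range IPs, not real crawler data).
fake_ptr = {
    "192.0.2.1": "crawl-192-0-2-1.googlebot.com",
    "192.0.2.2": "host.example.net",
    "192.0.2.3": "crawl.googlebot.com",
}
fake_a = {
    "crawl-192-0-2-1.googlebot.com": ["192.0.2.1"],
    "crawl.googlebot.com": ["198.51.100.9"],  # forward lookup does not match
}

def fake_forward(hostname):
    return fake_a.get(hostname, [])

results = {
    ip: is_verified(ip, GOOGLE_SUFFIXES, reverse=fake_ptr.get, forward=fake_forward)
    for ip in ("192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4")
}
```

Both steps matter: a PTR record is controlled by whoever owns the IP block, so the forward confirmation is what ties the hostname back to the operator's own DNS. In a real audit, caching lookups per IP avoids repeating DNS queries for every log line.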

Then profile what they fetched: count status codes per bot (a wall of 404s, 301 chains, or 5xx points to specific problems), rank the most-crawled paths (are bots spending budget on faceted or parameter URLs?), and compare crawled URLs against your sitemap to find orphan or excluded pages being crawled. Finally, watch crawl frequency over time against server load.
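The profiling steps above can be sketched with a few counters. The records, the `bot` labels, and the sitemap set below are hypothetical stand-ins for already parsed and segmented log data; only the aggregation pattern is the point.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Hypothetical, already-segmented records (illustrative data, not real logs).
records = [
    {"bot": "Googlebot", "path": "/products/widget", "status": 200},
    {"bot": "Googlebot", "path": "/search?color=red&page=2", "status": 200},
    {"bot": "Googlebot", "path": "/search?color=blue&page=7", "status": 200},
    {"bot": "Googlebot", "path": "/old-page", "status": 404},
    {"bot": "bingbot", "path": "/products/widget", "status": 200},
    {"bot": "bingbot", "path": "/old-page", "status": 301},
]
# Hypothetical set of paths listed in the sitemap.
sitemap_paths = {"/products/widget", "/about"}

status_by_bot = defaultdict(Counter)  # bot -> Counter of status codes
path_counts = Counter()               # path -> number of crawler fetches
for r in records:
    status_by_bot[r["bot"]][r["status"]] += 1
    path_counts[r["path"]] += 1

# Share of crawler requests that hit URLs carrying a query string.
parameter_hits = sum(1 for r in records if urlsplit(r["path"]).query)
parameter_share = parameter_hits / len(records)

crawled = set(path_counts)
crawled_not_in_sitemap = crawled - sitemap_paths  # orphan or excluded candidates
in_sitemap_not_crawled = sitemap_paths - crawled  # listed pages bots never fetched
top_paths = path_counts.most_common(3)
```

A high `parameter_share` is only a prompt to look closer: some parameter URLs are legitimate, so review the actual paths before blocking anything.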

What to do with the findings

Translate patterns into fixes: collapse redirect chains to single hops, return 410 for permanently removed URLs, block or canonicalise crawl traps, and ensure important pages are actually being fetched. If a verified crawler is hammering low-value URLs, that is crawl budget you can steer back toward content that matters.
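For the redirect-chain fix, one possible approach is to compute each redirecting URL's final destination so every rule can point there directly. The redirect map below is hypothetical; in practice it would come from your server configuration or from following the 301s seen in the logs.

```python
def resolve_final_targets(redirects):
    """Map each redirecting URL to the end of its chain (None if the chain loops)."""
    final = {}
    for start in redirects:
        seen = {start}
        current = redirects[start]
        # Follow hops until we reach a URL that does not redirect any further.
        while current in redirects:
            if current in seen:
                current = None  # redirect loop: needs manual attention
                break
            seen.add(current)
            current = redirects[current]
        final[start] = current
    return final

# Hypothetical redirect rules: source path -> target path.
redirects = {
    "/old": "/interim",
    "/interim": "/new",
    "/loop-a": "/loop-b",
    "/loop-b": "/loop-a",
}
single_hop = resolve_final_targets(redirects)
```

Here `/old` and `/interim` both resolve to `/new`, so each can be rewritten as a single hop, while the two looping paths come back as None and need a human decision.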

Logs also expose impostors: a request claiming to be a major search bot from an IP outside that operator's published ranges is not that bot. Treat such traffic as unverified automation, not a trusted crawler, and corroborate against the operator's official verification guidance.
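A range check along these lines could look like the sketch below. It assumes the operator publishes its ranges as a JSON document with a `prefixes` list of `ipv4Prefix` / `ipv6Prefix` entries; confirm the shape against the operator's current file. The prefixes shown are reserved documentation ranges used purely as placeholders, NOT real crawler ranges, and the function names are hypothetical.

```python
import ipaddress

def load_networks(prefix_doc):
    """Build network objects from a parsed JSON document of published prefixes."""
    networks = []
    for entry in prefix_doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks

def in_published_ranges(ip, networks):
    """True if the IP falls inside any of the operator's published networks."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)

# PLACEHOLDER data: documentation ranges standing in for a downloaded,
# operator-published file. Always load the operator's real, current list.
placeholder_doc = {
    "prefixes": [
        {"ipv4Prefix": "192.0.2.0/24"},
        {"ipv6Prefix": "2001:db8::/32"},
    ]
}
networks = load_networks(placeholder_doc)

inside = in_published_ranges("192.0.2.55", networks)
outside = in_published_ranges("198.51.100.7", networks)
inside_v6 = in_published_ranges("2001:db8::1", networks)
```

Published ranges change over time, so refresh the downloaded list regularly instead of hard-coding it.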

How it appears in analytics and logs

A log audit shows the actual requests crawlers made — user agent, URL, status, timestamp. It is the authoritative view of crawl coverage and waste, free of sampling, and the basis for verifying claimed crawlers against published ranges.

Diagnostic use case

Audit a site's crawl by analysing access logs: confirm which crawlers reach which pages, surface status-code patterns, and find where crawl budget is wasted.

What WebmasterID can help detect

WebmasterID classifies requests server-side and surfaces crawler activity without you parsing raw logs, complementing a manual log audit by attributing fetches to known bots and flagging unknown ones.

Privacy and accuracy notes

A log audit focuses on crawler requests, which are not people. Access logs can contain IPs; treat them as sensitive, never publish raw IPs, and use coarse, aggregate views rather than per-visitor tracking.
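One way to keep views coarse is to truncate addresses to a network prefix before aggregating. The sketch below assumes /24 for IPv4 and /48 for IPv6; those prefix lengths are illustrative choices, not a legal or compliance standard, and the sample addresses are documentation IPs.

```python
import ipaddress
from collections import Counter

def coarsen(ip, v4_prefix=24, v6_prefix=48):
    """Replace an address with the network it belongs to (assumed prefix lengths)."""
    address = ipaddress.ip_address(ip)
    prefix = v4_prefix if address.version == 4 else v6_prefix
    # strict=False lets us pass a host address and get its enclosing network.
    return str(ipaddress.ip_network(f"{address}/{prefix}", strict=False))

# Made-up sample addresses from documentation ranges.
sample_ips = ["192.0.2.77", "192.0.2.130", "198.51.100.4", "2001:db8:abcd:12::1"]
requests_per_network = Counter(coarsen(ip) for ip in sample_ips)
```

Reports then show counts per network rather than per address, which is usually enough to spot a misbehaving crawler without exposing individual IPs.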

Frequently asked questions

Why use server logs instead of analytics for crawl data?
Most crawlers do not execute JavaScript, so they never appear in client-side analytics. Server logs record every request regardless, making them the authoritative source for what crawlers actually fetched.

How do I verify a request really came from Googlebot?
Use Google's published verification method (reverse DNS to a googlebot.com/google.com host, then forward-confirm) or check against Google's published ranges. Never trust the user-agent string alone, and never invent an IP range.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.