HTTP status & crawl-diagnostics reference for webmasters
A reference to HTTP status codes and crawl diagnostics from a webmaster's point of view. Each page explains what a status or symptom means for crawlers, gives an operator action checklist, and links to the bot and robots pages you need to resolve it.
135 diagnostic topics documented · part of the Web Crawler & Traffic Intelligence Encyclopedia.
- HTTP 200 OK: what it means for crawlers
200 OK means the request succeeded and the server returned the resource. For crawlers it is the green light to process and potentially index a page. The subtle trap is the soft 404 — an error or empty page served with a 200 status, which wastes crawl budget and pollutes the index.
- HTTP 404 Not Found: what it means for crawlers
404 Not Found means the server has no resource at that URL. It is the correct, healthy response for genuinely missing pages — crawlers expect some 404s. Problems arise when important pages 404 by accident, when removed pages should signal 410, or when 'not found' pages wrongly return 200.
- HTTP 301 Moved Permanently for crawlers
301 Moved Permanently tells clients and crawlers that a resource has permanently moved to a new URL. It is the standard signal for migrations and URL changes: crawlers follow it, update their index over time, and consolidate ranking signals onto the new location. Use it whenever content has a stable new home.
- HTTP 302 Found (temporary redirect)
302 Found signals a temporary redirect: the resource is briefly available at a different URL, but the original should still be used in future. Because it does not communicate permanence, crawlers keep the original URL. Using 302 for a permanent move is a common diagnostic problem.
- HTTP 304 Not Modified and crawl efficiency
304 Not Modified is the response to a conditional request when the resource has not changed since the client last fetched it. The server returns no body, so the crawler reuses its cached copy. Correct conditional-request support with ETag or Last-Modified saves bandwidth and crawl budget.
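The 304 decision described above can be sketched as a small function. This is a minimal sketch, not a server implementation: `is_not_modified` is a hypothetical helper, headers are assumed to arrive as a plain dict with canonical capitalisation, and the ETag list is split on commas, which is a simplification.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def _strip_weak(tag):
    """Drop the W/ prefix so weak and strong validators compare equal."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers, etag, last_modified):
    """Return True when a conditional GET can be answered with 304.

    `etag` is the current quoted ETag; `last_modified` is a timezone-aware
    datetime. Both the helper name and the dict-based header access are
    assumptions made for illustration.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since.
        if if_none_match.strip() == "*":
            return True
        candidates = {_strip_weak(t) for t in if_none_match.split(",")}
        return _strip_weak(etag) in candidates

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            return False  # unparseable date: treat the request as unconditional
        if since.tzinfo is None:
            since = since.replace(tzinfo=timezone.utc)
        # HTTP dates have one-second resolution.
        return last_modified.replace(microsecond=0) <= since

    return False
```

When the function returns True, the server would send 304 with no body; otherwise it sends the full 200 response with fresh validators.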
- HTTP 307 Temporary Redirect
307 Temporary Redirect is a temporary redirect that, unlike the historically ambiguous 302, guarantees the request method and body are preserved. A POST stays a POST. It signals impermanence, so crawlers keep the original URL while following the detour for the current request.
- HTTP 308 Permanent Redirect
308 Permanent Redirect signals a permanent move while preserving the request method and body. It is the method-preserving counterpart to 301: crawlers follow it, replace the old URL over time, and consolidate signals onto the target — without downgrading a POST to a GET.
- HTTP 400 Bad Request for crawlers
400 Bad Request means the server refused to process the request because it appeared malformed — bad syntax, invalid headers, or a request the server cannot interpret. Seeing 400s for crawlers usually points at malformed URLs, encoding issues, or a misbehaving edge layer rather than the crawler itself.
- HTTP 401 Unauthorized and crawling
401 Unauthorized means the request lacks valid authentication credentials for the resource. Crawlers do not log in, so a page behind a 401 cannot be fetched or indexed. Seeing 401s for content you intended to be public usually means an auth layer is misconfigured or applied too broadly.
- HTTP 403 Forbidden and blocked crawlers
403 Forbidden means the server understood the request but refuses to authorize it, and authenticating will not help. For crawlers, a 403 often signals over-blocking — a WAF, bot-management rule, or IP filter rejecting legitimate crawlers and quietly removing pages from being indexed.
- HTTP 410 Gone vs 404
410 Gone means the resource was intentionally and permanently removed and is not coming back. It is a stronger, more deliberate removal signal than 404, and search engines can treat it as a faster cue to drop the URL. Use 410 when you have purposely retired content for good.
- HTTP 429 Too Many Requests and crawl rate
429 Too Many Requests means the client has sent too many requests in a given time and is being rate limited. It can include a Retry-After header telling the client when to try again. Compliant crawlers slow down in response, making 429 a controlled way to manage crawl rate.
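Retry-After can carry either a number of seconds or an HTTP date. The sketch below shows one way a client might turn the header into a wait time; `retry_after_seconds` is a hypothetical helper name, and returning None for unparseable values is an assumption about how a caller would want to handle them.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into a non-negative delay in seconds.

    Returns None when the value is neither delay-seconds nor an HTTP date.
    """
    value = value.strip()
    if value.isascii() and value.isdigit():
        return int(value)  # delay-seconds form, e.g. "120"
    try:
        when = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the client may retry immediately.
    return max(0, int((when - now).total_seconds()))
```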
- HTTP 500 and crawl health
500 Internal Server Error is a generic message that something went wrong on the server and it could not complete the request. Occasional 500s happen, but repeated 500s on important URLs harm crawl health: crawlers may slow down and, if errors persist, treat affected pages as unreliable.
- HTTP 502 Bad Gateway
502 Bad Gateway means a server acting as a gateway or proxy received an invalid response from an upstream server it was trying to reach. It points at a problem between layers — origin down, app crash, or a misconfigured proxy — rather than at the requested resource itself.
- HTTP 503 Service Unavailable for maintenance
503 Service Unavailable means the server is temporarily unable to handle the request, usually due to maintenance or overload. It is the correct, index-protecting status for planned downtime: with a Retry-After header, compliant crawlers understand the outage is temporary and come back later.
- Redirect chains and loops
A redirect chain is a sequence of hops (A to B to C) before reaching the final URL; a redirect loop never resolves. Chains waste crawl budget, slow signal consolidation, and can stop crawlers following beyond a hop limit. The fix is to point each source straight at the final destination.
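Chains and loops can be found offline from a redirect map before any crawler hits them. The following is a minimal sketch, assuming the redirects are available as a dict of source URL to target URL; `trace_redirects`, `flatten`, and the default hop limit of 10 are illustrative choices, not values taken from any crawler's documentation.

```python
def trace_redirects(start, redirects, max_hops=10):
    """Follow `redirects` (source URL -> target URL) starting at `start`.

    Returns (path, outcome) where outcome is "ok", "loop" or "too_many_hops".
    """
    path = [start]
    seen = {start}
    current = start
    while current in redirects:
        if len(path) - 1 >= max_hops:
            return path, "too_many_hops"
        current = redirects[current]
        path.append(current)
        if current in seen:
            return path, "loop"  # we came back to a URL already visited
        seen.add(current)
    return path, "ok"


def flatten(redirects, max_hops=10):
    """Point every cleanly resolving source straight at its final destination."""
    flat = {}
    for source in redirects:
        path, outcome = trace_redirects(source, redirects, max_hops)
        if outcome == "ok":
            flat[source] = path[-1]
    return flat
```

Sources that end in a loop or exceed the hop limit are left out of the flattened map on purpose, so they surface as items to fix by hand rather than being silently rewritten.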
- Canonical mismatch diagnosis
A canonical mismatch happens when your rel=canonical tag points one way while redirects, sitemaps, internal links, or hreflang point another. Conflicting signals confuse which URL should represent a piece of content, so crawlers may pick a canonical you did not intend. Aligning the signals fixes it.
- Crawl budget waste: causes and fixes
Crawl budget is the finite attention a search engine spends on your site. It is wasted when crawlers spend it on low-value URLs — endless faceted combinations, parameter variants, soft 404s, and redirect chains — instead of your important pages. Reducing that waste helps key content get crawled.
- Diagnosing a blocked crawler
When a crawler is not reaching your pages, the block can come from several layers: a robots.txt Disallow, a server-side 403, a WAF or bot-management rule, or an IP filter. Confirming which layer is responsible — rather than guessing — is the key to fixing it without opening doors you meant to keep shut.
- Diagnosing a bot traffic spike
A sudden spike in traffic is often bots, not audience. The diagnostic question is which bots: a verified crawler doing a fresh crawl wave, or spoofers and scrapers impersonating known crawlers. Separating verified crawlers from impostors by user-agent token and verification keeps your human analytics honest.
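One common verification approach is a reverse DNS lookup followed by a forward lookup that must return the original IP. The sketch below assumes that method; `verify_crawler_ip` is a hypothetical helper, the default hostname suffixes are assumptions that should be checked against each operator's published guidance, and the forward lookup used here only covers IPv4.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes=(".googlebot.com", ".google.com"),
                      reverse=None, forward=None):
    """Forward-confirmed reverse DNS check for a claimed crawler IP.

    `reverse` maps an IP to a hostname and `forward` maps a hostname to a
    list of IPs. Both default to the system resolver and can be replaced
    for testing. The suffix list is an assumption, not an authoritative list.
    """
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
    except OSError:
        return False  # no PTR record: cannot be verified
    if not host.endswith(tuple(allowed_suffixes)):
        return False  # hostname is outside the operator's domains
    try:
        # The hostname must resolve back to the same IP, otherwise the
        # PTR record could have been set by anyone controlling that IP.
        return ip in forward(host)
    except OSError:
        return False
```

A request whose user-agent claims a known crawler but whose IP fails this check belongs in the impostor bucket, regardless of what the user-agent string says.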
- Diagnosing an unknown bot
An unknown bot is a client whose user-agent does not match a known crawler. The right response is to verify what you can and resist guessing: attributing an unfamiliar user-agent to a named operator without evidence is how bad data spreads. An honest 'other' bucket is more useful than a confident wrong label.
- HTTP 201 Created and crawlers
201 Created means the request succeeded and resulted in one or more new resources being created, typically in response to a POST or PUT. It is a success status tied to writes, so it is rare for the GET requests crawlers issue. Seeing 201 in crawler logs usually points at an API or form endpoint being fetched, not a normal page.
- HTTP 204 No Content and indexing
204 No Content means the request succeeded but the server intentionally returns no body. It is useful for actions where the client needs no new content, such as a save that updates nothing visible. For a crawler there is nothing to render or index, so a 204 on a URL meant to be a page is a problem.
- HTTP 206 Partial Content and range requests
206 Partial Content is the response to a range request: the client asked for a byte range of a resource and the server returned just that portion. It underpins resumable downloads and media streaming, where players fetch a file in chunks. In crawler logs it usually reflects media or large-file fetching rather than page crawling.
- HTTP 303 See Other and POST-redirect-GET
303 See Other tells the client to fetch a different URL with a GET, regardless of the original request method. It is the backbone of the POST-redirect-GET pattern, sending a browser to a result page after a form submission so a refresh does not resubmit. Because it forces GET, it differs from method-preserving 307.
- HTTP 308 vs 301 for SEO
301 and 308 both signal a permanent move and let search engines consolidate signals onto the new URL. The difference is method handling: clients have historically been allowed to change a POST to a GET when following a 301, while 308 strictly preserves the request method and body. For ordinary GET page moves either works; 308 is safer when the method must not change.
- HTTP 405 Method Not Allowed
405 Method Not Allowed means the server recognises the request method but the target resource does not support it — for example a POST to a GET-only page. For crawlers, which issue GET (and sometimes HEAD), a 405 usually means the route does not allow GET, often a misconfiguration on a URL that should serve a page.
- HTTP 408 Request Timeout
408 Request Timeout means the server timed out waiting for the client to finish sending its request. It points at a slow or stalled connection rather than a problem with the resource. Compliant crawlers generally retry, so occasional 408s are tolerated, but a pattern can indicate network or origin slowness worth investigating.
- HTTP 409 Conflict
409 Conflict means the request could not be completed because it conflicts with the current state of the target resource — for example a concurrent edit or a version mismatch. It arises in write and API workflows, not in the GET fetches crawlers issue, so a 409 in crawler logs usually points at an action endpoint being reached.
- HTTP 451 Unavailable For Legal Reasons
451 Unavailable For Legal Reasons means access is denied because of a legal demand, such as a court order or government censorship. It is a deliberate, lawful block rather than a technical failure. For crawlers it is an access denial like 403, so the content cannot be fetched or indexed while the 451 stands.
- HTTP 501 Not Implemented
501 Not Implemented means the server does not support the functionality required to fulfil the request — typically an HTTP method it does not recognise or handle at all. It is a server-side error distinct from 405, where the resource exists but rejects a specific method. For crawlers it is an uncommon, server-level failure.
- HTTP 504 Gateway Timeout
504 Gateway Timeout means a server acting as a gateway or proxy did not receive a timely response from the upstream server it needed to reach. Unlike 502 (an invalid upstream response), 504 is specifically about the upstream being too slow or unreachable. Persistent 504s degrade crawl health much like other sustained 5xx errors.
- HTTP 103 Early Hints and performance
103 Early Hints is an informational status that lets a server send hints — typically Link headers for preloading or preconnecting — before the final response is ready. Browsers can start fetching critical assets earlier, improving load time. It is a performance optimisation that sits ahead of the eventual 200 a crawler processes.
- Soft 404 diagnosis and fixes
A soft 404 is a page that is effectively missing or empty but returns a 200 status, so it looks successful to crawlers while offering no real content. Search engines try to detect them, but you should not rely on that. Soft 404s waste crawl budget and can clutter the index with low-value URLs.
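A first-pass soft 404 check can be run over crawled pages with a simple heuristic. The sketch below is illustrative only: `looks_like_soft_404`, the phrase list, and the 200-character threshold are all assumptions to tune per site, and they do not reflect how any search engine detects soft 404s.

```python
# Assumed phrases that commonly appear on "not found" templates.
NOT_FOUND_PHRASES = ("page not found", "no longer available", "nothing was found")


def looks_like_soft_404(status, body_text, min_chars=200):
    """Flag a 200 response whose visible text is near-empty or reads like an error page."""
    if status != 200:
        return False  # a real 404 or 410 is not a *soft* 404
    text = " ".join(body_text.split()).lower()  # collapse whitespace
    if len(text) < min_chars:
        return True  # effectively empty page
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)
```

Flagged URLs are candidates for review, not confirmed problems: a long article that quotes the phrase "page not found" would be a false positive.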
- www vs non-www canonicalization
To a crawler, https://www.example.com and https://example.com are distinct URLs that can serve the same content, creating duplication. The fix is to choose one canonical host, redirect the other to it with a 301, and keep internal links, sitemaps, and canonical tags consistent with the chosen version.
- HTTP vs HTTPS canonicalization
https://example.com and http://example.com are different URLs, so serving content on both creates duplication and mixed signals. The standard fix is to force HTTPS: 301-redirect HTTP to HTTPS, reference only HTTPS in links, sitemaps, and canonicals, and use HSTS so clients default to the secure scheme.
- Trailing slash and duplicate URLs
A trailing slash can make /page and /page/ two distinct URLs serving the same content, creating duplication. Servers and frameworks differ in how they treat the slash, so the fix is to choose one form, 301-redirect the other to it, and keep links, sitemaps, and canonicals consistent.
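The host, scheme, and trailing-slash rules from the last three entries can be combined into one normalisation step. This is a minimal sketch, assuming `www.example.com` over HTTPS without a trailing slash is the chosen canonical form; `canonical_url` and `redirect_target` are hypothetical helpers, and dropping the fragment is a design choice made here because fragments are not sent to the server.

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url, host="www.example.com", trailing_slash=False):
    """Rewrite `url` to the chosen scheme (https), host and slash style."""
    parts = urlsplit(url)
    stripped = parts.path.rstrip("/")
    if not stripped:
        path = "/"  # the site root always keeps its slash
    else:
        path = stripped + ("/" if trailing_slash else "")
    return urlunsplit(("https", host, path, parts.query, ""))


def redirect_target(url, **options):
    """Return the URL to 301 to, or None when `url` is already canonical."""
    target = canonical_url(url, **options)
    return None if target == url else target
```

Because all three rules are applied at once, a request for `http://example.com/page/` resolves in a single hop rather than chaining one redirect per rule.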
- Infinite redirect loops
An infinite redirect loop occurs when URL A redirects to B which redirects back to A (directly or through a cycle), so the request never reaches a final response. Browsers and crawlers stop after a few hops and report an error. Loops make pages completely unreachable, blocking both users and indexing.
- Faceted navigation crawl traps
Faceted navigation — filters for size, colour, price, and so on — can combine into a near-infinite number of parameterised URLs. Crawlers can get stuck fetching these low-value combinations, a crawl trap that burns budget on duplicates. Managing it relies on robots.txt rules, canonical tags, and controlling which combinations are linked.
- Pagination and crawling
Paginated series — listings split across page 1, 2, 3 — affect how deep crawlers go and how content is discovered. Google once used rel=next/prev as a pagination signal but stopped using it; current practice relies on crawlable links, sensible URLs, and keeping important content within reachable crawl depth.
- JavaScript rendering and crawling
Content injected by JavaScript is not in the initial HTML, so a crawler must render the page to see it. Rendering is more expensive than fetching HTML, and not all crawlers render. Server-side rendering (SSR) or prerendering puts content in the HTML directly, reducing dependence on the crawler's render step.
- Server log analysis for crawlers
Server logs record every request, making them the most reliable record of what crawlers actually fetched, when, and with what status. Analysing them reveals crawl coverage, errors, and waste that analytics tools miss. Doing it well means verifying claimed bots rather than trusting user-agents, and handling log data in a privacy-safe way.
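A basic log tally can be sketched as below, assuming the access log uses the common "combined" format. The regular expression, the `status_counts_by_token` helper, and the default token list are illustrative assumptions. The user-agent match is only a claim by the client, so the result should be read alongside IP verification.

```python
import re
from collections import Counter

# Combined log format:
# ip ident user [time] "method path protocol" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def status_counts_by_token(lines, tokens=("Googlebot", "bingbot")):
    """Count (claimed bot, status) pairs; unmatched agents go to 'other'."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        agent = match.group("agent").lower()
        label = next((t for t in tokens if t.lower() in agent), "other")
        counts[(label, int(match.group("status")))] += 1
    return counts
```

Keeping only the token label and status in the output, rather than full IPs and user-agents, is one way to keep the aggregated report privacy-safe.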
- Crawl rate and server load
When crawlers request pages faster than your origin can comfortably serve, load rises. Compliant crawlers slow down when they receive a 429 or 503, ideally with a Retry-After header, giving you a controlled way to protect the server. Google adjusts crawl rate automatically based on site responsiveness and offers a way to report rate problems.
- Orphan pages diagnosis
An orphan page is one that no internal link points to. Crawlers discover pages mainly by following links, so an orphan is hard to find — it may exist only in a sitemap or be effectively invisible. Diagnosing orphans means comparing all known URLs against your internal link graph and fixing the gap with links.
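The comparison described above is a set difference. The sketch below assumes you already have a list of known URLs (for example from sitemaps and logs) and an internal link graph from a crawl; `find_orphans` and the `roots` parameter are hypothetical names.

```python
def find_orphans(known_urls, internal_links, roots=("/",)):
    """Return known URLs that no other page links to.

    `internal_links` maps a source URL to the URLs it links to.
    `roots` lists entry points (such as the homepage) that are reachable
    without an inbound internal link and should not be reported.
    """
    linked = set(roots)
    for source, targets in internal_links.items():
        # A page linking to itself does not make it discoverable.
        linked.update(t for t in targets if t != source)
    return sorted(set(known_urls) - linked)
```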
- Duplicate content diagnosis
Duplicate content is the same or very similar content available at multiple URLs. Google states that duplicate content is not in itself a penalty, but it does split signals and waste crawl budget, and search engines must pick one URL to show. Canonical tags, consistent linking, and parameter handling consolidate duplicates onto a preferred URL.
- HTTP 300 Multiple Choices: crawler handling
HTTP 300 Multiple Choices signals that the requested resource has more than one representation and the client (or user) should pick one, optionally guided by a Location header pointing at a preferred choice. It is defined in RFC 9110 but is almost never used in practice for web pages, because there is no standard machine-readable format for the choices. For SEO and crawling, a 300 is ambiguous: most crawlers cannot reliably follow it, so a concrete redirect is preferable.
- HTTP 402 Payment Required
HTTP 402 Payment Required is defined by RFC 9110 as reserved for future use, with no standardised semantics across the web. Some payment platforms and APIs repurpose it to signal that a request cannot proceed until payment is made, but there is no interoperable contract behind it. For crawlers it is a 4xx client error, so a page behind a 402 is generally not indexed.
- HTTP 407 Proxy Authentication Required
HTTP 407 Proxy Authentication Required is like 401, but the authentication is demanded by a proxy between the client and the server rather than by the origin. The proxy returns a Proxy-Authenticate header describing the challenge, and the client must resend with Proxy-Authorization. It almost never originates from your own web server, so seeing it usually points at network or proxy configuration rather than your site.
- HTTP 411 Length Required
HTTP 411 Length Required is returned when the server refuses to accept a request that does not define a Content-Length header. It applies to requests with a body — typically POST or PUT — where the server insists on a declared length rather than a chunked or undefined one. GET requests from crawlers have no body, so crawlers essentially never see a 411 on normal page fetches.
- HTTP 412 Precondition Failed
HTTP 412 Precondition Failed is returned when one or more preconditions in the request headers — such as If-Match or If-Unmodified-Since — evaluate to false against the current resource. It is the negative outcome of conditional requests used to avoid lost updates. It is distinct from 304 Not Modified, which is the cache-validation outcome for conditional GETs that crawlers rely on.
- HTTP 413 Content Too Large
HTTP 413 Content Too Large — renamed from Payload Too Large in RFC 9110 — is returned when the request body is larger than the server is willing or able to process. It commonly appears on upload endpoints and large POST bodies. The server may include a Retry-After header if the condition is temporary. Read-only crawler GETs carry no body, so this status does not affect normal page indexing.
- HTTP 414 URI Too Long
HTTP 414 URI Too Long is returned when the request-target URI is longer than the server is willing to interpret. It often comes from query strings that have grown unbounded — for example a GET form that should have been a POST, parameters appended in a loop, or a redirect that keeps stacking parameters. For crawling, over-long parameterised URLs can both waste crawl budget and trip 414s.
- HTTP 415 Unsupported Media Type
HTTP 415 Unsupported Media Type is returned when the origin refuses a request because its payload is in a format the target resource does not support — for example sending XML to an endpoint that only accepts JSON, or omitting the Content-Type header. It is a request-format error on the write path, so read-only page crawling does not normally produce it.
- HTTP 422 Unprocessable Entity
HTTP 422 Unprocessable Entity (originally from WebDAV, RFC 4918, listed in the IANA registry, and named Unprocessable Content in RFC 9110) means the server understood the request's content type and syntax but cannot process the contained instructions due to semantic errors, for example a valid JSON body that fails business-rule validation. It is widely used by APIs to signal validation failures, sitting between syntactic 400 and successful processing.
- HTTP 431 Request Header Fields Too Large
HTTP 431 Request Header Fields Too Large, defined in RFC 6585, is returned when the server refuses a request because the header section — either a single field or the total — is too big. A frequent cause is an oversized or accumulated cookie. The server can indicate which header caused the problem so the client can reduce it.
- HTTP 510 Not Extended
HTTP 510 Not Extended comes from RFC 2774, an experimental specification for an HTTP extension framework. It signals that the server requires further extensions to the request before it will fulfil it. The mechanism saw little adoption, so 510 is rare in practice. As a 5xx code, crawlers treat it as a server error and will not index the URL while it persists.
- HTTP 511 Network Authentication Required
HTTP 511 Network Authentication Required, from RFC 6585, is intended for use by intercepting proxies — captive portals — that need the client to authenticate before granting network access. It is deliberately not meant to be sent by origin servers. Its purpose is to give clients a machine-detectable signal that they are behind a captive portal rather than talking to the real site.
- Cloudflare 520 (Unknown Error)
HTTP 520 is a Cloudflare-specific status code, not part of any IANA/RFC standard. Cloudflare returns 520 when the origin server returns an empty, unknown, or otherwise unexpected response that Cloudflare cannot interpret. It is a catch-all for connection issues between Cloudflare and the origin, and it points to the origin or the connection, not to Cloudflare itself.
- Cloudflare 521 (Web Server Is Down)
HTTP 521 is a Cloudflare-specific status, not an IANA/RFC standard. Cloudflare returns 521 when it cannot establish a TCP connection to the origin — the origin actively refused the connection or is down. A frequent cause is the origin firewall blocking Cloudflare's IP ranges, or the web server process being stopped. It points squarely at origin reachability.
- Cloudflare 522 (Connection Timed Out)
HTTP 522 is a Cloudflare-specific status, not part of the IANA/RFC standards. Cloudflare returns 522 when the TCP connection to the origin timed out before it could be established — Cloudflare reached out but the origin did not complete the handshake in time. It usually reflects an overloaded origin, network/routing problems, or a firewall silently dropping packets.
- Cloudflare 524 (A Timeout Occurred)
HTTP 524 is a Cloudflare-specific status, not an IANA/RFC standard. Cloudflare returns 524 when it successfully connected to the origin but the origin did not return an HTTP response within Cloudflare's time limit. Unlike 522 (the connection itself timed out), 524 means the connection succeeded but the response was too slow — typically a long-running request on the origin.
- Diagnosing hreflang errors
hreflang annotations tell search engines which language and regional URL to show to which users. They are easy to get subtly wrong: return tags that are not reciprocal, invalid language or region codes, hreflang pointing at non-canonical or redirecting URLs, or a missing self-reference. These errors cause search engines to ignore the cluster, so the wrong-language page can surface for users.
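Two of the errors listed above, missing return tags and a missing self-reference, can be checked mechanically once the annotations are extracted. This is a minimal sketch, assuming annotations are held as a dict of page URL to `{hreflang code: target URL}`; `hreflang_problems` is a hypothetical helper and it does not validate language or region codes.

```python
def hreflang_problems(annotations):
    """Report missing self-references and non-reciprocal hreflang links.

    `annotations` maps each page URL to its {hreflang code: target URL} dict.
    Returns a list of (page, problem description) tuples.
    """
    problems = []
    for page, alternates in annotations.items():
        if page not in alternates.values():
            problems.append((page, "missing self-reference"))
        for target in alternates.values():
            if target == page:
                continue
            # The target page must list this page among its own alternates.
            if page not in annotations.get(target, {}).values():
                problems.append((page, f"no return tag from {target}"))
    return problems
```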
- Diagnosing XML sitemap errors
An XML sitemap helps search engines discover and prioritise your URLs, but a sitemap full of the wrong URLs sends mixed signals. Common errors include listing redirecting or non-200 URLs, including noindex or canonicalised-away pages, exceeding the 50,000-URL or 50 MB limits, or referencing the wrong protocol/host. A clean sitemap lists only canonical, indexable, 200-returning URLs.
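The size, count, and protocol/host checks can be automated for a single sitemap file. The sketch below assumes a standard `urlset` sitemap and a canonical prefix of `https://www.example.com/`; `sitemap_problems` is a hypothetical helper, and it does not fetch the listed URLs, so status codes, noindex, and canonical checks would need a separate crawl.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, measured on the uncompressed file
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_problems(xml_bytes, expected_prefix="https://www.example.com/"):
    """Return a list of human-readable problems found in one sitemap file."""
    problems = []
    if len(xml_bytes) > MAX_BYTES:
        problems.append("file exceeds 50 MB uncompressed")
    root = ET.fromstring(xml_bytes)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", NS) if el.text]
    if len(locs) > MAX_URLS:
        problems.append("more than 50,000 URLs")
    for loc in locs:
        if not loc.startswith(expected_prefix):
            problems.append(f"wrong protocol or host: {loc}")
    return problems
```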
- Diagnosing index bloat
Index bloat is when a site has far more URLs indexed than it has genuinely valuable, distinct pages. It comes from faceted-navigation variants, tracking parameters, paginated and filtered duplicates, thin or auto-generated pages, and internal search results. Bloat dilutes crawl attention and can bury your important pages among low-value ones. Diagnosis means comparing indexed counts to your real page inventory.
- Render-blocking resources and crawling
Render-blocking resources are scripts and stylesheets the browser must fetch and process before it can display a page. They slow the first paint for users and add work when search engines render pages to evaluate content. Reducing render-blocking — deferring non-critical JavaScript, inlining critical CSS, and minimising blocking requests — speeds rendering for both visitors and crawlers.
- Diagnosing structured data errors
Structured data (schema.org markup, usually as JSON-LD) lets search engines understand a page and can make it eligible for rich results. Errors — missing required properties, invalid types or values, markup that does not match visible content, or policy violations — can make a page ineligible for those features. Diagnosis uses validators and Search Console's rich-result reports.
- Mobile usability and mobile-first crawling
Google uses mobile-first indexing: it predominantly crawls and indexes the mobile version of a site with a smartphone crawler. If the mobile version is missing content, structured data, or images that the desktop version has, those can be lost from the index. Mobile usability problems — tiny tap targets, content wider than the screen, unreadable text — degrade the experience the mobile crawler evaluates.
- Noindex but heavily linked: a diagnosis
A noindex page that is still prominently linked across the site is a common, subtle conflict: you are telling search engines not to index a page while structurally treating it as important. Either the noindex is a mistake on a page you want indexed, or the heavy linking wastes internal link equity on a page you have chosen to keep out of the index. Diagnosis is about resolving the contradiction.
- HTTP 226 IM Used
HTTP 226 IM Used is a rare success status from RFC 3229 (Delta encoding in HTTP). The server has fulfilled a GET request and the response is one or more instance-manipulations applied to the current instance — most commonly a delta against a version the client already holds. It is almost never seen in ordinary crawling and signals a specialised content-negotiation feature is in play.
- HTTP 207 Multi-Status
HTTP 207 Multi-Status comes from RFC 4918 (WebDAV). Instead of one status for the whole request, the server returns a 207 with an XML multistatus body that reports a separate status for each affected resource. It is used when a single request touches multiple resources that can succeed or fail independently. It is a WebDAV/API response, not something search crawlers expect on content pages.
- HTTP 208 Already Reported
HTTP 208 Already Reported is defined by RFC 5842, an extension to WebDAV for bindings. It is used inside a 207 Multi-Status response to tell the client that a resource's members were already enumerated in a previous part of the response, so they are not listed again. It prevents infinite or repeated enumeration when bindings create multiple paths to the same collection.
- HTTP 416 Range Not Satisfiable
HTTP 416 Range Not Satisfiable (RFC 9110) is returned when a request includes a Range header whose ranges all fall outside the resource's current size — for example asking for bytes starting past the end of the file. The server cannot return the requested range and responds 416, usually with a Content-Range header stating the resource's total length.
- HTTP 417 Expectation Failed
HTTP 417 Expectation Failed (RFC 9110) is returned when the expectation in a request's Expect header cannot be met by the server or an intermediary. In practice the Expect header is almost always 'Expect: 100-continue', a handshake clients use before sending a large body. A 417 usually points to a proxy or server that does not support that handshake.
- HTTP 418 I'm a Teapot
HTTP 418 I'm a Teapot originates from RFC 2324, the Hyper Text Coffee Pot Control Protocol — an April Fools' joke from 1998. It is not part of the core HTTP specification, but the code number 418 is reserved by IANA so it will not be reused. Some sites and APIs return it deliberately as a humorous or bot-deterrent refusal; it has no defined production semantics.
- HTTP 421 Misdirected Request
HTTP 421 Misdirected Request (RFC 9110) is returned when a server receives a request directed at an authority (host) it cannot or is unwilling to produce a response for over the current connection. It frequently arises with HTTP/2 connection coalescing, where a client reuses one TLS connection for multiple hostnames that share a certificate but are not all served by that backend.
- HTTP 425 Too Early
HTTP 425 Too Early comes from RFC 8470, which governs using early data in HTTP. TLS 1.3 0-RTT early data lets a client send request data before the handshake completes, but such data can be replayed by an attacker. A server returns 425 to indicate it is unwilling to process a request that arrived in early data and asks the client to retry once the handshake is complete.
- HTTP 426 Upgrade Required
HTTP 426 Upgrade Required (RFC 9110) is returned when a server refuses to process a request on the current protocol and requires the client to upgrade. The response must include an Upgrade header naming the required protocol and a Connection: Upgrade header. A common use is insisting clients move to a newer protocol or to TLS before the request can proceed.
- HTTP 428 Precondition Required
HTTP 428 Precondition Required comes from RFC 6585. It lets a server require that a request be conditional, typically demanding an If-Match or If-Unmodified-Since header on a write. The goal is to avoid the lost-update problem: two clients fetch a resource, both modify it, and the second overwrites the first. By requiring a precondition, the server rejects blind writes with 428.
- HTTP 507 Insufficient Storage
HTTP 507 Insufficient Storage is a server-error status from RFC 4918 (WebDAV). It means the method could not be performed because the server is unable to store the representation needed to complete the request — for example a write that would exceed available disk space or a storage quota. It is a 5xx, so crawlers treat the URL as temporarily failing.
- HTTP 508 Loop Detected
HTTP 508 Loop Detected comes from RFC 5842, the WebDAV binding extensions. It tells the client the server terminated an operation because it detected an infinite loop while processing — typically a Depth: infinity request over a collection whose bindings create a cycle back into themselves. It is a server-side safety stop that prevents endless recursion.
- HTTP status code cheat sheet for crawlers
This cheat sheet maps the five HTTP status classes to what they mean for crawlers and indexing. It is a quick reference for reading server logs and Search Console crawl data: which codes allow normal indexing, which redirect, which signal client errors, and which are server failures crawlers will retry. The aim is to interpret status codes through a crawl-and-index lens rather than a generic one.
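The class-level mapping can be expressed as a lookup on the first digit of the status code. This is a rough sketch: `crawl_class` is a hypothetical helper and the one-line readings are simplified summaries of the entries on this page, not exact crawler behaviour for every individual code.

```python
def crawl_class(status):
    """Map an HTTP status code to (class name, rough crawl-and-index reading)."""
    if not 100 <= status <= 599:
        raise ValueError(f"not an HTTP status code: {status}")
    meanings = {
        1: ("informational", "interim response; the final status is what matters"),
        2: ("success", "content can be processed and potentially indexed"),
        3: ("redirection", "crawler follows the redirect or reuses its cache"),
        4: ("client error", "URL is not indexed while the error stands"),
        5: ("server error", "crawler retries later and may slow down"),
    }
    return meanings[status // 100]
```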
- Auditing crawls with server log files
A server log file crawl audit reads raw access logs to see exactly how crawlers interact with your site: which URLs each bot fetched, what status codes they received, how often, and how much of your crawl is spent on low-value paths. Because logs record every request server-side, they reveal crawl behaviour that JavaScript analytics and sampled reports cannot — the ground truth of who fetched what.
- Analysing the Search Console Crawl Stats report
The Crawl Stats report in Google Search Console (under Settings) shows how Googlebot crawled your site over the last 90 days: total crawl requests, total download size, average response time, and breakdowns by response code, file type, crawl purpose (discovery vs refresh), and Googlebot type. Reading it well tells you whether crawling is healthy and where it is being wasted.
- Using the URL Inspection tool
The URL Inspection tool in Google Search Console reports, for one URL, whether it is indexed, when Google last crawled it, which canonical Google chose, and any coverage or enhancement issues. Its live test fetches the URL in real time and shows the rendered HTML, loaded resources, and any crawl errors — making it the fastest way to diagnose why a specific page is or is not in the index.
- Reading the Page Indexing (Coverage) report
The Page Indexing report (formerly Index Coverage) in Google Search Console shows how many of your pages are indexed and groups the not-indexed pages by reason — such as crawled-not-indexed, discovered-not-indexed, duplicate without user-selected canonical, excluded by noindex, blocked by robots.txt, redirect, or soft 404. Each reason points to a distinct fix.
- Google Indexing API: scope and limits
Google's Indexing API lets sites notify Google when pages with specific structured data are added or removed so they can be crawled quickly. Google documents it for pages with JobPosting or BroadcastEvent (livestream) structured data only. It is not a general indexing shortcut for ordinary content; using it outside its documented scope is unsupported and ineffective.
- Fetch and render: how Google sees your page
Google crawls a page, then renders it with a headless Chromium-based engine before indexing, so the indexed content is the rendered DOM, not just the raw HTML. The old standalone Fetch as Google tool has been folded into the URL Inspection live test, which shows the rendered HTML, a screenshot, and loaded resources. Differences between raw and rendered output explain many JavaScript indexing problems.
- Monitoring crawl errors over time
Monitoring crawl errors means watching, over time, the rate and type of failures crawlers encounter: rising 404s, new 5xx spikes, redirect chains, robots.txt fetch failures, and host-status problems. Caught early through Search Console reports, server logs, and uptime checks, these are cheap to fix; caught late, after pages drop from the index, they are costly. The goal is trend detection, not one-off checks.
- Redirect best practices for crawlers
Good redirects keep crawlers and link equity flowing to the right destination. The essentials: use 301 or 308 for permanent moves and 302 or 307 only for genuinely temporary ones, redirect each old URL directly to its final target in a single hop, map to the closest equivalent page rather than dumping everything on the homepage, and avoid loops. These choices preserve indexing signals and conserve crawl budget.
- How CDNs interact with crawlers
A CDN sits between crawlers and your origin, so it shapes what crawlers see: cached responses, edge-served status codes, bot-management challenges, and region-specific edges. Configured well, a CDN speeds crawling and absorbs load; configured poorly, it can block legitimate crawlers, serve stale or wrong content, or return CDN-specific errors that look like origin problems. Understanding the interaction prevents silent crawl failures.
- Rate-limiting crawlers without losing indexing
When a crawler is overloading your server, the goal is to slow it down without telling search engines your content is gone. Safe techniques include returning 503 or 429 with a Retry-After header for short-term overload, using crawl-delay only where a crawler honours it (Googlebot does not), and adjusting crawl-rate settings where the crawler's operator provides them. Blunt blocks or long outages risk deindexing, so rate-limit deliberately.
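A minimal sketch of the 429-with-Retry-After technique, assuming a simple per-client fixed window. The `FixedWindowLimiter` class and its parameters are hypothetical illustrations, not part of any framework; production setups usually enforce this at the web server or CDN instead.

```python
import math
import time


class FixedWindowLimiter:
    """Hypothetical per-client fixed-window limiter (illustration only)."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._windows = {}  # client id -> (window start, request count)

    def check(self, client, now=None):
        """Return None if the request is allowed, else (429, headers)."""
        now = time.monotonic() if now is None else now
        start, count = self._windows.get(client, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # previous window expired; start a new one
        count += 1
        self._windows[client] = (start, count)
        if count <= self.max_requests:
            return None
        # Tell the crawler how many whole seconds remain in this window.
        retry_after = max(1, math.ceil(start + self.window_seconds - now))
        return 429, {"Retry-After": str(retry_after)}
```

Because the response is a temporary 429 rather than a 403 or 404, it asks the crawler to slow down without signalling that the content has been removed.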
- Fixing 'Indexed, though blocked by robots.txt'
A URL disallowed in robots.txt can still appear in Google's index if other pages link to it — Google may index the URL (often with no useful snippet) without crawling it. The trap is that a noindex tag on that page cannot be seen, because robots.txt stops Google fetching the page to read the tag. The fix is to allow crawling and use noindex, or to remove the link signals.
- Core Web Vitals and crawling
Core Web Vitals are Google's three user-centric performance metrics — Largest Contentful Paint, Cumulative Layout Shift, and Interaction to Next Paint. They are a page experience signal used in ranking, but they do not gate crawling or indexing: a slow page can still be crawled and indexed. This page explains how vitals are measured in the field versus the lab and where they fit in the crawl-to-rank pipeline.
- The hreflang x-default value
x-default is a special hreflang value that names the page to serve when no other language or region annotation matches the user. It is the fallback in an hreflang set — often a language selector, a global homepage, or a generic version. This page covers when x-default is appropriate, how it interacts with the rest of the cluster, and the return-tag and self-reference rules that keep it valid.
- XML sitemap best practices
An XML sitemap lists URLs you want crawled, helping search engines discover pages they might miss through links alone. The format has firm limits — 50,000 URLs and 50MB uncompressed per file — and works best when it contains only canonical, indexable, 200-status URLs with accurate lastmod values. This page covers the documented rules and the common quality problems that make a sitemap less useful.
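The per-file limits can be enforced when sitemaps are generated. The sketch below splits a URL list into files of at most 50,000 entries and rejects any file over 50MB uncompressed (taken here as 52,428,800 bytes). The function names are hypothetical, and it emits only `<loc>` elements, leaving out lastmod for brevity.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50MB uncompressed, i.e. 52,428,800 bytes


def chunk_urls(urls, size=MAX_URLS):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def render_sitemap(urls):
    """Render one <urlset> document; URLs are XML-escaped (& becomes &amp;)."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    lines += [f"  <url><loc>{escape(url)}</loc></url>" for url in urls]
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def build_sitemaps(urls):
    """Return a list of sitemap documents, each within both limits."""
    documents = []
    for chunk in chunk_urls(urls):
        xml = render_sitemap(chunk)
        if len(xml.encode("utf-8")) > MAX_BYTES:
            raise ValueError("sitemap exceeds 50MB uncompressed; use smaller chunks")
        documents.append(xml)
    return documents
```

The caller is assumed to pass only canonical, indexable, 200-status URLs; this sketch does not check that.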
- Client-side rendering and SEO
Client-side rendering (CSR) sends a thin HTML shell and builds the page in the browser with JavaScript. Googlebot can render JavaScript, but it does so in a deferred second pass, and content that depends entirely on client-side execution is more fragile to crawl and index than server-rendered HTML. This page explains how Google processes CSR, where it commonly fails, and safer alternatives.
- Canonical tag best practices
The rel=canonical annotation tells search engines which URL is the preferred version of duplicate or near-duplicate content, consolidating signals onto one URL. It is a strong hint, not a directive — Google may choose a different canonical if other signals disagree. This page covers correct implementation: self-referencing canonicals, absolute URLs, consistency with sitemaps and internal links, and the mistakes that send conflicting signals.
- Internal linking for crawl discovery
Internal links are how crawlers discover and reach pages within a site. Google primarily finds new URLs by following links, so pages with no incoming internal links become orphans that are hard to discover. This page explains crawl depth, link equity flow, and practical patterns — hub pages, breadcrumbs, related links, and crawlable HTML anchors — that keep important pages within easy reach of a crawl.
- Largest Contentful Paint (LCP) diagnosis
Largest Contentful Paint (LCP) measures when the largest content element in the viewport finishes rendering — a proxy for how quickly the main content becomes visible. It is one of the three Core Web Vitals. A good LCP is 2.5 seconds or less; the most common culprits are slow server response, render-blocking resources, slow resource load, and client-side rendering delays. This page breaks down the metric and its diagnosis.
- Cumulative Layout Shift (CLS) diagnosis
Cumulative Layout Shift (CLS) measures visual stability — how much page content unexpectedly moves while loading. It is one of the three Core Web Vitals. A good CLS is 0.1 or less; the usual causes are images and ads without reserved space, late-loading fonts, and content injected above existing elements. This page explains the metric, its thresholds, and the diagnosis-and-fix workflow.
- Interaction to Next Paint (INP) diagnosis
Interaction to Next Paint (INP) measures overall page responsiveness by observing the latency of user interactions throughout a visit and reporting a representative worst case. It became a Core Web Vital in March 2024, replacing First Input Delay. A good INP is 200ms or less; long tasks and heavy JavaScript on the main thread are the usual causes. This page covers the metric and its diagnosis.
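The "good" thresholds quoted for the three Core Web Vitals (LCP 2.5 seconds, CLS 0.1, INP 200ms) can be checked with a small helper. This is a sketch only: the key names and function are hypothetical, and the input is assumed to be whatever summary value you are assessing, for example the 75th-percentile field value that Google uses.

```python
# "Good" upper bounds as stated in the entries above.
GOOD_THRESHOLDS = {
    "lcp_seconds": 2.5,
    "cls": 0.1,
    "inp_ms": 200,
}


def failing_vitals(measurements):
    """Return the sorted names of supplied metrics that exceed the 'good' bound.

    Metrics missing from `measurements` are ignored rather than failed.
    """
    return sorted(
        name
        for name, limit in GOOD_THRESHOLDS.items()
        if name in measurements and measurements[name] > limit
    )
```

A value exactly on the threshold counts as good, matching the "or less" wording.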
- Server-side rendering and SEO
Server-side rendering (SSR) generates a page's HTML on the server for each request, so the main content is present in the initial response before any client JavaScript runs. This makes content immediately available to every crawler, including those that do not execute JavaScript, and avoids reliance on Google's deferred render pass. This page explains SSR's crawl benefits, its costs, and how it relates to static rendering and hydration.
- Dynamic rendering and why Google deprecated it
Dynamic rendering served pre-rendered HTML to crawlers and client-rendered content to users, as a stopgap for JavaScript-heavy sites. Google now describes it as a workaround, not a long-term recommendation, and steers sites toward server-side or static rendering with hydration. This page explains what dynamic rendering did, the maintenance and cloaking risks, and the modern alternatives.
- Hydration and crawling
Hydration is the process of attaching client-side JavaScript behavior to HTML that was already rendered on the server or statically. For crawling, the key point is that the content is in the initial HTML, so it is visible to crawlers regardless of hydration; hydration mainly affects interactivity and responsiveness. This page explains hydration, its crawl implications, and the INP trade-offs of heavy hydration.
- Article structured data
Article structured data (Article, NewsArticle, BlogPosting from schema.org) marks up news, blog, and editorial pages so Google can better understand and present them, including in features like Top stories. This page covers the type choice, the properties Google recommends (headline, image, dates, author), and how to validate the markup with the Rich Results Test and Search Console.
- Product structured data
Product structured data uses schema.org Product with a nested Offer and optionally aggregate-rating and review data to describe items for sale, enabling product rich results such as price, availability, and review snippets. This page covers required and recommended fields, Google's policies on review data, and how to validate the markup with the Rich Results Test and Search Console.
- FAQ structured data
FAQ structured data uses schema.org FAQPage to mark up a list of questions and their answers. Note that Google narrowed FAQ rich-result eligibility in 2023 to well-known authoritative government and health sites, so most sites no longer get the visual rich result. This page explains correct FAQPage markup, the eligibility change, and how to validate it.
- Breadcrumb structured data
Breadcrumb structured data uses schema.org BreadcrumbList to describe the trail of pages leading to the current page, helping Google show a breadcrumb path in search results instead of a plain URL. This page covers the itemListElement array of ListItem entries, the position, name, and item properties Google requires, multiple-trail handling, and validation.
- Review and aggregate-rating structured data
Review structured data (the schema.org Review and AggregateRating types) can produce review star snippets in search results when attached to a supported item such as a Product, Recipe, or Book. Google enforces strict policies: ratings must come from genuine reviews, self-serving reviews of your own business are not eligible, and only certain schema types support the snippet. This page explains correct usage and the rules.
- Sitemap index files
A sitemap index file lists other sitemaps, letting large sites stay within the 50,000-URL and 50MB-per-file limits while exposing all URLs through one submitted entry point. This page explains the sitemapindex format, the limits that apply to the index file itself (up to 50,000 sitemap entries and 50MB), and best practices for organizing and submitting multiple sitemaps.
- Image sitemaps
Image sitemap information uses Google's image sitemap extension to list images associated with a page, helping Google discover images it might not otherwise find — for example those loaded via JavaScript or referenced in CSS. This page covers the image namespace, the per-page image limit, and when image sitemap data is worth adding.
- Video sitemaps
Video sitemap information uses Google's video sitemap extension to describe videos on a page — title, description, thumbnail, and either a content or player URL — so Google can discover and understand them for video features. This page covers the required video namespace tags, the relationship to VideoObject structured data, and common pitfalls.
- News sitemaps
A News sitemap uses Google's news sitemap extension to help Google News discover recent articles. It is specialized: include only articles published in the last two days, limit it to 1,000 URLs, and update it as new articles appear. This page covers the news namespace tags, the constraints, and how News sitemaps differ from standard sitemaps.
- Sitemap lastmod accuracy
The lastmod element in a sitemap reports when a URL's content last changed. Google uses lastmod to prioritize recrawling only when the value is consistently accurate; if every URL simply carries the date the sitemap was generated, or the same date as the homepage, Google learns to distrust and ignore it. This page explains correct lastmod semantics, format, and the consequences of inaccuracy.
- Hreflang return-tag errors
The hreflang return-tag rule requires that every URL in a language cluster references every other URL, and that each referenced URL points back. A missing back-reference is a no-return-tag error, which invalidates that pairing and is reported in Search Console. This page explains reciprocity, self-referencing, single-method consistency, and how to find and fix return-tag problems.
- Server response time and crawling
Server response time directly affects how much Google can crawl. Googlebot adjusts its crawl rate to avoid overloading a server, so consistently slow responses reduce the number of pages it fetches. Persistent slowness or 5xx errors cause Google to back off. This page explains the crawl-rate-versus-response-time relationship, its connection to time-to-first-byte, and how to keep responses fast under crawl load.
- Nofollow and crawling
rel=nofollow tells search engines you do not vouch for a link. Google announced in 2019 that it treats nofollow (and the related sponsored and ugc values) as hints rather than strict directives, with the change applying to crawling and indexing from March 2020. This page explains the link attributes, why nofollow is not a reliable way to control crawling, and how it differs from robots.txt and noindex.
- The IndexNow protocol
IndexNow is an open protocol that lets a site notify participating search engines (including Microsoft Bing and Yandex) the moment a URL is added, updated, or deleted. You submit URLs with a shared key file hosted on your domain; one ping is shared across participating engines. It complements XML sitemaps but does not replace them, and it does not guarantee indexing — it only signals that a recrawl may be worthwhile.
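A sketch of the JSON body for a bulk IndexNow submission, assuming the key file sits at the site root as `<key>.txt`. The function name is hypothetical and the example does not send anything; the payload would be POSTed as JSON to a participating engine's IndexNow endpoint.

```python
import json
from urllib.parse import urlsplit


def build_indexnow_payload(host, key, urls):
    """Return the JSON body for an IndexNow bulk submission.

    Assumes the key file is hosted at https://<host>/<key>.txt and that
    every submitted URL belongs to `host`.
    """
    foreign = [url for url in urls if urlsplit(url).hostname != host]
    if foreign:
        raise ValueError(f"URLs outside {host}: {foreign}")
    return json.dumps(
        {
            "host": host,
            "key": key,
            "keyLocation": f"https://{host}/{key}.txt",
            "urlList": list(urls),
        }
    )
```

Submitting this body only signals that the URLs changed; as noted above, it does not guarantee indexing.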
- Bing URL submission and Webmaster Tools
Bing Webmaster Tools lets verified site owners submit URLs to encourage Bing to crawl them, both manually and through a submission API, subject to per-site daily quotas. Bing also supports IndexNow for the same discovery purpose. Submission is a discovery hint, not an indexing guarantee, and quotas scale with the site rather than being unlimited.
- HTTP/2 and HTTP/3 and crawling
Googlebot supports crawling over HTTP/2 where it is beneficial, and HTTP/2 and HTTP/3 improve connection efficiency through multiplexing and reduced overhead. Switching transport does not by itself change rankings, but it can make crawling more efficient and reduce server load. Google may crawl over HTTP/2 or fall back to HTTP/1.1 depending on what the server supports and what is efficient.
- Time to first byte (TTFB) and crawl health
Time to first byte (TTFB) measures how long the server takes to start sending a response. High TTFB slows every fetch and, when sustained, can cause Googlebot to crawl more conservatively because slow responses signal the server is under strain. TTFB is a server-and-network metric distinct from rendering metrics like LCP, and improving it benefits both crawlers and users.
- Edge rendering and SEO
Edge rendering runs page assembly or rendering at CDN points of presence close to the requester, lowering latency for crawlers and users. It can improve TTFB and crawl efficiency, but the SEO requirement is unchanged: the response the crawler receives must contain the complete, indexable content. Edge logic that branches on geography, cookies, or headers can accidentally serve crawlers a different or incomplete page.
- Security headers (CSP/HSTS) and crawling
Security headers such as HTTP Strict-Transport-Security (HSTS) and Content-Security-Policy (CSP) harden a site against attacks, but they interact with crawling and rendering. HSTS pushes everyone, including crawlers, to HTTPS. An over-restrictive CSP can block the scripts, styles, or fonts a rendering crawler loads, producing a rendered page that differs from what users see. Headers are not a substitute for robots controls.
- Mixed content and crawlability
Mixed content occurs when an HTTPS page loads subresources over insecure HTTP. Modern browsers block active mixed content (scripts, stylesheets, iframes) and increasingly upgrade or block passive mixed content too. A rendering crawler behaves similarly, so mixed content can leave the indexed, rendered page missing scripts, styles, or images. Fixing it means serving every subresource over HTTPS.
- Meta refresh redirects and crawlers
A meta refresh redirect uses an HTML meta tag to send the browser to another URL after a delay. Google can follow meta refreshes, but its guidance is to prefer server-side HTTP redirects (301/302) because they are clearer, faster, and unambiguous. An instant (zero-delay) meta refresh is treated more like a redirect, while a delayed one is weaker and can confuse users and crawlers.
- JavaScript redirects and crawling
A JavaScript redirect changes the location in script (for example via window.location) and only executes after the page is fetched and rendered. Google can follow JavaScript redirects once it renders the page, but its guidance is to prefer server-side HTTP redirects because they are processed immediately and unambiguously. JavaScript redirects add latency and depend on successful rendering.
- Lazy loading and crawlability
Lazy loading defers loading images, iframes, or content until they are near the viewport, improving performance. The crawl risk is that content which only loads on scroll or interaction may never load for a crawler that does not scroll like a user. Google recommends native lazy-loading (loading=lazy) or implementations that make deferred content discoverable, and verifying that lazy content appears in the rendered HTML.
- Field (RUM) vs lab data for crawl health
Field data (real-user monitoring) captures performance experienced by actual visitors over time, while lab data comes from a single controlled synthetic test. Google's Core Web Vitals assessment for Search uses field data from real users, not lab scores. Lab tools are for debugging; field data is the verdict. Crawlers experience something closer to a cold lab fetch, so neither dataset alone fully describes crawl-time performance.
- Crawl anomaly detection
Crawl anomaly detection means watching crawl volume, response codes, and crawl timing for unexpected changes — a sharp drop in crawled pages, a surge in 5xx errors, a spike in requests to a single path, or crawling of URLs that should not exist. The Crawl Stats report and server logs are the primary data. Anomalies usually trace to server health, a misconfiguration, or a crawl trap rather than a ranking event.
- Accelerated indexing myths
A common myth is that submitting URLs — via sitemaps, IndexNow, the URL Inspection tool, or third-party services — forces instant indexing. In reality these speed discovery; the indexing decision still depends on Google's quality assessment, duplication checks, and crawl budget. Google's Indexing API is limited to specific content types (job postings and livestream structured data), not general pages. There is no documented way to guarantee instant indexing.
- AMP deprecation and crawling
Google removed the AMP requirement for the Top Stories carousel and retired the AMP badge in Search, so AMP is no longer a prerequisite for those features. Sites moving off AMP must handle the transition carefully: redirect AMP URLs to canonical pages, update canonical and sitemap signals, and ensure the non-AMP page is fast and indexable so crawling and rankings are preserved.
- Infinite scroll and crawling
Infinite scroll appends new content as a user scrolls, which is good UX but hides content from crawlers that do not scroll. Google's guidance is to support infinite scroll with crawlable, paginated URLs — each chunk of content reachable at its own URL — so crawlers can discover everything via links, not scroll events. Without paginated URLs behind it, content beyond the first load may never be indexed.
- Cookie walls, consent banners, and crawling
Cookie walls and consent banners gate access until a visitor responds. The crawl risk is twofold: an interstitial that hides the page content from crawlers, and consent logic that blocks scripts or resources the rendering crawler needs. Crawlers do not click consent buttons, so content reachable only after consent may be invisible. Keep the indexable content accessible and ensure the banner does not strip the rendered page.
- Viewport meta tag and mobile crawling
With mobile-first indexing, Google predominantly crawls and indexes using the smartphone Googlebot, so the mobile rendering of a page is what matters. A correct viewport meta tag (width=device-width, initial-scale=1) is required for responsive design to render at the right width; a missing or wrong viewport causes the page to render as a scaled-down desktop layout, hurting mobile usability and how the page is assessed.
- Pre-rendering services and crawling
Pre-rendering generates a static HTML snapshot of a JavaScript-heavy page and serves it to crawlers, so they receive fully rendered content without executing the app. This is the approach behind dynamic rendering, which Google now describes as a workaround rather than a long-term recommendation, favoring server-side rendering or static rendering with hydration. Key risks are snapshot staleness and content parity with what users see.