Crawl diagnostics

Diagnosing XML sitemap errors

An XML sitemap helps search engines discover and prioritise your URLs, but a sitemap full of the wrong URLs sends mixed signals. Common errors include listing redirecting or non-200 URLs, including noindex or canonicalised-away pages, exceeding the 50,000-URL or 50 MB limits, or referencing the wrong protocol/host. A clean sitemap lists only canonical, indexable, 200-returning URLs.

What a sitemap is for

An XML sitemap is a list of URLs you want search engines to know about, optionally with lastmod hints. It aids discovery, especially for large sites, new pages, or pages with few internal links. It is a suggestion, not a guarantee of crawling or indexing.

The sitemaps.org format and Google's documentation set the rules: a single sitemap file may contain up to 50,000 URLs and must not exceed 50 MB uncompressed; larger sites split into multiple files referenced by a sitemap index.
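
The per-file limits above can be checked mechanically. The following is a minimal sketch, assuming the sitemap has already been downloaded and decompressed into bytes; the function name check_limits and the shape of the returned dictionary are illustrative choices, not part of any standard tool.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, measured on the uncompressed file


def check_limits(xml_bytes: bytes) -> dict:
    """Count <url> entries and compare the file against the per-file limits.

    Malformed XML raises xml.etree.ElementTree.ParseError, which is itself
    a useful diagnostic.
    """
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{SITEMAP_NS}url"))
    return {
        "url_count": url_count,
        "byte_size": len(xml_bytes),
        "too_many_urls": url_count > MAX_URLS,
        "too_large": len(xml_bytes) > MAX_BYTES,
    }
```

If either flag is true, the file needs to be split and referenced from a sitemap index.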

Common sitemap errors

The most damaging errors are including URLs that should not be there: pages that 301-redirect, return 404/410, are blocked by robots.txt, carry a noindex, or are canonicalised to a different URL. Each contradicts the 'index this' implication of listing it.

Other faults include exceeding the size limits without splitting, mixing http and https or www and non-www inconsistently with your canonical host, malformed XML, and stale lastmod values that never change. URLs on a different host than the one serving the sitemap are generally rejected too, unless cross-site submission has been set up (for example via the robots.txt of the host whose URLs are listed).

Listing redirecting or non-200 URLs
Including noindex or canonicalised-away pages
Exceeding 50,000 URLs or 50 MB per file without splitting
Inconsistent protocol/host vs your canonical
Malformed XML or stale lastmod values
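
The protocol/host inconsistency in the list above is easy to test for once the URLs have been extracted. This is a small sketch, assuming you know your canonical origin; the function name off_canonical and the example.com origin are placeholders.

```python
from urllib.parse import urlsplit


def off_canonical(urls, canonical_origin):
    """Return the URLs whose scheme or host differs from the canonical origin.

    canonical_origin is something like "https://www.example.com".
    Host comparison is case-insensitive; paths are not inspected.
    """
    want = urlsplit(canonical_origin)
    want_key = (want.scheme, want.netloc.lower())
    mismatched = []
    for url in urls:
        got = urlsplit(url)
        if (got.scheme, got.netloc.lower()) != want_key:
            mismatched.append(url)
    return mismatched
```

An empty result means every listed URL uses the same protocol and host as the canonical origin.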

How to diagnose and fix

Generate the sitemap from your canonical, indexable URL set only, so it cannot drift from what you actually want indexed. Validate that every listed URL returns 200 directly, without a redirect, and declares itself as its canonical. Keep lastmod honest: update it only when the content meaningfully changes.
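
The per-URL validation can be sketched as a pure function over an already-fetched response. This is an illustration under assumptions: fetching is left to the caller, the status passed in should be the first response with redirects not followed, and the names audit and _HeadSignals are hypothetical. It only inspects the HTML it is given, so canonicals or noindex directives set elsewhere (for example a canonical in an HTTP Link header) are not covered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HeadSignals(HTMLParser):
    """Collects rel=canonical and meta robots noindex from a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "link" and "canonical" in a.get("rel", "").lower().split():
            self.canonical = a.get("href")
        elif tag == "meta" and a.get("name", "").lower() == "robots":
            if "noindex" in a.get("content", "").lower():
                self.noindex = True


def audit(url, status, html, x_robots_tag=""):
    """Return a list of reasons this URL should not be in the sitemap."""
    if status != 200:
        return [f"status {status}"]
    signals = _HeadSignals()
    signals.feed(html)
    problems = []
    if signals.noindex or "noindex" in x_robots_tag.lower():
        problems.append("noindex")
    if signals.canonical:
        # Resolve relative canonical hrefs against the page URL.
        target = urljoin(url, signals.canonical)
        if target != url:
            problems.append(f"canonical points to {target}")
    return problems
```

An empty list means the URL passed these checks. The comparison is an exact string match, so trailing-slash or case differences are reported as mismatches, which is usually what you want in a sitemap audit.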

Submit the sitemap in Search Console and review its sitemap report for parse errors and the count of discovered versus indexed URLs. Split oversized sitemaps with a sitemap index file.
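
Splitting an oversized sitemap might look like the sketch below. It assumes the URL list fits in memory, and the file names sitemap-N.xml and sitemap-index.xml are arbitrary choices. It splits by URL count only; a real generator would also need to watch the 50 MB size limit.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000


def build_sitemaps(urls, base_url, max_urls=MAX_URLS):
    """Return {filename: xml_bytes} for the child sitemaps and their index.

    base_url is where the files will be served from, e.g. "https://example.com".
    """
    files = {}
    index = ET.Element("sitemapindex", xmlns=NS)
    for n, start in enumerate(range(0, len(urls), max_urls), start=1):
        urlset = ET.Element("urlset", xmlns=NS)
        for url in urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        name = f"sitemap-{n}.xml"
        files[name] = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
        # Each child file gets one <sitemap><loc> entry in the index.
        ET.SubElement(ET.SubElement(index, "sitemap"), "loc").text = f"{base_url}/{name}"
    files["sitemap-index.xml"] = ET.tostring(
        index, encoding="utf-8", xml_declaration=True
    )
    return files
```

You would then submit only the index file; the child sitemaps are discovered through it.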

How it appears in analytics and logs

A sitemap listing non-canonical, redirecting, or noindex URLs sends conflicting signals and can waste crawl attention on URLs you do not want indexed. It is a discovery-quality issue: errors rarely block crawling entirely but degrade how efficiently crawlers prioritise.

Diagnostic use case

Audit a sitemap so it lists only canonical, indexable, 200-returning URLs within the size limits, improving the quality of the discovery signal you send crawlers.

What WebmasterID can help detect

WebmasterID records the status codes crawlers receive for URLs, helping you verify that the pages listed in your sitemap actually return 200 and are reached by crawlers.
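
Cross-referencing sitemap URLs with crawler fetch records could be done along these lines. The record shape used here, an iterable of (url, status) pairs in chronological order, is an assumption for illustration and is not WebmasterID's documented export format; the function name is likewise hypothetical.

```python
def compare_with_crawl_log(sitemap_urls, fetches):
    """Compare sitemap URLs against crawler fetch records.

    fetches: iterable of (url, status) pairs, oldest first (assumed shape).
    Returns the sitemap URLs whose most recent status was not 200, and
    those that no crawler fetched at all.
    """
    last_status = {}
    for url, status in fetches:
        last_status[url] = status  # later records overwrite earlier ones
    return {
        "not_200": {
            u: last_status[u]
            for u in sitemap_urls
            if u in last_status and last_status[u] != 200
        },
        "never_fetched": [u for u in sitemap_urls if u not in last_status],
    }
```

URLs in never_fetched are not necessarily broken; they may simply not have been crawled during the period the records cover.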

Common mistakes

Listing redirecting or 404 URLs in the sitemap.
Including noindex or canonicalised-away pages alongside indexable ones.
Exceeding the 50,000-URL / 50 MB limit without using a sitemap index.
Faking lastmod so every URL always looks freshly updated.
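
One rough way to spot the last mistake is to measure how concentrated the lastmod values are. The sketch below is a heuristic, not an established rule: it reports the most common lastmod day and the share of dated URLs that carry it. The function name and any threshold you apply to the result are assumptions.

```python
from collections import Counter
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def lastmod_concentration(xml_bytes):
    """Return (most_common_day, share) for the lastmod values in a sitemap.

    Only the date part (YYYY-MM-DD) is compared. Returns None when the
    sitemap has no lastmod values.
    """
    root = ET.fromstring(xml_bytes)
    days = [
        el.text.strip()[:10]
        for el in root.findall("sm:url/sm:lastmod", NS)
        if el.text and el.text.strip()
    ]
    if not days:
        return None
    day, count = Counter(days).most_common(1)[0]
    return day, count / len(days)
```

A share close to 1.0 on a large site, particularly when the day is always the generation date, suggests lastmod is being stamped automatically rather than tracking real content changes. A site that was genuinely republished in bulk would produce the same pattern, so treat the result as a prompt to investigate.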

Privacy and accuracy notes

Sitemap auditing uses your published URL list and the status codes crawlers receive, not visitor data. WebmasterID records crawler fetches without attaching them to any person.

Sources and verification notes

Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.