Diagnosing XML sitemap errors
An XML sitemap helps search engines discover and prioritise your URLs, but a sitemap full of the wrong URLs sends mixed signals. Common errors are listing redirecting or non-200 URLs, listing noindex or canonicalised-away pages, exceeding the 50,000-URL or 50 MB limit without splitting, and referencing the wrong protocol or host. A clean sitemap lists only canonical, indexable, 200-returning URLs.
What a sitemap is for
An XML sitemap is a list of URLs you want search engines to know about, optionally with lastmod hints. It aids discovery, especially for large sites, new pages, or pages with few internal links. It is a suggestion, not a guarantee of crawling or indexing.
The sitemaps.org format and Google's documentation set the rules: a single sitemap file may contain up to 50,000 URLs and must not exceed 50 MB uncompressed; larger sites split into multiple files referenced by a sitemap index.
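The two limits above can be checked mechanically before a file is published. The following is a minimal sketch, assuming the sitemap is already loaded as uncompressed bytes; the function name check_sitemap_limits and the shape of its return value are illustrative choices, not part of any standard tool.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 50 MB, measured on the uncompressed file


def check_sitemap_limits(xml_bytes: bytes) -> dict:
    """Return the URL count, byte size, and whether both limits are met.

    Hypothetical helper for illustration. ET.fromstring raises
    xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    root = ET.fromstring(xml_bytes)
    url_count = len(root.findall(f"{{{SITEMAP_NS}}}url"))
    size = len(xml_bytes)
    return {
        "url_count": url_count,
        "size_bytes": size,
        "within_limits": url_count <= MAX_URLS and size <= MAX_BYTES,
    }
```

Because parsing happens first, the same call also surfaces malformed XML as an exception rather than a silent miscount.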
Common sitemap errors
The most damaging errors are including URLs that should not be there: pages that 301-redirect, return 404/410, are blocked by robots.txt, carry a noindex, or are canonicalised to a different URL. Each contradicts the 'index this' implication of listing it.
Other faults include exceeding the size limits without splitting, mixing http and https or www and non-www in ways that disagree with your canonical host, malformed XML, and lastmod values that are stale or never change. URLs on a different host than the sitemap's own are also generally ignored unless you have proven ownership of both hosts.
- Listing redirecting or non-200 URLs
- Including noindex or canonicalised-away pages
- Exceeding 50,000 URLs or 50 MB per file without splitting
- Inconsistent protocol/host vs your canonical
- Malformed XML or stale lastmod values
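The protocol/host check in the list above is easy to automate. This sketch assumes you already have the sitemap's URLs as a list of strings and know your canonical origin; find_host_mismatches is a hypothetical name, and the comparison deliberately ignores ports and paths.

```python
from urllib.parse import urlsplit


def find_host_mismatches(urls, canonical_origin):
    """Return URLs whose scheme or host differs from the canonical origin.

    Illustrative helper: compares only scheme and hostname, so
    http vs https and www vs non-www differences are both caught.
    """
    want = urlsplit(canonical_origin)
    expected = (want.scheme, want.hostname or "")
    mismatches = []
    for url in urls:
        parts = urlsplit(url)
        # urlsplit lowercases both the scheme and the hostname
        if (parts.scheme, parts.hostname or "") != expected:
            mismatches.append(url)
    return mismatches
```

For example, with a canonical origin of https://www.example.com, an entry such as http://www.example.com/b or https://example.com/c would be returned as a mismatch.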
How to diagnose and fix
Generate the sitemap from your canonical, indexable URL set only, so it cannot drift from what you actually want indexed. Validate that every listed URL returns 200 and declares itself as its own canonical. Keep lastmod honest: only update it when the content meaningfully changes.
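The validation step can be expressed as a small rule set applied to each listed URL. The sketch below assumes some crawler or script of your own has already fetched each URL and recorded its status, canonical target, and noindex state; PageFacts and sitemap_problems are hypothetical names introduced for illustration, and the canonical comparison is a plain string match with no URL normalisation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageFacts:
    """What one fetch of a sitemap URL revealed (collected elsewhere)."""
    url: str
    status: int
    canonical: Optional[str] = None  # href of rel=canonical, if present
    noindex: bool = False            # meta robots or X-Robots-Tag noindex


def sitemap_problems(page: PageFacts) -> list[str]:
    """List the reasons this URL should not be in the sitemap."""
    problems = []
    if 300 <= page.status < 400:
        problems.append(f"redirects ({page.status})")
    elif page.status != 200:
        problems.append(f"non-200 status ({page.status})")
    if page.noindex:
        problems.append("noindex")
    if page.canonical and page.canonical != page.url:
        problems.append(f"canonicalised to {page.canonical}")
    return problems
```

A URL with an empty problem list is a candidate to keep; anything else should either be fixed at the source or dropped from the sitemap.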
Submit the sitemap in Search Console and review its sitemap report for parse errors and the count of discovered versus indexed URLs. Split oversized sitemaps with a sitemap index file.
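Splitting with a sitemap index might look like the sketch below. It assumes a flat list of URLs and a base URL where the files will be served; the file names (sitemap-1.xml, sitemap-index.xml) and the build_sitemaps function are illustrative choices. It splits on URL count only, so very long URLs could still push a chunk past 50 MB and would need a separate size check.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS = 50_000


def build_sitemaps(urls, base_url, chunk_size=MAX_URLS):
    """Return {filename: xml_bytes} for each child sitemap plus the index."""
    ET.register_namespace("", SITEMAP_NS)  # serialise with a default xmlns
    files = {}
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for n, start in enumerate(range(0, len(urls), chunk_size), start=1):
        urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
        for url in urls[start:start + chunk_size]:
            entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
        name = f"sitemap-{n}.xml"
        files[name] = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
        # Reference the child file from the index
        ref = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(ref, f"{{{SITEMAP_NS}}}loc").text = f"{base_url}/{name}"
    files["sitemap-index.xml"] = ET.tostring(
        index, encoding="utf-8", xml_declaration=True
    )
    return files
```

Only the index file then needs to be submitted; the child sitemaps are discovered through it.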
How it appears in analytics and logs
A sitemap listing non-canonical, redirecting, or noindex URLs sends conflicting signals and can waste crawl attention on URLs you do not want indexed. In server logs this shows up as crawler fetches of sitemap-listed URLs that return something other than 200. It is a discovery-quality issue: errors rarely block crawling entirely but degrade how efficiently crawlers prioritise.
Diagnostic use case
Audit a sitemap so it lists only canonical, indexable, 200-returning URLs within the size limits, improving the quality of the discovery signal you send crawlers.
What WebmasterID can help detect
WebmasterID records the status codes crawlers receive for URLs, helping you verify that the pages listed in your sitemap actually return 200 and are reached by crawlers.
Common mistakes
- Listing redirecting or 404 URLs in the sitemap.
- Including noindex or canonicalised-away pages alongside indexable ones.
- Exceeding the 50,000-URL / 50 MB limit without using a sitemap index.
- Faking lastmod so every URL always looks freshly updated.
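The last mistake in the list above can often be spotted from the sitemap alone: if nearly every URL shares one lastmod date, the values are probably generated at build time rather than from real content changes. This is a rough heuristic sketch; lastmod_looks_faked and its 0.9 threshold are assumptions for illustration, and a genuine site-wide update would trigger it too.

```python
from collections import Counter


def lastmod_looks_faked(lastmods, threshold=0.9):
    """Flag a sitemap where almost every URL shares one lastmod date.

    lastmods: iterable of W3C datetime strings such as "2024-05-01"
    or "2024-05-01T10:00:00Z". Only the date part is compared.
    """
    dates = [value[:10] for value in lastmods if value]
    if len(dates) < 2:
        return False  # too little data to judge
    _, top_count = Counter(dates).most_common(1)[0]
    return top_count / len(dates) >= threshold
```

Treat a positive result as a prompt to inspect how lastmod is generated, not as proof of a problem.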
Privacy and accuracy notes
Sitemap auditing uses your published URL list and the status codes crawlers receive, not visitor data. WebmasterID records crawler fetches without attaching them to any person.
Related pages
- Orphan pages diagnosis
An orphan page is one that no internal link points to. Crawlers discover pages mainly by following links, so an orphan is hard to find — it may exist only in a sitemap or be effectively invisible. Diagnosing orphans means comparing all known URLs against your internal link graph and fixing the gap with links.
- Canonical mismatch diagnosis
A canonical mismatch happens when your rel=canonical tag points one way while redirects, sitemaps, internal links, or hreflang point another. Conflicting signals confuse which URL should represent a piece of content, so crawlers may pick a canonical you did not intend. Aligning the signals fixes it.
- Diagnosing index bloat
Index bloat is when a site has far more URLs indexed than it has genuinely valuable, distinct pages. It comes from faceted-navigation variants, tracking parameters, paginated and filtered duplicates, thin or auto-generated pages, and internal search results. Bloat dilutes crawl attention and can bury your important pages among low-value ones. Diagnosis means comparing indexed counts to your real page inventory.
- Website observability
Confirm that sitemap URLs return 200 and are actually crawled, using fetches recorded server-side.
Sources and verification notes
Last reviewed 2026-06-24. Facts are checked against primary/official sources where available; uncertain specifics are marked “Data not yet verified” rather than guessed.