
XML sitemap hygiene issues

Stale URLs, oversized files, lying lastmod dates, and sitemaps listing non-indexable URLs all degrade the signal. How to keep a sitemap that Google trusts.

A sitemap is a list of URLs you want Google to know about, with optional metadata (lastmod, changefreq, priority). Google treats it as a hint — it does not guarantee indexing, but it strongly influences crawl prioritization. A clean sitemap is one of the cheapest ways to get new and updated content discovered faster. A dirty sitemap actively trains Google to ignore the signal.

Why it matters

Three failure modes that quietly reduce crawl efficiency:

  1. Lying lastmod. If every URL claims it was updated yesterday but the content hasn’t changed in two years, Google detects the lie and starts ignoring lastmod site-wide. This is the most common sitemap sin.
  2. Including non-indexable URLs. Sitemaps that contain noindexed pages, 404s, redirects, or non-canonical URLs send mixed signals. Google’s documentation explicitly tells you to only include URLs you want indexed.
  3. Oversized files. The hard limits are 50 MB uncompressed and 50,000 URLs per sitemap file. Exceed either and Google may silently fail to parse the entire file.

changefreq and priority have been largely ignored by Google since at least 2017. Don’t spend effort on them; do spend effort on accurate lastmod.

How to detect it

Pull and inspect:

curl -s https://example.com/sitemap.xml | head -50

For sitemap index files (a sitemap that lists other sitemaps):

curl -s https://example.com/sitemap.xml \
  | grep -oE '<loc>[^<]+</loc>' \
  | sed 's/<loc>//;s/<\/loc>//' \
  | head -20

Key checks to run programmatically:

  1. Does every URL return 200? Check a sample.
  2. Does every URL self-reference its own canonical? A sitemap URL whose canonical points elsewhere is a contradiction.
  3. Are any URLs noindexed? Same contradiction.
  4. Does lastmod correlate with actual content changes? Spot-check 10 URLs from the sitemap against their visible publish/update dates.
  5. File size and URL count. Under 50 MB and 50,000 URLs per file.
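
Some of these checks need no network access once the XML is in hand. The sketch below assumes a simple, well-formed urlset with no CDATA sections or namespace prefixes. parseSitemap and auditSitemap are hypothetical helper names, and the regex parse is a shortcut, not a real XML parser:

```typescript
// Hypothetical helpers for an offline sitemap audit (names are illustrative).
interface SitemapEntry {
  loc: string;
  lastmod: string | null;
}

interface AuditReport {
  urlCount: number;
  overUrlLimit: boolean;
  duplicates: string[];
  nonAbsolute: string[];
  identicalLastmod: boolean;
}

const MAX_URLS_PER_FILE = 50000;

// Extract <loc> and optional <lastmod> from each <url> element.
function parseSitemap(xml: string): SitemapEntry[] {
  const entries: SitemapEntry[] = [];
  const urlBlock = /<url>([\s\S]*?)<\/url>/g;
  let match: RegExpExecArray | null;
  while ((match = urlBlock.exec(xml)) !== null) {
    const body = match[1];
    const loc = /<loc>\s*([^<]+?)\s*<\/loc>/.exec(body);
    const lastmod = /<lastmod>\s*([^<]+?)\s*<\/lastmod>/.exec(body);
    if (loc) {
      entries.push({ loc: loc[1], lastmod: lastmod ? lastmod[1] : null });
    }
  }
  return entries;
}

function auditSitemap(entries: SitemapEntry[]): AuditReport {
  const seen = new Set<string>();
  const lastmodDates = new Set<string>();
  const duplicates: string[] = [];
  const nonAbsolute: string[] = [];
  for (const entry of entries) {
    if (seen.has(entry.loc)) duplicates.push(entry.loc);
    seen.add(entry.loc);
    if (!/^https?:\/\//i.test(entry.loc)) nonAbsolute.push(entry.loc);
    // Compare only the date portion (YYYY-MM-DD) of each lastmod.
    if (entry.lastmod !== null) lastmodDates.add(entry.lastmod.slice(0, 10));
  }
  return {
    urlCount: entries.length,
    overUrlLimit: entries.length > MAX_URLS_PER_FILE,
    duplicates,
    nonAbsolute,
    // One shared date across many URLs hints at a build-time lastmod; it is not proof.
    identicalLastmod: entries.length > 1 && lastmodDates.size === 1,
  };
}
```

Status codes, canonicals, and noindex directives still require fetching each URL. Treat identicalLastmod as a prompt for the manual spot-check in step 4, not as a verdict.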

Google Search Console → Sitemaps reports submission status, last read date, URL count, and any parse errors. A “Couldn’t fetch” or “Has errors” status there is the clearest sign that the live sitemap is broken.

The fix

Universal — what a clean sitemap looks like

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post-one</loc>
    <lastmod>2026-03-14</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/post-two</loc>
    <lastmod>2025-12-01</lastmod>
  </url>
</urlset>

Rules:

  • Absolute URLs, including protocol.
  • Match the canonical exactly — same case, same trailing slash, same www/non-www.
  • lastmod in W3C Datetime format. Date-only (2026-03-14) is fine; full ISO 8601 with time is also fine.
  • Skip changefreq and priority — Google ignores them.
  • Only include URLs you want indexed.
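
The rules above can be applied at generation time. The sketch below assumes each page record already carries its canonical URL and a real modification timestamp. The Page shape and the function names are illustrative, not part of any library. The sitemap protocol requires entity-escaping in URLs, so the sketch escapes characters such as & before writing them into <loc>:

```typescript
// Hypothetical page record: url is the canonical, absolute URL.
interface Page {
  url: string;
  updatedAt: Date; // when the content actually changed, not the build time
}

// Escape the five XML special characters.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Date-only W3C Datetime (YYYY-MM-DD), taken from the UTC representation.
function toW3cDate(date: Date): string {
  return date.toISOString().slice(0, 10);
}

function buildSitemap(pages: Page[]): string {
  const urls = pages.map(
    (p) =>
      "  <url>\n" +
      `    <loc>${escapeXml(p.url)}</loc>\n` +
      `    <lastmod>${toW3cDate(p.updatedAt)}</lastmod>\n` +
      "  </url>"
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

changefreq and priority are omitted on purpose, in line with the rule above.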

Sitemap index for sites with many URLs

Above 50,000 URLs (or 50 MB uncompressed), split into multiple sitemaps referenced by an index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-03-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-03-14</lastmod>
  </sitemap>
</sitemapindex>

Useful pattern: split by content type so a single batch update of products doesn’t bust the lastmod on the entire site’s sitemap.
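
One way to sketch that split is to chunk each content type under the URL limit and give each child sitemap the newest lastmod among its own entries. The file naming scheme (sitemap-posts-1.xml) and the function names are assumptions made for illustration, and lastmod values are assumed to be date-only strings so they compare correctly as text:

```typescript
interface Entry {
  loc: string;
  lastmod: string; // assumed YYYY-MM-DD, so string comparison orders by date
}

interface ChildSitemap {
  loc: string;
  lastmod: string;
  entries: Entry[];
}

const URL_LIMIT = 50000;

// Split one content type into child sitemaps of at most `limit` URLs each.
function splitByLimit(
  name: string,
  entries: Entry[],
  baseUrl: string,
  limit: number = URL_LIMIT
): ChildSitemap[] {
  if (limit <= 0) throw new Error("limit must be positive");
  const children: ChildSitemap[] = [];
  for (let i = 0; i < entries.length; i += limit) {
    const slice = entries.slice(i, i + limit);
    // The child's lastmod is the newest lastmod among its own URLs.
    const newest = slice.reduce(
      (max, e) => (e.lastmod > max ? e.lastmod : max),
      slice[0].lastmod
    );
    children.push({
      loc: `${baseUrl}/sitemap-${name}-${children.length + 1}.xml`,
      lastmod: newest,
      entries: slice,
    });
  }
  return children;
}

// Assumes baseUrl and name contain no characters that need XML escaping.
function buildSitemapIndex(children: ChildSitemap[]): string {
  const items = children.map(
    (c) =>
      "  <sitemap>\n" +
      `    <loc>${c.loc}</loc>\n` +
      `    <lastmod>${c.lastmod}</lastmod>\n` +
      "  </sitemap>"
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...items,
    "</sitemapindex>",
  ].join("\n");
}
```

Calling splitByLimit once per content type and concatenating the results keeps a product batch update from changing the lastmod of the post sitemaps.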

Submit and discover

Tell Google about the sitemap two ways:

  1. robots.txt declaration (universal):
Sitemap: https://example.com/sitemap.xml
  2. Search Console → Sitemaps → submit URL. Required for the Search Console parse status report.
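
To confirm that the robots.txt declaration is actually present, the Sitemap lines can be pulled out of the file's text. A minimal sketch, assuming the robots.txt body has already been fetched into a string (sitemapUrlsFromRobots is a hypothetical helper name). The field name is matched case-insensitively:

```typescript
// Return every URL declared on a "Sitemap:" line of a robots.txt body.
function sitemapUrlsFromRobots(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Commented-out lines start with "#" and therefore never match.
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) urls.push(match[1]);
  }
  return urls;
}
```

An empty result means crawlers that rely on robots.txt have no declared sitemap to follow.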

Specialized sitemaps

For news and media content, Google supports dedicated sitemap formats:

  • News sitemap: for sites in Google News. Lists URLs published in the last 48 hours, with <news:publication> metadata.
  • Image sitemap: an extension that lists images per page. Useful only if image search is a meaningful traffic channel.
  • Video sitemap: an extension that helps Google discover videos and their metadata. Not strictly required, but recommended if video results matter to you.

WordPress

WP 5.5+ generates /wp-sitemap.xml natively. Yoast and Rank Math generate richer sitemaps with more granular splits. Disable one if both are active — competing sitemaps confuse Search Console.

Next.js (App Router)

// app/sitemap.ts
import type { MetadataRoute } from "next";
// Hypothetical path: import your own database client here.
// The query below assumes a Drizzle-style API.
import { db } from "@/lib/db";

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  const posts = await db.query.posts.findMany();
  return posts.map((p) => ({
    url: `https://example.com/blog/${p.slug}`,
    lastModified: p.updatedAt, // the stored updated_at value, not Date.now()
  }));
}

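Next.js does not split a large sitemap for you. Above 50,000 URLs, export generateSitemaps from the same file to emit several sitemap files, each under the limit, and make sure every one of them is declared in robots.txt or in a sitemap index.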

Common pitfalls

  • lastmod set to now() on every build. Looks innocent; trains Google to ignore the signal. Use the actual content update timestamp.
  • Including noindexed URLs. Contradicts the sitemap’s purpose. Filter them out at generation time.
  • Including URLs that canonicalize elsewhere. Same contradiction.
  • Sitemap on HTTP when the site is HTTPS. The protocol should match the canonical site, both in the Sitemap: line and in every <loc>. Sitemap: http://example.com/sitemap.xml on an HTTPS site points crawlers at the wrong origin.
  • Listing URLs with tracking params. ?utm_source=email in a sitemap creates duplicate URLs and dilutes canonical signals.
  • Sitemap returning 200 with HTML instead of XML. Some routers serve a friendly 404 page at /sitemap.xml. Verify that the response has Content-Type: application/xml (or text/xml) and that the body starts with an XML declaration.
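
Several of these pitfalls can be filtered out at generation time. The sketch below assumes each candidate record already carries its status code, canonical URL, and noindex flag from your own crawl or CMS data. The Candidate shape, the helper name, and the list of tracking parameters are illustrative assumptions:

```typescript
// Hypothetical candidate record, populated from a crawl or from CMS data.
interface Candidate {
  url: string;
  canonical: string;
  noindex: boolean;
  status: number;
}

// Common tracking parameters; extend the list for your own campaigns.
const TRACKING_PARAM = /[?&](utm_[a-z_]+|gclid|fbclid)=/i;

// A URL belongs in the sitemap only if it is indexable and self-canonical.
function isSitemapEligible(page: Candidate): boolean {
  return (
    page.status === 200 &&
    !page.noindex &&
    page.canonical === page.url &&
    !TRACKING_PARAM.test(page.url)
  );
}
```

Applied as candidates.filter(isSitemapEligible) before the XML is written, this keeps redirects, noindexed pages, non-canonical URLs, and tracking variants out of the file.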