A sitemap is a list of URLs you want Google to know about, with optional metadata (lastmod, changefreq, priority). Google treats it as a hint — it does not guarantee indexing, but it strongly influences crawl prioritization. A clean sitemap is one of the cheapest ways to get new and updated content discovered faster. A dirty sitemap actively trains Google to ignore the signal.
Why it matters
Three failure modes that quietly reduce crawl efficiency:
- Lying lastmod. If every URL claims it was updated yesterday but the content hasn’t changed in two years, Google detects the lie and starts ignoring lastmod site-wide. This is the most common sitemap sin.
- Including non-indexable URLs. Sitemaps that contain noindexed pages, 404s, redirects, or non-canonical URLs send mixed signals. Google’s documentation explicitly tells you to only include URLs you want indexed.
- Oversize files. The hard limits are 50 MB uncompressed and 50,000 URLs per sitemap file. Hit either and Google may fail to parse the whole thing silently.
changefreq and priority have been largely ignored by Google since at least 2017. Don’t spend effort on them; do spend effort on accurate lastmod.
How to detect it
Pull and inspect:
curl -s https://example.com/sitemap.xml | head -50
For sitemap index files (a sitemap that lists other sitemaps):
curl -s https://example.com/sitemap.xml \
| grep -oE '<loc>[^<]+</loc>' \
| sed 's/<loc>//;s/<\/loc>//' \
| head -20
Key checks to run programmatically:
- Does every URL return 200? Check a sample.
- Does every URL self-reference its own canonical? A sitemap URL whose canonical points elsewhere is a contradiction.
- Are any URLs noindexed? Same contradiction.
- Does lastmod correlate with actual content changes? Spot-check 10 URLs from the sitemap against their visible publish/update dates.
- File size and URL count. Under 50 MB and 50,000 URLs per file.
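The checks above can be sketched as a few pure functions. This is a minimal sketch, not a complete crawler: it assumes you have already fetched each page and extracted its status code, canonical link, and noindex state. The names PageFacts, auditEntry, auditFile, and lastmodDriftDays are illustrative and not taken from any library.

```typescript
// Facts gathered about one sitemap URL (hypothetical shape; fill it from your own fetcher).
interface PageFacts {
  url: string;              // URL exactly as listed in the sitemap
  status: number;           // HTTP status returned for that URL
  canonical: string | null; // href of <link rel="canonical">, or null if absent
  noindex: boolean;         // true if a robots meta tag or X-Robots-Tag says noindex
}

// Return a list of human-readable problems for one sitemap entry.
function auditEntry(page: PageFacts): string[] {
  const problems: string[] = [];
  if (page.status !== 200) {
    problems.push(`returns ${page.status}, expected 200`);
  }
  if (page.canonical !== null && page.canonical !== page.url) {
    problems.push(`canonical points to ${page.canonical}`);
  }
  if (page.noindex) {
    problems.push("is noindexed");
  }
  return problems;
}

const MAX_URLS = 50000;
const MAX_BYTES = 50 * 1024 * 1024; // 50 MB, measured uncompressed

// Check one sitemap file against the per-file limits.
function auditFile(urlCount: number, uncompressedBytes: number): string[] {
  const problems: string[] = [];
  if (urlCount > MAX_URLS) problems.push(`${urlCount} URLs exceeds ${MAX_URLS}`);
  if (uncompressedBytes > MAX_BYTES) problems.push(`${uncompressedBytes} bytes exceeds ${MAX_BYTES}`);
  return problems;
}

// Days between the sitemap's lastmod and the date shown on the page.
// A large positive number suggests lastmod is newer than the content really is.
function lastmodDriftDays(lastmod: string, visibleDate: string): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  return Math.round((Date.parse(lastmod) - Date.parse(visibleDate)) / msPerDay);
}
```

Running auditEntry over a random sample of entries is usually enough to tell whether a problem is isolated or systemic.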
Google Search Console → Sitemaps reports submission status, last read date, URL count, and any parse errors. The “Couldn’t fetch” or “Has errors” status is the production signal.
The fix
Universal — what a clean sitemap looks like
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/blog/post-one</loc>
    <lastmod>2026-03-14</lastmod>
  </url>
  <url>
    <loc>https://example.com/blog/post-two</loc>
    <lastmod>2025-12-01</lastmod>
  </url>
</urlset>
Rules:
- Absolute URLs, including protocol.
- Match the canonical exactly — same case, same trailing slash, same www/non-www.
- lastmod in W3C Datetime format. Date-only (2026-03-14) is fine; full ISO 8601 with time is also fine.
- Skip changefreq and priority; Google ignores them.
- Only include URLs you want indexed.
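One way to apply these rules at generation time is sketched below. It assumes the caller already passes absolute, canonical, indexable URLs. SitemapEntry, escapeXml, toW3cDate, and buildSitemap are hypothetical helper names, not part of any framework.

```typescript
interface SitemapEntry {
  loc: string;        // absolute canonical URL, including protocol
  lastModified: Date; // when the content actually changed
}

// Entity-encode the five characters that XML treats as special.
function escapeXml(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Date-only W3C Datetime (YYYY-MM-DD), computed in UTC.
function toW3cDate(date: Date): string {
  return date.toISOString().slice(0, 10);
}

// Render a urlset with only loc and lastmod, omitting changefreq and priority.
function buildSitemap(entries: SitemapEntry[]): string {
  const urls = entries.map(
    (e) =>
      `  <url>\n    <loc>${escapeXml(e.loc)}</loc>\n` +
      `    <lastmod>${toW3cDate(e.lastModified)}</lastmod>\n  </url>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...urls,
    "</urlset>",
  ].join("\n");
}
```

Escaping matters mostly for URLs with query strings: a raw & inside a loc element makes the whole file invalid XML.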
Sitemap index for sites with many URLs
Above ~50,000 URLs, split into multiple sitemaps referenced by an index:
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-posts.xml</loc>
    <lastmod>2026-03-14</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
    <lastmod>2026-03-14</lastmod>
  </sitemap>
</sitemapindex>
Useful pattern: split by content type so a single batch update of products doesn’t bust the lastmod on the entire site’s sitemap.
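The splitting step might look like the sketch below, run once per content type. The file naming scheme (sitemap-posts-1.xml and so on) and the helper names chunk, newestDate, and buildIndex are assumptions for illustration. Each child's lastmod is taken from the newest entry in that child, so untouched files keep their old date.

```typescript
const MAX_URLS_PER_FILE = 50000;

interface Entry {
  loc: string;
  lastModified: Date;
}

// Split a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// Newest lastModified in a non-empty group, as a date-only string (UTC).
function newestDate(entries: Entry[]): string {
  const newest = entries.reduce((max, e) => Math.max(max, e.lastModified.getTime()), 0);
  return new Date(newest).toISOString().slice(0, 10);
}

// Build the index for one content type, e.g. baseName "sitemap-posts".
// Assumes origin and baseName contain no characters that need XML escaping.
function buildIndex(origin: string, baseName: string, entries: Entry[]): string {
  const items = chunk(entries, MAX_URLS_PER_FILE).map(
    (group, i) =>
      `  <sitemap>\n    <loc>${origin}/${baseName}-${i + 1}.xml</loc>\n` +
      `    <lastmod>${newestDate(group)}</lastmod>\n  </sitemap>`
  );
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...items,
    "</sitemapindex>",
  ].join("\n");
}
```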
Submit and discover
Tell Google about the sitemap two ways:
- robots.txt declaration (universal):
Sitemap: https://example.com/sitemap.xml
- Search Console → Sitemaps → submit URL. Required for the Search Console parse status report.
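To confirm the robots.txt declaration is actually being served, a small parser can list the Sitemap lines it contains. This is a sketch that works on the file's text, which you are assumed to have fetched already. The function name sitemapUrlsFromRobots is illustrative.

```typescript
// Extract every URL declared with a "Sitemap:" line in a robots.txt body.
// The directive name is matched case-insensitively; trailing # comments are dropped.
function sitemapUrlsFromRobots(robotsTxt: string): string[] {
  const urls: string[] = [];
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim();
    const match = /^sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) {
      urls.push(match[1]);
    }
  }
  return urls;
}
```

An empty result on a production site means the declaration is missing or malformed.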
Specialized sitemaps
For high-frequency content, dedicated sitemap formats:
- News sitemap: for sites in Google News. URLs published in the last 48 hours, with <news:publication> metadata.
- Image sitemap: extension that lists images per page. Useful only if image search is a meaningful traffic channel.
- Video sitemap: not strictly required, but recommended so Google can find and understand your videos for its video results.
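For the news sitemap, the 48-hour window is easiest to enforce when the file is generated. The sketch below shows only the filtering step; the Article shape and the function name articlesForNewsSitemap are hypothetical.

```typescript
interface Article {
  url: string;
  title: string;
  publishedAt: Date;
}

const NEWS_WINDOW_MS = 48 * 60 * 60 * 1000; // 48 hours

// Keep only articles published within the last 48 hours relative to `now`.
function articlesForNewsSitemap(articles: Article[], now: Date): Article[] {
  const latest = now.getTime();
  const earliest = latest - NEWS_WINDOW_MS;
  return articles.filter((a) => {
    const published = a.publishedAt.getTime();
    return published >= earliest && published <= latest;
  });
}
```

Passing now as a parameter rather than reading the clock inside the function keeps the filter easy to test.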
WordPress
WP 5.5+ generates /wp-sitemap.xml natively. Yoast and Rank Math generate richer sitemaps with more granular splits. Disable one if both are active — competing sitemaps confuse Search Console.
Next.js (App Router)
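// app/sitemap.ts
import type { MetadataRoute } from "next";
// Assumed: `db` is your app's own database client and this import path is a
// placeholder. The query shape below resembles Drizzle ORM's relational API.
import { db } from "@/lib/db";

export default async function sitemap(): Promise<MetadataRoute.Sitemap> {
  // Load every post that should appear in the sitemap.
  const posts = await db.query.posts.findMany();
  return posts.map((p) => ({
    url: `https://example.com/blog/${p.slug}`,
    lastModified: p.updatedAt, // real updated_at, not Date.now()
  }));
}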
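Next.js does not split large sitemaps for you. Above 50,000 URLs, export a generateSitemaps function alongside the default export so that each generated file stays under the limit.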
Common pitfalls
- lastmod set to now() on every build. Looks innocent; trains Google to ignore the signal. Use the actual content update timestamp.
- Including noindexed URLs. Contradicts the sitemap’s purpose. Filter them out at generation time.
- Including URLs that canonicalize elsewhere. Same contradiction.
- Sitemap on HTTP when the site is HTTPS. The protocol must match. Sitemap: http://example.com/sitemap.xml on an HTTPS site is invalid.
- Listing URLs with tracking params. ?utm_source=email in a sitemap creates duplicate URLs and dilutes canonical signals.
- Sitemap returning 200 with HTML instead of XML. Some routers serve a friendly 404 page at /sitemap.xml. Verify the response is Content-Type: application/xml.
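Several of these pitfalls can be caught with small guards in a build or CI step. The sketch below covers tracking parameters, protocol mismatch, and the Content-Type check. The function names and the list of tracking parameters are illustrative assumptions, and the Content-Type check also accepts text/xml, which is a choice made for this sketch.

```typescript
// Query parameter names treated as tracking noise (illustrative, extend as needed).
const TRACKING_PARAM = /^(utm_|gclid$|fbclid$)/i;

// True if the URL carries any tracking parameter such as utm_source.
function hasTrackingParams(rawUrl: string): boolean {
  let found = false;
  new URL(rawUrl).searchParams.forEach((_value, key) => {
    if (TRACKING_PARAM.test(key)) found = true;
  });
  return found;
}

// True if the sitemap is declared on the same protocol as the site itself.
function protocolMatches(sitemapUrl: string, siteUrl: string): boolean {
  return new URL(sitemapUrl).protocol === new URL(siteUrl).protocol;
}

// True if a Content-Type header value denotes XML rather than an HTML error page.
function isXmlContentType(header: string | null): boolean {
  if (header === null) return false;
  const mediaType = header.split(";")[0].trim().toLowerCase();
  return mediaType === "application/xml" || mediaType === "text/xml";
}
```

Failing the build when any guard trips stops a bad sitemap before Google reads it.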