August 12, 2025 · 3 min read

XML Sitemaps Explained: What They Are and Why Google Needs One

An XML sitemap is the map you hand Google so it can find every page on your site without guessing. Here is what belongs in one, what to leave out, and how to keep it accurate as your site grows.

An XML sitemap is a file — usually at yourdomain.com/sitemap.xml — that lists every URL on your site you want search engines to know about. It does not guarantee indexing, but it removes the guesswork: instead of relying on Google to discover pages by following links, you hand it a complete, structured list.

For small sites with strong internal linking, a sitemap is a nice-to-have. For larger sites, sites with deep page hierarchies, or sites that publish frequently, it is close to mandatory. The crawl budget Google allots to a site is finite, and an accurate sitemap is one of the most direct ways to point that budget at the pages that matter.

What a sitemap actually contains

A valid XML sitemap is a <urlset> root element containing a list of <url> entries, each with a <loc> (the absolute URL) and optionally <lastmod> (last modified date), <changefreq>, and <priority>. Google's documentation says it ignores the changefreq and priority values entirely, but lastmod is genuinely useful when it is kept accurate: it tells crawlers which pages have changed since their last visit, which helps prioritize re-crawling. Google uses lastmod only if it is consistently and verifiably accurate, so update it when the page content changes, not on every build.
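To make the structure concrete, here is a minimal sketch in Python that builds a sitemap with the standard library. The function name build_sitemap and the example.com URLs are placeholders for illustration, not part of any particular CMS or framework.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Build sitemap XML from (absolute_url, last_modified_or_None) pairs.

    build_sitemap is a hypothetical helper name used for this example.
    Returns UTF-8 encoded bytes, including the XML declaration.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod is not None:
            # <lastmod> is optional; a plain YYYY-MM-DD date is a valid value.
            ET.SubElement(url, "lastmod").text = lastmod.isoformat()
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    xml_bytes = build_sitemap([
        ("https://example.com/", date(2025, 8, 12)),
        ("https://example.com/about", None),
    ])
    print(xml_bytes.decode("utf-8"))
```

The sketch leaves out changefreq and priority on purpose, since only loc is required and lastmod is the optional field worth maintaining.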

What belongs in a sitemap — and what doesn't

  • Include: every canonical, indexable page you want in search results — articles, product pages, category pages, tool pages.
  • Exclude: pages blocked by robots.txt, pages with a noindex meta tag, redirected URLs, duplicate or parameterized URLs (such as ?sort=price variants), and pagination beyond the canonical first page in most cases.

A common mistake is generating a sitemap automatically from a CMS without filtering it. This often produces thousands of low-value URLs (tag pages, author archives, internal search result pages) that waste crawl budget and bury the pages you actually care about. A clean, intentional sitemap tells crawlers exactly which URLs you consider canonical and worth indexing.
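The include and exclude rules above can be sketched as a single filter applied before anything is written to the sitemap. The Page record and its fields are assumptions made for this example; a real CMS will expose this information under different names. Dropping every URL that carries a query string is also a simplification that will not suit every site.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class Page:
    """Hypothetical page record; field names are illustrative only."""
    url: str
    status: int = 200                 # HTTP status the URL returns
    noindex: bool = False             # page carries a noindex meta tag
    blocked_by_robots: bool = False   # path is disallowed in robots.txt
    canonical: Optional[str] = None   # canonical URL, if one is declared


def belongs_in_sitemap(page: Page) -> bool:
    """Return True only for canonical, indexable pages."""
    if page.status != 200:
        return False  # redirects and error pages
    if page.noindex or page.blocked_by_robots:
        return False
    if urlsplit(page.url).query:
        return False  # parameterized variants such as ?sort=price
    if page.canonical is not None and page.canonical != page.url:
        return False  # duplicate that points at another URL
    return True


if __name__ == "__main__":
    pages = [
        Page("https://example.com/shoes"),
        Page("https://example.com/shoes?sort=price"),
        Page("https://example.com/old-shoes", status=301),
        Page("https://example.com/search", noindex=True),
    ]
    for page in pages:
        print(page.url, belongs_in_sitemap(page))
```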

Sitemaps and AI crawlers

Traditional XML sitemaps are built for search engine crawlers, but a newer, complementary file, llms.txt, aims to serve a similar purpose for AI systems like ChatGPT and Perplexity. It is a proposed convention rather than an established standard, and support among AI crawlers varies. Where a sitemap is a flat, machine-readable URL list, an llms.txt file is a curated, human-readable Markdown summary that points AI systems toward your most authoritative pages. Some teams now maintain both: an XML sitemap for completeness, and an llms.txt for AI-specific discovery and citation.

How to keep your sitemap accurate

Static sitemaps go stale fast. The most reliable approach is to generate the sitemap dynamically, at build or request time, directly from your routing or CMS data, so it stays in sync with what actually exists on the site. If you're working with a framework like Next.js, this typically means a sitemap route that pulls from the same data source your pages render from, which makes it much harder to list a URL that no longer resolves or to omit one that does.
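If you cannot generate the sitemap dynamically yet, a drift check is a reasonable stopgap. The sketch below compares the URLs listed in an existing sitemap against the set of URLs your site currently serves. The function name sitemap_drift is hypothetical, and how you obtain the set of live URLs (a route manifest, a CMS query) is left as an assumption.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the sitemaps.org namespace, used in the XPath lookup.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_drift(sitemap_xml, live_urls):
    """Return (stale, missing) as two sets of URLs.

    stale:   listed in the sitemap but no longer served by the site
    missing: served by the site but absent from the sitemap
    """
    root = ET.fromstring(sitemap_xml)
    listed = {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", NS)
        if loc.text
    }
    live = set(live_urls)
    return listed - live, live - listed


if __name__ == "__main__":
    sample = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/retired-page</loc></url>"
        "</urlset>"
    )
    stale, missing = sitemap_drift(
        sample, {"https://example.com/", "https://example.com/new-page"}
    )
    print("stale:", sorted(stale))
    print("missing:", sorted(missing))
```

Running a check like this in CI surfaces both failure modes before a crawler does.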

Once your sitemap is live, submit it in Google Search Console under Sitemaps, and reference it from your robots.txt file with a Sitemap: directive so crawlers that don't check Search Console can still find it.
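To confirm the robots.txt reference is in place, Python's standard urllib.robotparser module can read Sitemap: lines (its site_maps() method requires Python 3.8 or later). The robots.txt content and the declared_sitemaps helper below are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt content; the rules and URL are placeholders.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search

Sitemap: https://example.com/sitemap.xml
"""


def declared_sitemaps(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


if __name__ == "__main__":
    print(declared_sitemaps(ROBOTS_TXT))
```

The Sitemap: line takes an absolute URL and is independent of any User-agent group, so it can sit anywhere in the file.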

Sitemap size limits

A single sitemap file is capped at 50,000 URLs and 50 MB uncompressed. Sites that exceed either limit need a sitemap index file: an XML file that lists multiple child sitemaps. Most sites well under that threshold are better served by one clean file than an unnecessarily fragmented set.
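For sites that do cross the limit, the split can be sketched as follows. The helper names and the sitemap-N.xml naming scheme are assumptions for illustration. The sketch only enforces the URL count, so a real implementation would also need to check the 50 MB size limit on each child file.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file URL cap from the sitemap protocol


def chunk_urls(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into consecutive chunks of at most `size`."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]


def build_sitemap_index(child_sitemap_urls):
    """Build a sitemap index that points at each child sitemap file."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for child in child_sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = child
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


if __name__ == "__main__":
    urls = [f"https://example.com/products/{i}" for i in range(120_000)]
    chunks = chunk_urls(urls)
    # Hypothetical naming scheme for the child files.
    children = [
        f"https://example.com/sitemap-{n}.xml"
        for n in range(1, len(chunks) + 1)
    ]
    print(len(chunks), "child sitemaps")
    print(build_sitemap_index(children).decode("utf-8"))
```

You then submit the index file itself, and crawlers follow it to each child sitemap.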

A sitemap won't fix weak content, but it ensures the content you do have gets a fair chance to be crawled, evaluated, and indexed — which is the first prerequisite for ranking at all.