Sitemaps are a simple idea: give crawlers a clean list of URLs you want discovered. In practice, sitemaps can quietly rot. Incorrect URLs, wrong lastmod dates, redirects, and duplicates waste crawl budget and slow down discovery of new pages. Good sitemap hygiene is mostly consistency and automation.
List canonical URLs only
Your sitemap should contain the canonical version of each page, and only one entry per page. Common sources of duplication:
- Multiple formats for the same page (
/pagevs/page.html). - Index variants (
/section/vs/section/index.html). - Tracking parameters (
?utm_source=etc).
Canonical tags and consistent internal linking are the companion controls (see canonical URLs).
Make lastmod trustworthy
lastmod is a hint. If you set it to "today" for every URL on every deploy, crawlers learn to ignore it. Prefer:
- Content-based updates. Update
lastmodwhen the page content changes meaningfully. - Stable timestamps. Use build times only if they actually reflect content changes.
- Section awareness. Blog indexes may change more often than individual posts.
Split sitemaps as you grow
Sitemaps have limits (URL count and file size). Even before you hit limits, splitting can help:
- Separate static pages vs blog posts vs programmatic pages.
- Update the frequently-changing sitemap more often.
- Use a sitemap index if you maintain multiple sitemap files.
Automate generation and validation
Manual sitemaps fail because humans forget. Add a build step that:
- Extracts canonical URLs from HTML pages.
- Writes
sitemap.xmldeterministically. - Fails the build if required meta tags are missing or canonicals collide.
If you publish at scale, treat sitemap generation as a production pipeline, not a one-off task.
Keep robots.txt and sitemap aligned
Make it easy for crawlers to find the sitemap by listing it in robots.txt. If you move the sitemap URL, update both robots.txt and Search Console properties.
Sitemap hygiene is boring, but it is one of the clearest signals you can give crawlers: here are the pages that matter, in the format that matters, with dates you can trust.