Shopify sitemap.xml for AI shopping agents
A valid /sitemap.xml is the discovery contract between your storefront and every AI shopping agent that wants to ingest your catalog. ChatGPT Shopping, Perplexity Shopping, Google AI Mode, Bing's AI overview, and Shopify's own Global Catalog all check this file first — before they ever fetch a single product page. Default Shopify ships a perfectly valid sitemap auto-generated from your live catalog. The problem: as soon as a store goes headless, switches CDN, password-protects the dev URL, or routes /sitemap.xml through a Cloudflare worker that has the wrong cache rules, this floor signal silently breaks. The store gets quietly excluded from every AI shopping surface that uses sitemaps as a discovery anchor.
The signal checks https://yourstore.com/sitemap.xml over HTTPS: it must return HTTP 200, a parseable XML body, and either a <urlset> with at least one <url> child or a <sitemapindex> with at least one <sitemap> child. Anything else — 404, 403, an HTML login page, an empty body, or a malformed XML root — scores zero. Partial credit doesn't exist on this signal: either agents can read your map or they can't.
What it is
An XML sitemap is a list of canonical URLs your storefront wants crawled. Two valid root shapes:
Small catalogs (<500 products)
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://store.com/</loc>
<lastmod>2026-04-30</lastmod>
</url>
<url>
<loc>https://store.com/products/foo</loc>
<lastmod>2026-04-30</lastmod>
</url>
</urlset>
Bigger catalogs (Shopify default)
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap>
<loc>https://store.com/sitemap_products_1.xml</loc>
</sitemap>
<sitemap>
<loc>https://store.com/sitemap_collections_1.xml</loc>
</sitemap>
</sitemapindex>
Default Shopify (Online Store 2.0 or earlier) emits the sitemapindex shape and partitions sub-sitemaps by content type: products, collections, pages, and blog posts, each as a separate file. The index references them by absolute URL. The product sub-sitemap further splits at 5,000 URLs per file — a 50,000-product catalog has ten sub-sitemaps under the index. Agents follow the references automatically; you don't manage the partitioning.
The 50 MB / 50,000-URL hard limit per file is from the sitemaps.org protocol. Most stores never hit it, but a 100k-SKU multi-locale catalog absolutely will if you flatten everything into one urlset — which is the most common failure mode on custom-built front-ends.
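The partitioning described above can be sketched in a few lines. This is illustrative, not Shopify's implementation: the chunk size and `sitemap_products_N.xml` naming mirror Shopify's default, and `buildSitemapIndex` is a hypothetical helper.

```typescript
// Sketch: partition a flat product-URL list into protocol-safe sub-sitemaps
// and emit the index that references them. 50,000 URLs per file is the
// sitemaps.org hard cap; Shopify's default splits much earlier, at 5,000.
const URLS_PER_FILE = 5000;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

function buildSitemapIndex(
  base: string,
  productUrls: string[]
): { index: string; files: string[][] } {
  const files = chunk(productUrls, URLS_PER_FILE);
  const refs = files
    .map((_, i) => `  <sitemap><loc>${base}/sitemap_products_${i + 1}.xml</loc></sitemap>`)
    .join("\n");
  const index =
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${refs}\n</sitemapindex>`;
  return { index, files };
}
```

A 12,000-product catalog comes out as three sub-sitemaps under one index; no single file ever approaches the protocol limit.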
Why AI shopping agents care
- Discovery without spidering. The alternative to a sitemap is following links from your homepage outward — a process that misses long-tail collection pages, paginated product archives, and any URL only reachable through faceted nav. For a 10,000-SKU catalog, the difference between sitemap-driven indexing and link-following is measured in weeks of latency and meaningful gaps in long-tail coverage.
- Lastmod for incremental refresh. Each <lastmod> tells the agent when the URL last changed. Agents prioritize re-fetching URLs whose lastmod moved since their last crawl — your seasonal launch, sale, or restocked SKU surfaces in agent results within hours instead of weeks.
- Confidence weighting. AI rankers give a "this surface is well-maintained" prior to stores with healthy sitemaps. A missing or empty sitemap, conversely, hints at abandonment or a hostile-to-crawl posture — both of which downweight your store in tied ranking decisions.
- Cross-surface discovery. Your sitemap also feeds Google Merchant Center auto-discovery, Bing Webmaster Tools, and the Shopify Global Catalog ingestion path. One file in good shape feeds half a dozen downstream surfaces; one file broken takes them all out.
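From the agent's side, the lastmod mechanics above reduce to a filter. The entry shape is an assumption for illustration; real crawlers keep richer state.

```typescript
// Sketch of agent-side incremental refresh: re-fetch only URLs whose
// <lastmod> moved past the last crawl date. ISO 8601 date strings
// compare correctly as plain strings, so no Date parsing is needed.
interface SitemapEntry {
  loc: string;
  lastmod?: string; // e.g. "2026-04-30"; optional per the protocol
}

function urlsToRefetch(entries: SitemapEntry[], lastCrawl: string): string[] {
  return entries
    // A URL without lastmod can't be skipped; it stays in the queue.
    .filter((e) => !e.lastmod || e.lastmod > lastCrawl)
    .map((e) => e.loc);
}
```

This is exactly why a lastmod stuck at the deploy date (mistake 5 below) is so costly: every URL looks changed on every crawl, and the filter stops saving anyone any work.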
How to test it on your store
Two curl commands are enough. Open a terminal:
curl -sI https://yourstore.com/sitemap.xml | head -3
curl -s https://yourstore.com/sitemap.xml | head -20
Three things to verify:
- HTTP 200, not 301/302/404/403. A redirect from the bare domain to www. (or vice versa) costs an extra hop on every crawl and is a half-credit footgun on low-reliability bots. A 404 or 403 is a zero.
- Content-Type: application/xml or text/xml. If your CDN is returning Content-Type: text/html with the XML body — usually because a Cloudflare worker stripped headers — strict parsers reject it.
- Body starts with <?xml followed by <urlset or <sitemapindex. If you see <!DOCTYPE html>, your route is serving the 404 page or a login wall and the XML is gone.
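If you'd rather script the audit, the three checks bundle into one function. A minimal sketch; the failure messages are illustrative, and a real audit would feed it the values from an actual fetch.

```typescript
// Sketch: the same three checks the curl commands surface, as one function.
// Pass in the HTTP status, the Content-Type header, and the response body.
function sitemapCheck(status: number, contentType: string, body: string): string[] {
  const failures: string[] = [];
  // Check 1: a clean 200, no redirects or errors.
  if (status !== 200) failures.push(`expected HTTP 200, got ${status}`);
  // Check 2: an XML content type (charset suffixes are fine).
  if (!/^(application|text)\/xml/.test(contentType)) {
    failures.push(`bad Content-Type: ${contentType}`);
  }
  // Check 3: an XML declaration and one of the two valid root elements.
  const trimmed = body.trimStart();
  if (!trimmed.startsWith("<?xml")) {
    failures.push("body does not start with <?xml");
  } else if (!/<(urlset|sitemapindex)[\s>]/.test(trimmed)) {
    failures.push("no <urlset> or <sitemapindex> root");
  }
  return failures; // empty array = all three checks pass
}
```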
The free CatalogScan scan runs all three of these checks plus a deep-fetch into one of the sub-sitemaps if your root is a sitemapindex — so you find out if the partition you depend on is actually populated, not just referenced.
How to fix it
If you're on the standard Shopify storefront (Dawn, Trade, or any Online Store 2.0 theme), the sitemap is auto-generated from your products, collections, pages, and blog posts. It updates within minutes of any catalog change. The only way to break it on default Shopify is to set a storefront_password — that returns the password gate at every URL including /sitemap.xml. Remove the password.
Hydrogen 2.0 ships with a sitemap.[type].xml route convention. Add a top-level app/routes/sitemap.xml.tsx that returns a sitemapindex referencing per-type sub-sitemaps, plus per-type files (app/routes/sitemap.products.xml.tsx, sitemap.collections.xml.tsx, sitemap.pages.xml.tsx) that each fetch the relevant Storefront API connection and emit one <url> per node. The Hydrogen template repo has an example in examples/sitemap you can copy directly. Don't try to flatten everything into one file — past 5,000 products you'll hit the protocol limit.
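A minimal sketch of that index route, assuming the per-type routes exist at sitemap.products.xml and friends. The XML builder is split out of the loader so it stays testable; adapt the type list to whatever your store actually publishes.

```typescript
// Sketch of app/routes/sitemap.xml.tsx: emit a sitemapindex referencing
// the per-type sub-sitemap routes. The type list is an assumption.
const TYPES = ["products", "collections", "pages"] as const;

function buildIndexXml(origin: string): string {
  const refs = TYPES
    .map((t) => `  <sitemap><loc>${origin}/sitemap.${t}.xml</loc></sitemap>`)
    .join("\n");
  return (
    `<?xml version="1.0" encoding="UTF-8"?>\n` +
    `<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n` +
    `${refs}\n</sitemapindex>`
  );
}

// In the route file, the loader wraps this in a Response with an XML
// content type (shape sketched here, not executed):
// export async function loader({ request }: LoaderFunctionArgs) {
//   return new Response(buildIndexXml(new URL(request.url).origin), {
//     headers: { "Content-Type": "application/xml" },
//   });
// }
```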
On a Next.js App Router project: add app/sitemap.ts exporting a default async function that returns the MetadataRoute.Sitemap shape (Next 13.4+). Inside, page through products and collections on the Storefront API in batches of 250. For catalogs over 5,000 products, switch to a sitemapindex via Next's generateSitemaps shape — it generates sitemap-0.xml, sitemap-1.xml automatically and emits the index. Don't forget to set revalidate = 3600 on the route — agents re-fetch frequently and you don't want every fetch to round-trip to Shopify.
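A sketch of the small-catalog shape. fetchAllProducts is a hypothetical helper that does the 250-per-page Storefront API paging; the entry-building logic is pulled into its own function so it can be exercised without Next or network.

```typescript
// Sketch of app/sitemap.ts for a catalog under 5,000 products.
type Product = { handle: string; updatedAt: string };

export const revalidate = 3600; // cache the generated sitemap for an hour

function toSitemapEntries(base: string, products: Product[]) {
  return [
    { url: `${base}/`, lastModified: new Date() },
    ...products.map((p) => ({
      url: `${base}/products/${p.handle}`,
      // Real per-URL change date from the API, not the build timestamp.
      lastModified: new Date(p.updatedAt),
    })),
  ];
}

// The actual route export (sketched, not executed):
// export default async function sitemap() {
//   const products = await fetchAllProducts(); // hypothetical pager
//   return toSitemapEntries("https://yourstore.com", products);
// }
```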
Cheapest restoration path: have your edge layer proxy /sitemap.xml back to the Shopify-managed origin, e.g. myshop.myshopify.com/sitemap.xml. The catch: Shopify's emitted URLs use the .myshopify.com domain, not your custom domain. Either run a regex rewrite on the response body (Cloudflare Workers, Vercel rewrites, Caddy's replace-response module) or skip the proxy and emit your own sitemap from Storefront API data. The proxy-with-rewrite is faster to ship; the from-scratch sitemap gives you canonical URLs you actually control.
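The response-body rewrite that proxy path needs is a one-function job. A sketch with placeholder domains; in a Cloudflare worker you'd apply it to the proxied response text before returning it.

```typescript
// Sketch: swap the .myshopify.com host in a proxied sitemap body for the
// custom domain. Domains here are placeholders.
function rewriteSitemapHost(xml: string, from: string, to: string): string {
  // Escape regex metacharacters in the domain (the dots), then replace globally.
  const escaped = from.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return xml.replace(new RegExp(`https://${escaped}`, "g"), `https://${to}`);
}
```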
5 mistakes we keep finding
1. The sub-sitemap exists but the index doesn't reference it
Headless rebuilds frequently emit /sitemap_products.xml for the products sub-sitemap, but the root /sitemap.xml still points at the old default Shopify URLs (/sitemap_products_1.xml) which now 404. Agents fetch the index, follow the broken references, give up. View the index body and verify every <loc> child returns 200.
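Walking every <loc> in the index is easy to script. A sketch: the extraction is a deliberately simple regex parse (a production check would use a real XML parser), and the fetch loop is left as a comment since it needs network.

```typescript
// Sketch: pull every <loc> out of a sitemapindex body so each referenced
// sub-sitemap can be fetched and checked for a 200.
function extractLocs(indexXml: string): string[] {
  return [...indexXml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
}

// Usage against a live index (network, so commented out):
// for (const loc of extractLocs(body)) {
//   const res = await fetch(loc, { method: "HEAD" });
//   if (res.status !== 200) console.error(`broken reference: ${loc} -> ${res.status}`);
// }
```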
2. Dev-store password gate forgotten before launch
On a fresh Shopify dev store you set a password under Online Store → Preferences. The password gate intercepts every URL — including /sitemap.xml, /products.json, and /robots.txt — and returns an HTML login page. Agents fetch, get HTML, score zero on every floor signal at once. Remove the password the day you launch (and verify with the curl above).
3. Sitemap excludes collection pages
Some headless rebuilds emit only product URLs and skip collections, blog posts, and editorial pages. Collections are how AI agents discover your taxonomy — "best running shoes for flat feet" matches a collection page first, then drills down. Without collections in the sitemap, you compete only on individual-product matches and lose every category-level query. Include all four content types: products, collections, pages, blog posts.
4. Robots.txt blocks the sitemap path
Less common but extremely costly: a Disallow: /sitemap line in /robots.txt tells crawlers not to fetch the file. Your sitemap is technically perfect; nobody can read it. Always cross-check robots.txt and the sitemap against each other — the robots-not-blocking signal is worth a quick visit if you're working through these.
5. Lastmod stuck at the deploy date
Some custom sitemaps emit <lastmod> as the build/deploy timestamp instead of the actual last-modified date of each URL. After the first deploy you get a sitemap that says every URL changed today, every day, forever. Agents stop trusting lastmod and fall back to a slow re-crawl heuristic. Emit lastmod from the product's updated_at on Storefront API, not from your build pipeline.
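Deriving lastmod from the record instead of the build clock is one line. A sketch, assuming the Storefront API's ISO 8601 updatedAt timestamp as input:

```typescript
// Sketch: derive <lastmod> from the product's updatedAt, never from the
// build timestamp. W3C Datetime allows a plain date, so truncating the
// ISO 8601 timestamp to YYYY-MM-DD is valid.
function lastmodFromUpdatedAt(updatedAt: string): string {
  return new Date(updatedAt).toISOString().slice(0, 10);
}
```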
See also
- The 15 signals — full reference
- /products.json: the AI bulk-ingest feed (the second-largest discovery surface after sitemap)
- Product JSON-LD on PDPs (what agents read when they fetch the URLs your sitemap discovers)
- Full 18-signal Agentic Storefronts checklist
- ProductGroup JSON-LD: the next signal up after sitemap is in good shape
- Leaderboard: 100 DTC stores scored on sitemap and 14 other signals
Is your sitemap actually live?
Free 2-minute scan. We fetch your /sitemap.xml, parse the XML, walk the index, and score the result alongside 14 other AI-shopping signals.