Shopify robots.txt for AI shopping agents
There is exactly one signal where a single text-file edit by a teammate at 4 PM on a Friday can make your entire store invisible to every AI shopping agent on the planet by Monday morning. That signal is /robots.txt. Get every other signal perfect — flawless Product JSON-LD, complete GTIN coverage, an industry-leading sitemap — and one stray Disallow: / renders all of it irrelevant. Crawlers honour robots.txt before they fetch anything else, so when the file says no, no fetch happens, no signals get scored, no products get surfaced. This is the most fixable signal CatalogScan scores, and also the one most likely to fail catastrophically and silently — there is no error page, no Shopify alert, no Slack ping. Your store just stops appearing in answers.
To pass, https://yourstore.com/robots.txt must return HTTP 200 over HTTPS (or 404 — Shopify treats an absent robots.txt as fully open) and, if a body is present, it must contain no blanket `Disallow: /`, no `Disallow: /products`, no `Disallow: /products.json`, and no `Disallow: /collections` inside the `User-agent: *` rule set, and no `Disallow` rule targeted at AI-agent user agents (GPTBot, PerplexityBot, Google-Extended, ClaudeBot, Applebot-Extended) anywhere in the file. A 403 — common when Cloudflare's "Bot Fight Mode" is on — also fails the signal, because aggressive bot interception catches legitimate AI crawlers too.
What it is
Robots Exclusion Protocol (the formal name for robots.txt) is the original, plain-text contract between a website and the crawlers that visit it. The file lives at the root path /robots.txt and consists of one or more rule blocks, each starting with a User-agent: declaration followed by Allow: and Disallow: lines. Crawlers fetch this file before they fetch anything else and obey whichever rule block matches their user-agent string.
What you want
```txt
User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /checkout
Disallow: /orders
Disallow: /account
Disallow: /*?sort_by*
Sitemap: https://store.com/sitemap.xml
```
What we still find
```txt
User-agent: *
Disallow: /
```
Shopify's default robots.txt blocks the few paths that should never be crawled — admin, checkout, cart, account — and explicitly allows everything else. The faceted-nav blockers (Disallow: /*?sort_by*) are also good defaults because every ?sort_by permutation is content-equivalent to the unsorted page, and serving 50 sorted variants of the same collection wastes crawl budget. Everything else is open, every other Shopify floor signal becomes scoreable, and every AI agent can read your catalog.
The bad shape above is rarer than it used to be (Shopify's password-gate workflow has improved), but we still see it on stores that copied a snippet from a generic SEO blog post written for static-site Jekyll deployments, on stores that flipped `storefront_password` on for a "stealth launch" and never flipped it back, and on Hydrogen rebuilds where the developer copied a default Vercel robots.txt generator scaffold without thinking about the catalog impact.
The AI-agent crawler user-agents to know
| Agent | User-agent string | Used for |
|---|---|---|
| OpenAI | GPTBot | Training + ChatGPT Shopping product retrieval |
| OpenAI search | OAI-SearchBot | SearchGPT live retrieval (separate from training) |
| OpenAI in-prod | ChatGPT-User | Per-conversation fetch when a user clicks a citation |
| Perplexity | PerplexityBot | Index + Perplexity Shopping retrieval |
| Anthropic | ClaudeBot · Claude-Web | Claude tool-use catalog retrieval |
| Google AI | Google-Extended | Gemini, AI Overview, AI Mode (separate from Googlebot) |
| Apple AI | Applebot-Extended | Apple Intelligence shopping queries |
| Bing AI | bingbot (with nocache) | Copilot answers (Bing's main bot doubles as the AI surface) |
If your robots.txt singles out any of these for Disallow: / — even with the rest of the file fully open — the corresponding AI shopping surface skips your catalog. Allow them all. A blanket User-agent: * with no targeted blocks is the cleanest posture and what we score for full credit.
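To make the failure mode concrete, here is a hypothetical robots.txt illustrating the shape described above — it reads as open at a glance, yet the targeted block removes the store from ChatGPT Shopping entirely while every other agent gets in:

```txt
# Looks open — but the targeted block below opts this store
# out of ChatGPT Shopping entirely
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```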
Why AI shopping agents care
- **Pre-flight check, hard-fail.** Every well-behaved AI crawler fetches `/robots.txt` first. A failed fetch (403, timeout) or a body that disallows the path it wanted means the agent does not fetch — full stop. Your Product JSON-LD is impeccable, your sitemap is perfect, and none of it gets scored, because no crawl ever happens.
- **Targeted blocks compound across surfaces.** Some operators block `GPTBot` while leaving everything else open, hoping to opt out of LLM training. The same block also opts you out of ChatGPT Shopping product retrieval — separate use, same crawler. The unintended consequence: Perplexity, Google AI, and Claude still see you, but the largest AI shopping surface does not.
- **Trust signal in tied rankings.** A clean, well-formed robots.txt that explicitly welcomes the major AI agents reads as "this brand wants to be found by AI shoppers" — a positive prior on tied ranking decisions. A robots.txt with a long list of `Disallow`s feels defensive, even when no rule actively blocks the bot.
- **Sitemap discovery anchor.** Robots.txt is also where you advertise your sitemap. The `Sitemap:` directive points crawlers at `/sitemap.xml`; missing it doesn't kill discovery (most agents check the conventional path anyway) but it's a one-line confidence add.
How to test it on your store
Two curls. From any terminal:
```sh
curl -sI https://yourstore.com/robots.txt | head -3
curl -s https://yourstore.com/robots.txt
```
Three things to verify:
- **HTTP 200 (or 404).** A 200 is the normal case; a 404 means no robots.txt exists, which is fully permissive (every URL is crawlable). A 403 is the bug — usually Cloudflare Bot Fight Mode or a WAF rule intercepting bot traffic. A 301/302 earns only half credit, because low-reliability bots don't follow redirects; serve the file at the canonical hostname.
- **No blanket Disallow.** Read the body. Inside the `User-agent: *` block, there must be no `Disallow: /`, no `Disallow: /products`, no `Disallow: /products.json`, no `Disallow: /collections`, no `Disallow: /pages`, no `Disallow: /blogs`. `Disallow: /admin`, `/checkout`, `/cart`, `/account`, and `/*?sort_by*` are all fine.
- **No targeted AI-bot blocks.** Search the body for `GPTBot`, `PerplexityBot`, `Google-Extended`, `ClaudeBot`, `Applebot-Extended`. If any of them has its own `User-agent:` rule with `Disallow: /`, that's a deliberate (or copy-pasted) block. Either remove the rule or change it to `Allow: /`. A combined check is sketched below.
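If you'd rather run the three checks in one pass, here's a minimal shell sketch — `yourstore.com` is a placeholder, and the greps only surface candidates for a human read; they don't parse rule-block scope:

```sh
URL=https://yourstore.com/robots.txt

# 1. Status code — expect 200 (or 404); 403 means a WAF is intercepting
curl -s -o /dev/null -w 'status: %{http_code}\n' "$URL"

# 2. Blanket blocks — any output here deserves a close look
curl -s "$URL" | grep -nE '^Disallow: /($|products|collections|pages|blogs)'

# 3. Targeted AI-bot rules — any hit means an agent has its own rule block
curl -s "$URL" | grep -niE 'GPTBot|PerplexityBot|Google-Extended|ClaudeBot|Applebot-Extended'
```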
The free CatalogScan scan checks all three plus a deeper "what would each major bot get if it tried to fetch /products.json right now" simulation — so a User-agent: * open posture with a hidden Cloudflare WAF challenge gets caught even when the static text reads fine.
How to fix it
robots.txt.liquid in the theme · 5 min · free
Shopify generates robots.txt from a Liquid template. To override it, create `templates/robots.txt.liquid` in your theme code. The recommended starting point is the Shopify-documented default — copy it verbatim, then add only what you need on top. Most stores need to change zero rules. To explicitly welcome AI agents (recommended), add this block at the top:
```liquid
{%- comment -%} Welcome AI shopping agents explicitly {%- endcomment -%}
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /
```
Then ensure the `User-agent: *` block keeps the default Shopify Disallow set (admin, checkout, cart, orders, account, the sort permutations) and nothing else; a combined sketch follows.
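Putting it together, a sketch of the full template. The `robots.default_groups` loop is Shopify's documented way to render its standard rules from Liquid, so the default Disallow set survives the override:

```liquid
{%- comment -%} templates/robots.txt.liquid {%- endcomment -%}
User-agent: GPTBot
Allow: /
{%- comment -%} ...repeat the Allow block for each agent listed above... {%- endcomment -%}

{%- comment -%} Shopify's default rules, rendered verbatim {%- endcomment -%}
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules -%}
    {{ rule }}
  {%- endfor -%}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif -%}
{% endfor %}
```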
storefront_password · 2 min · free
Shopify Admin → Online Store → Preferences → Password protection. Toggle it off. The password gate intercepts every request including `/robots.txt` and returns the gate HTML with HTTP 200 — which fails this signal silently. Re-curl after toggling and verify you see the actual robots.txt body.
app/routes/robots[.]txt.tsx on Hydrogen
Add `app/routes/robots[.]txt.tsx` (the brackets escape the dot in Remix's filename convention). Return a plain-text `Response` with the rule body. Mirror the Shopify default at minimum; add the explicit AI-agent Allow block on top. Remember to set `Content-Type: text/plain` in the response headers — some Hydrogen scaffolds default to `text/html`, which strict parsers reject. Cache at the edge for 1 hour (`Cache-Control: public, max-age=3600`); the file rarely changes and you don't want every bot fetch round-tripping to your origin.
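A minimal sketch of that route, assuming a standard Hydrogen (Remix) setup — the rule body here is illustrative, so mirror your real rule set:

```tsx
// app/routes/robots[.]txt.tsx — served at /robots.txt
export async function loader() {
  const body = [
    'User-agent: *',
    'Disallow: /admin',
    'Disallow: /cart',
    'Disallow: /checkout',
    'Disallow: /account',
    '',
    'User-agent: GPTBot', // repeat an Allow block per AI agent
    'Allow: /',
    '',
    'Sitemap: https://yourstore.com/sitemap.xml',
  ].join('\n');

  return new Response(body, {
    headers: {
      'Content-Type': 'text/plain',            // not text/html
      'Cache-Control': 'public, max-age=3600', // 1 hour at the edge
    },
  });
}
```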
app/robots.ts (Next.js 13+) · 10 min · free
Next.js 13+ has a first-party file convention. Create `app/robots.ts` exporting a default function returning the `MetadataRoute.Robots` shape: `{ rules: [{ userAgent: '*', allow: '/', disallow: ['/admin', '/checkout', '/cart', '/account'] }, ...], sitemap: 'https://store.com/sitemap.xml' }`. The framework emits a valid robots.txt at the canonical path automatically. Add separate `{ userAgent: 'GPTBot', allow: '/' }` entries for each AI agent if you want explicit welcomes.
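A sketch of that convention file — the domain and the exact disallow list are placeholders to adapt:

```ts
// app/robots.ts — Next.js emits /robots.txt from this shape
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default-open posture with the Shopify-style private paths blocked
      { userAgent: '*', disallow: ['/admin', '/checkout', '/cart', '/account'] },
      // Explicit welcomes — one entry per AI agent you want to admit
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'PerplexityBot', allow: '/' },
    ],
    sitemap: 'https://store.com/sitemap.xml',
  };
}
```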
Cloudflare carve-outs
Three Cloudflare features can intercept legitimate AI crawlers and silently fail this signal:
1. **Bot Fight Mode** — disable it for the storefront zone, or carve a Skip → "All Bot Fight Mode features" rule for known AI user agents.
2. **AI Scrapers and Crawlers managed rule** — Cloudflare's "block AI bots" toggle (Security → Bots → AI Scrapers); turn it off if you want AI shopping surfaces to see your catalog.
3. **Custom WAF rules with `cf.client.bot` in the expression** — review them and exclude the AI-agent UAs.
Test with `curl -A "GPTBot/1.0 (+https://openai.com/gptbot)" https://yourstore.com/robots.txt` — if you get the file, you're clear; if you get a 403 or challenge HTML, you're blocking.
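To check every agent in one go, a small loop over the UA tokens from the table above — bare tokens rather than full UA lines, which is close enough for a WAF smoke test:

```sh
for ua in GPTBot OAI-SearchBot ChatGPT-User PerplexityBot ClaudeBot Google-Extended Applebot-Extended; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -A "$ua" https://yourstore.com/robots.txt)
  echo "$ua -> $code"   # 200 everywhere = clear; 403 = that agent is being challenged
done
```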
5 mistakes we keep finding
1. The storefront_password the founder forgot they set
The single most common cause of a failed robots.txt signal on a brand-new launch. The founder turned on a password during pre-launch testing, finished the site, told the world the URL, and forgot to flip it back off. Every URL on the storefront — including /robots.txt, /sitemap.xml, /products.json — returns the password gate as HTTP 200 HTML. Three floor signals fail simultaneously. The fix is one toggle.
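A quick way to spot the gate, assuming your store is at yourstore.com: a real robots.txt starts with plain-text directives, while the password gate returns HTML.

```sh
# First line of a healthy robots.txt is a directive or comment, not markup
curl -s https://yourstore.com/robots.txt | head -1
# "User-agent: *"   -> healthy
# "<!doctype html>" -> the password gate is intercepting every path
```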
2. Cloudflare Bot Fight Mode auto-blocking AI agents
Bot Fight Mode is on by default for many free Cloudflare zones. It serves a JavaScript challenge to anything that looks bot-like — and "looks bot-like" includes every AI shopping crawler that sets a normal UA. The robots.txt body is fine; the WAF is the problem. Cloudflare added an "AI Scrapers and Crawlers" managed rule in 2024 that defaults to off for new zones but on for some legacy ones; verify by reading the active rules under Security → Bots, not by reading robots.txt.
3. Targeted User-agent: GPTBot block left over from "no AI training" stance
In late 2023, a wave of "block GPTBot to opt out of LLM training" advice landed in operator newsletters. Stores added the rule and never revisited it. By 2026, GPTBot also drives ChatGPT Shopping product retrieval — same crawler, different surface. The rule that opted out of training also opted out of being a candidate when a user asks ChatGPT "where can I buy X." Decide deliberately: if you want AI shopping traffic, Allow: / for these agents.
4. Headless rebuild copied a generic robots.txt scaffold
The Next.js, Astro, and Vercel "generate a robots.txt" tutorials almost universally show User-agent: * + Disallow: /admin + Disallow: /api as the example. Fine for a generic web app — wrong for a Shopify replacement, where you also need to block /checkout, /cart, /account, /orders, the sort-permutation trap, and ideally welcome AI agents explicitly. Mirror the Shopify default at minimum; don't trust generic scaffolds.
5. Disallow: /products as a "stealth launch" pattern
Some operators block /products during a soft-launch period, intending to flip it open at launch. The flip never happens, or happens but the CDN cache holds the old robots.txt for 24 hours, or someone copies the staging robots.txt to production by mistake. Either way: the entire catalog is invisible while every other ranking signal is in good shape. Use storefront_password for stealth launches, not Disallow; the password gate is unambiguous, and removing it is one toggle that you can verify with curl in 5 seconds.
See also
- The 15 signals — full reference
- Sitemap.xml: the discovery surface AI agents read first (the second floor signal that fails the same way when `storefront_password` kills it)
- /products.json: the AI bulk-ingest feed (the third file that fails when robots.txt blocks it)
- Product JSON-LD on PDPs (the signal that produces zero value if robots.txt blocks the PDP fetch)
- The full 18-signal Agentic Storefronts checklist
- Leaderboard: 100 DTC stores scored on robots.txt and 14 other signals
Is your robots.txt actually open?
Free 2-minute scan. We fetch /robots.txt with each major AI-agent UA, parse the rule blocks, and flag any blocker before it costs you a single AI-shopping placement.