Shopify robots.txt for AI shopping agents
There is exactly one signal where a single text-file edit by a teammate at 4 PM on a Friday can make your entire store invisible to every AI shopping agent on the planet by Monday morning. That signal is /robots.txt. Get every other signal perfect — flawless Product JSON-LD, complete GTIN coverage, an industry-leading sitemap — and one stray Disallow: / renders all of it irrelevant. Crawlers honour robots.txt before they fetch anything else, so when the file says no, no fetch happens, no signals get scored, no products get surfaced. This is the most fixable signal CatalogScan scores, and also the one most likely to fail catastrophically and silently — there is no error page, no Shopify alert, no Slack ping. Your store just stops appearing in answers.
To pass, https://yourstore.com/robots.txt must return HTTP 200 over HTTPS (or 404 — Shopify treats an absent robots.txt as fully open) and, if a body is present, it must contain no blanket `Disallow: /`, no `Disallow: /products`, no `Disallow: /products.json`, and no `Disallow: /collections` inside the `User-agent: *` rule set, and no `Disallow` rule targeted at AI-agent user agents (GPTBot, PerplexityBot, Google-Extended, ClaudeBot, Applebot-Extended) anywhere in the file. A 403 — common when Cloudflare's "Bot Fight Mode" is on — also fails the signal, because aggressive bot interception catches legitimate AI crawlers too.
What it is
Robots Exclusion Protocol (the formal name for robots.txt) is the original, plain-text contract between a website and the crawlers that visit it. The file lives at the root path /robots.txt and consists of one or more rule blocks, each starting with a User-agent: declaration followed by Allow: and Disallow: lines. Crawlers fetch this file before they fetch anything else and obey whichever rule block matches their user-agent string.
What you want
```txt
User-agent: *
Disallow: /admin
Disallow: /cart
Disallow: /checkout
Disallow: /orders
Disallow: /account
Disallow: /*?sort_by*
Sitemap: https://store.com/sitemap.xml
```
What we still find
```txt
User-agent: *
Disallow: /
```
Shopify's default robots.txt blocks the few paths that should never be crawled — admin, checkout, cart, account — and explicitly allows everything else. The faceted-nav blockers (Disallow: /*?sort_by*) are also good defaults because every ?sort_by permutation is content-equivalent to the unsorted page, and serving 50 sorted variants of the same collection wastes crawl budget. Everything else is open, every other Shopify floor signal becomes scoreable, and every AI agent can read your catalog.
The bad shape above is rarer than it used to be (Shopify's password-gate workflow has improved), but we still see it on stores that copied a snippet from a generic SEO blog post written for static-site Jekyll deployments, on stores that flipped `storefront_password` on for a "stealth launch" and never flipped it back, and on Hydrogen rebuilds where the developer copied a default Vercel robots.txt generator scaffold without thinking about the catalog impact.
The AI-agent crawler user-agents to know
| Agent | User-agent string | Used for |
|---|---|---|
| OpenAI | GPTBot | Training + ChatGPT Shopping product retrieval |
| OpenAI search | OAI-SearchBot | SearchGPT live retrieval (separate from training) |
| OpenAI in-prod | ChatGPT-User | Per-conversation fetch when a user clicks a citation |
| Perplexity | PerplexityBot | Index + Perplexity Shopping retrieval |
| Anthropic | ClaudeBot · Claude-Web | Claude tool-use catalog retrieval |
| Google AI | Google-Extended | Gemini, AI Overview, AI Mode (separate from Googlebot) |
| Apple AI | Applebot-Extended | Apple Intelligence shopping queries |
| Bing AI | bingbot (with nocache) | Copilot answers (Bing's main bot doubles as the AI surface) |
If your robots.txt singles out any of these for Disallow: / — even with the rest of the file fully open — the corresponding AI shopping surface skips your catalog. Allow them all. A blanket User-agent: * with no targeted blocks is the cleanest posture and what we score for full credit.
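To make the failure mode concrete, here is a hypothetical robots.txt illustrating the shape described above — it reads as open at a glance, yet the targeted block removes the store from ChatGPT Shopping entirely while every other agent gets in:

```txt
# Looks open — but the targeted block below opts this store
# out of ChatGPT Shopping entirely
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
```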
Why AI shopping agents care
- **Pre-flight check, hard-fail.** Every well-behaved AI crawler fetches `/robots.txt` first. A failed fetch (403, timeout) or a body that disallows the path it wanted means the agent does not fetch — full stop. Your Product JSON-LD is impeccable, your sitemap is perfect, and none of it gets scored, because no crawl ever happens.
- **Targeted blocks compound across surfaces.** Some operators block `GPTBot` while leaving everything else open, hoping to opt out of LLM training. The same block also opts you out of ChatGPT Shopping product retrieval — separate use, same crawler. The unintended consequence: Perplexity, Google AI, and Claude still see you, but the largest AI shopping surface does not.
- **Trust signal in tied rankings.** A clean, well-formed robots.txt that explicitly welcomes the major AI agents reads as "this brand wants to be found by AI shoppers" — a positive prior on tied ranking decisions. A robots.txt with a long list of `Disallow`s feels defensive, even when no rule actively blocks the bot.
- **Sitemap discovery anchor.** Robots.txt is also where you advertise your sitemap. The `Sitemap:` directive points crawlers at `/sitemap.xml`; missing it doesn't kill discovery (most agents check the conventional path anyway) but it's a one-line confidence add.
How to test it on your store
Two curls. From any terminal:
```sh
curl -sI https://yourstore.com/robots.txt | head -3
curl -s https://yourstore.com/robots.txt
```
Three things to verify:
- **HTTP 200 (or 404).** A 200 is the normal case; a 404 means no robots.txt exists, which is fully permissive (every URL is crawlable). A 403 is the bug — usually Cloudflare Bot Fight Mode or a WAF rule intercepting bot traffic. A 301/302 earns only half credit, because low-reliability bots don't follow redirects; serve the file at the canonical hostname.
- **No blanket Disallow.** Read the body. Inside the `User-agent: *` block, there must be no `Disallow: /`, no `Disallow: /products`, no `Disallow: /products.json`, no `Disallow: /collections`, no `Disallow: /pages`, no `Disallow: /blogs`. `Disallow: /admin`, `/checkout`, `/cart`, `/account`, and `/*?sort_by*` are all fine.
- **No targeted AI-bot blocks.** Search the body for `GPTBot`, `PerplexityBot`, `Google-Extended`, `ClaudeBot`, `Applebot-Extended`. If any of them has its own `User-agent:` rule with `Disallow: /`, that's a deliberate (or copy-pasted) block. Either remove the rule or change it to `Allow: /`. A combined check is sketched below.
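If you'd rather run the three checks in one pass, here's a minimal shell sketch — `yourstore.com` is a placeholder, and the greps only surface candidates for a human read; they don't parse rule-block scope:

```sh
URL=https://yourstore.com/robots.txt

# 1. Status code — expect 200 (or 404); 403 means a WAF is intercepting
curl -s -o /dev/null -w 'status: %{http_code}\n' "$URL"

# 2. Blanket blocks — any output here deserves a close look
curl -s "$URL" | grep -nE '^Disallow: /($|products|collections|pages|blogs)'

# 3. Targeted AI-bot rules — any hit means an agent has its own rule block
curl -s "$URL" | grep -niE 'GPTBot|PerplexityBot|Google-Extended|ClaudeBot|Applebot-Extended'
```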
The free CatalogScan scan checks all three plus a deeper "what would each major bot get if it tried to fetch /products.json right now" simulation — so a User-agent: * open posture with a hidden Cloudflare WAF challenge gets caught even when the static text reads fine.
How to fix it
robots.txt.liquid in the theme · 5 min · free
Shopify generates robots.txt from a Liquid template. To override it, create `templates/robots.txt.liquid` in your theme code. The recommended starting point is the Shopify-documented default — copy it verbatim, then add only what you need on top. Most stores need to change zero rules. To explicitly welcome AI agents (recommended), add this block at the top:
```liquid
{%- comment -%} Welcome AI shopping agents explicitly {%- endcomment -%}
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: Applebot-Extended
Allow: /
```
Then ensure the `User-agent: *` block keeps the default Shopify Disallow set (admin, checkout, cart, orders, account, the sort permutations) and nothing else; a combined sketch follows.
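Putting it together, a sketch of the full template. The `robots.default_groups` loop is Shopify's documented way to render its standard rules from Liquid, so the default Disallow set survives the override:

```liquid
{%- comment -%} templates/robots.txt.liquid {%- endcomment -%}
User-agent: GPTBot
Allow: /
{%- comment -%} ...repeat the Allow block for each agent listed above... {%- endcomment -%}

{%- comment -%} Shopify's default rules, rendered verbatim {%- endcomment -%}
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules -%}
    {{ rule }}
  {%- endfor -%}
  {%- if group.sitemap != blank %}
    {{ group.sitemap }}
  {%- endif -%}
{% endfor %}
```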
storefront_password · 2 min · free
Shopify Admin → Online Store → Preferences → Password protection. Toggle it off. The password gate intercepts every request including `/robots.txt` and returns the gate HTML with HTTP 200 — which fails this signal silently. Re-curl after toggling and verify you see the actual robots.txt body.
app/routes/robots[.]txt.tsx on Hydrogen
Add `app/routes/robots[.]txt.tsx` (the brackets escape the dot in Remix's filename convention). Return a plain-text `Response` with the rule body. Mirror the Shopify default at minimum; add the explicit AI-agent Allow block on top. Remember to set `Content-Type: text/plain` in the response headers — some Hydrogen scaffolds default to `text/html`, which strict parsers reject. Cache at the edge for 1 hour (`Cache-Control: public, max-age=3600`); the file rarely changes and you don't want every bot fetch round-tripping to your origin.
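A minimal sketch of that route, assuming a standard Hydrogen (Remix) setup — the rule body here is illustrative, so mirror your real rule set:

```tsx
// app/routes/robots[.]txt.tsx — served at /robots.txt
export async function loader() {
  const body = [
    'User-agent: *',
    'Disallow: /admin',
    'Disallow: /cart',
    'Disallow: /checkout',
    'Disallow: /account',
    '',
    'User-agent: GPTBot', // repeat an Allow block per AI agent
    'Allow: /',
    '',
    'Sitemap: https://yourstore.com/sitemap.xml',
  ].join('\n');

  return new Response(body, {
    headers: {
      'Content-Type': 'text/plain',            // not text/html
      'Cache-Control': 'public, max-age=3600', // 1 hour at the edge
    },
  });
}
```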
app/robots.ts (Next.js 13+) · 10 min · free
Next.js 13+ has a first-party file convention. Create `app/robots.ts` exporting a default function returning the `MetadataRoute.Robots` shape: `{ rules: [{ userAgent: '*', allow: '/', disallow: ['/admin', '/checkout', '/cart', '/account'] }, ...], sitemap: 'https://store.com/sitemap.xml' }`. The framework emits a valid robots.txt at the canonical path automatically. Add separate `{ userAgent: 'GPTBot', allow: '/' }` entries for each AI agent if you want explicit welcomes.
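A sketch of that convention file — the domain and the exact disallow list are placeholders to adapt:

```ts
// app/robots.ts — Next.js emits /robots.txt from this shape
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      // Default-open posture with the Shopify-style private paths blocked
      { userAgent: '*', disallow: ['/admin', '/checkout', '/cart', '/account'] },
      // Explicit welcomes — one entry per AI agent you want to admit
      { userAgent: 'GPTBot', allow: '/' },
      { userAgent: 'PerplexityBot', allow: '/' },
    ],
    sitemap: 'https://store.com/sitemap.xml',
  };
}
```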
Cloudflare carve-outs
Three Cloudflare features can intercept legitimate AI crawlers and silently fail this signal:
1. **Bot Fight Mode** — disable it for the storefront zone, or carve a Skip → "All Bot Fight Mode features" rule for known AI user agents.
2. **AI Scrapers and Crawlers managed rule** — Cloudflare's "block AI bots" toggle (Security → Bots → AI Scrapers); turn it off if you want AI shopping surfaces to see your catalog.
3. **Custom WAF rules with `cf.client.bot` in the expression** — review them and exclude the AI-agent UAs.
Test with `curl -A "GPTBot/1.0 (+https://openai.com/gptbot)" https://yourstore.com/robots.txt` — if you get the file, you're clear; if you get a 403 or challenge HTML, you're blocking.
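To check every agent in one go, a small loop over the UA tokens from the table above — bare tokens rather than full UA lines, which is close enough for a WAF smoke test:

```sh
for ua in GPTBot OAI-SearchBot ChatGPT-User PerplexityBot ClaudeBot Google-Extended Applebot-Extended; do
  code=$(curl -s -o /dev/null -w '%{http_code}' -A "$ua" https://yourstore.com/robots.txt)
  echo "$ua -> $code"   # 200 everywhere = clear; 403 = that agent is being challenged
done
```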
5 mistakes we keep finding
1. The storefront_password the founder forgot they set
The single most common cause of a failed robots.txt signal on a brand-new launch. The founder turned on a password during pre-launch testing, finished the site, told the world the URL, and forgot to flip it back off. Every URL on the storefront — including /robots.txt, /sitemap.xml, /products.json — returns the password gate as HTTP 200 HTML. Three floor signals fail simultaneously. The fix is one toggle.
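A quick way to spot the gate, assuming your store is at yourstore.com: a real robots.txt starts with plain-text directives, while the password gate returns HTML.

```sh
# First line of a healthy robots.txt is a directive or comment, not markup
curl -s https://yourstore.com/robots.txt | head -1
# "User-agent: *"   -> healthy
# "<!doctype html>" -> the password gate is intercepting every path
```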
2. Cloudflare Bot Fight Mode auto-blocking AI agents
Bot Fight Mode is on by default for many free Cloudflare zones. It serves a JavaScript challenge to anything that looks bot-like — and "looks bot-like" includes every AI shopping crawler that sets a normal UA. The robots.txt body is fine; the WAF is the problem. Cloudflare added an "AI Scrapers and Crawlers" managed rule in 2024 that defaults to off for new zones but on for some legacy ones; verify by reading the active rules under Security → Bots, not by reading robots.txt.
3. Targeted User-agent: GPTBot block left over from "no AI training" stance
In late 2023, a wave of "block GPTBot to opt out of LLM training" advice landed in operator newsletters. Stores added the rule and never revisited it. By 2026, GPTBot also drives ChatGPT Shopping product retrieval — same crawler, different surface. The rule that opted out of training also opted out of being a candidate when a user asks ChatGPT "where can I buy X." Decide deliberately: if you want AI shopping traffic, Allow: / for these agents.
4. Headless rebuild copied a generic robots.txt scaffold
The Next.js, Astro, and Vercel "generate a robots.txt" tutorials almost universally show User-agent: * + Disallow: /admin + Disallow: /api as the example. Fine for a generic web app — wrong for a Shopify replacement, where you also need to block /checkout, /cart, /account, /orders, the sort-permutation trap, and ideally welcome AI agents explicitly. Mirror the Shopify default at minimum; don't trust generic scaffolds.
5. Disallow: /products as a "stealth launch" pattern
Some operators block /products during a soft-launch period, intending to flip it open at launch. The flip never happens, or happens but the CDN cache holds the old robots.txt for 24 hours, or someone copies the staging robots.txt to production by mistake. Either way: the entire catalog is invisible while every other ranking signal is in good shape. Use storefront_password for stealth launches, not Disallow; the password gate is unambiguous, and removing it is one toggle that you can verify with curl in 5 seconds.
See also
- The 15 signals — full reference
- Sitemap.xml: the discovery surface AI agents read first (the second floor signal that fails the same way when `storefront_password` kills it)
- /products.json: the AI bulk-ingest feed (the third file that fails when robots.txt blocks it)
- Product JSON-LD on PDPs (the signal that produces zero value if robots.txt blocks the PDP fetch)
- The full 18-signal Agentic Storefronts checklist
- Leaderboard: 100 DTC stores scored on robots.txt and 14 other signals
Is your robots.txt actually open?
Free 2-minute scan. We fetch /robots.txt with each major AI-agent UA, parse the rule blocks, and flag any blocker before it costs you a single AI-shopping placement.