Most small-business sites have a robots.txt file. Almost none of the owners we audit have looked at it in two years. That was fine when Googlebot was the only crawler that mattered. It's not fine now. There's a new set of bots knocking, and on a surprising number of platforms the default config tells them to get lost.
The bots are GPTBot, ChatGPT-User, ClaudeBot, PerplexityBot, Google-Extended, and OAI-SearchBot. Six crawlers, six different jobs. Block the wrong ones and you're invisible to ChatGPT Search, Perplexity, and Google's AI Overviews. Allow all of them and you might be feeding your content into a training corpus you'd rather opt out of. Most owners pick a side without knowing they're picking.
There's also a new file most sites don't have at all. It's called llms.txt. It lives at the root of your site like robots.txt does, and it tells AI models which pages on your site are worth reading. Adoption was 10% across 300k domains in a recent SE Ranking survey, up from near-zero a year earlier. Anthropic, Cloudflare, and Vercel ship one. Yours probably should too.
This post walks through both files. Six bots, what each one does, paste-ready snippets for robots.txt, and an example llms.txt you can adapt in twenty minutes.
Why this matters now
Two recent things made this urgent.
First, Cloudflare started rolling out a "Managed robots.txt" feature that blocks GPTBot, ClaudeBot, and other AI crawlers by default for sites that opt in. The intent is reasonable - give site owners one toggle to control AI scraping. The side effect is that thousands of sites are now blocking the same crawlers that index them for ChatGPT Search and Perplexity, without the owner ever editing a file. We've seen audits where the client had no idea their host had flipped the switch.
Second, WP Engine started rate-limiting ClaudeBot and GPTBot at the platform level, with no customer toggle to turn it off. If you're on WP Engine and you wanted to be cited in Claude or ChatGPT, you might already be losing requests at the edge before the bot ever reaches your content.
The thing that used to be set-it-and-forget-it isn't anymore. Your hosting platform may be making decisions on your behalf right now. Pull up the live file, see what's actually there, then decide what you want.
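If you'd rather script that check than eyeball the file, Python's standard library already ships a robots.txt parser. A minimal sketch - the user-agent tokens are the real ones, https://yoursite.com is a placeholder to swap for your own domain:

```python
# Minimal sketch: test a live robots.txt against the six AI crawlers.
# Note: if robots.txt 404s entirely, the parser treats everything as
# allowed, so a clean result here doesn't prove the file exists.
from urllib.robotparser import RobotFileParser

SITE = "https://yoursite.com"  # placeholder - use your own domain
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot",
           "ClaudeBot", "PerplexityBot", "Google-Extended"]

rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the live file

for bot in AI_BOTS:
    verdict = "allowed" if rp.can_fetch(bot, f"{SITE}/") else "BLOCKED"
    print(f"{bot:16} {verdict}")
```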
The 6 AI crawlers, plain English
Here's what each one does and why you'd care.
GPTBot is OpenAI's training crawler. It walks the web collecting pages OpenAI uses to train future models. Allow it and your content can appear in the training data for GPT-5 and beyond. Block it and it can't. One thing to know: GPTBot does NOT power ChatGPT Search citations - that's OAI-SearchBot, covered below.
ChatGPT-User is OpenAI's live-fetch bot. It fires when a ChatGPT user clicks a link in a response or asks ChatGPT to "go read this page." It's not crawling the whole web, it's fetching one URL on behalf of one user, in real time. Block this and ChatGPT can't read pages a user explicitly asked it to read. I cannot think of a good reason to block ChatGPT-User.
ClaudeBot (sometimes also called anthropic-ai in older robots.txt files) is Anthropic's crawler. It's how Claude finds and cites your content during a conversation. Anthropic has been clear that ClaudeBot honors robots.txt. Block it and Claude can't reach you.
PerplexityBot is the index crawler Perplexity uses to power its answers. Perplexity is the smallest of the four AI surfaces by query volume, but its citation rate is high. A Perplexity citation usually links straight back to your URL with attribution, which means the click-through is real.
Google-Extended is the unusual one. It's not really a crawler, it's a switch. Google introduced it in September 2023 specifically so site owners could opt out of letting their content train Gemini without also disappearing from regular Google search. Disallow Google-Extended and Googlebot still crawls you for Search. Only Gemini training is blocked. This separation is genuinely useful and the only reason it's called out here.
OAI-SearchBot is OpenAI's search-specific crawler. This is the one that builds the index ChatGPT Search uses to answer real-time queries with citations. OAI-SearchBot is what gets your business cited when someone asks ChatGPT "best Italian restaurant in Madison." If you only allow ONE of OpenAI's bots, this is the one.
The split that matters: GPTBot is for training, OAI-SearchBot is for live citations. Most SMBs we audit don't want their content in training sets but absolutely want to be cited. So the right policy is usually "allow OAI-SearchBot, disallow GPTBot." That's exactly what Option B below does.
robots.txt: paste-ready snippets
Your robots.txt lives at https://yoursite.com/robots.txt. If it doesn't exist, create it. It's a plain text file: one directive per line, grouped under the User-agent it applies to.
Here are the two configurations we recommend, depending on what you're optimizing for.
Option A: allow everything (max AI visibility)
Use this if you want maximum visibility across all four AI surfaces and you don't care about your content being in training data. This is the default we recommend for most small businesses, because the visibility gain almost always outweighs the training concern at SMB scale.
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
Option B: search-yes, training-no (conservative)
Use this if you want to be cited in AI search results but you'd rather not let your content into the training corpus for future models. This is the right call for sites with a lot of original journalism, proprietary research, or premium content.
# Block training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow live-search and citation crawlers
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
What this does: GPTBot can't crawl you for training, Google-Extended can't pull you into Gemini training, but OAI-SearchBot, ClaudeBot, and PerplexityBot can still index you for live citations. Googlebot (regular search) is unaffected by Google-Extended, so your Google rankings don't change.
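If you want to sanity-check the policy before pushing it live, the same standard-library parser can read the rules from a string. A rough sketch - the assertions mirror the paragraph above, and the policy text is abbreviated to just the rules being exercised:

```python
# Rough pre-deploy check of the Option B policy. POLICY is abbreviated;
# paste in your full file before relying on it.
from urllib.robotparser import RobotFileParser

POLICY = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(POLICY.splitlines())

page = "https://yoursite.com/some-page"
assert not rp.can_fetch("GPTBot", page)           # training crawl blocked
assert not rp.can_fetch("Google-Extended", page)  # Gemini training blocked
assert rp.can_fetch("OAI-SearchBot", page)        # live citations allowed
assert rp.can_fetch("Googlebot", page)            # regular search unaffected
print("Option B behaves as described")
```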
What NOT to do
The mistake we see most often is a blanket User-agent: * Disallow: / left over from a staging environment that nobody updated when the site went live. That blocks everything - Googlebot included - and we've seen sites sit invisible to every crawler for months before anyone noticed. Read your live robots.txt once a quarter.
The second mistake is blocking ChatGPT-User. There's no good reason to do it. The only thing it accomplishes is preventing ChatGPT from reading pages your customers explicitly ask it to read.
llms.txt: the new standard
Robots.txt tells crawlers which pages they're allowed to fetch. It says nothing about which pages are worth reading, which sections matter, or what your business actually does. That's the gap llms.txt fills.
Jeremy Howard of Answer.AI proposed /llms.txt in September 2024. The idea: a markdown file at the root of your site that gives AI models a curated map of your most important content. Think of it as a human-readable sitemap plus a one-paragraph executive summary of your business, written in plain English for an LLM that's trying to figure out who you are and what you do.
The spec defines two files:
- /llms.txt - the curated index. One page, with links to your most important content.
- /llms-full.txt - the full text of your key pages, concatenated. Optional; bigger sites use it, most SMBs don't need to (sketch below).
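If you do want the full-text version, it's just concatenation. A hypothetical sketch - the pages/ folder and file names are invented, the point is the shape of the output:

```python
# Hypothetical sketch: build llms-full.txt by concatenating markdown
# copies of your key pages. Folder layout and file names are assumptions.
from pathlib import Path

KEY_PAGES = ["services.md", "about.md", "pricing-faq.md"]  # your picks

parts = [Path("pages", name).read_text(encoding="utf-8").strip()
         for name in KEY_PAGES]

Path("llms-full.txt").write_text("\n\n---\n\n".join(parts) + "\n",
                                 encoding="utf-8")
print(f"wrote llms-full.txt ({len(parts)} pages)")
```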
Adoption is real. Anthropic, Cloudflare, Vercel, and Mintlify ship llms.txt files today, and the SE Ranking numbers cited above show the long tail following. It is becoming the de facto answer to "how do I tell AI models what my site is about."
What goes in llms.txt
The structure is simple markdown:
- An H1 with the project or business name
- A blockquote with a one-paragraph description
- Optional context paragraphs
- H2 sections grouping links by category, with each link as a markdown bullet in the form [Page name](url): one-line description
Here's a paste-ready example for a small business. Adapt the text, but keep the structure.
# Acme Plumbing
> Acme Plumbing is a family-owned plumbing company serving
> Cook County, IL. We do residential repair, drain cleaning,
> water heater install, and 24/7 emergency service.
> Licensed since 1998. Cash, card, and major insurance accepted.
We are based in Oak Park, IL. Service area: Oak Park, River Forest,
Forest Park, Berwyn, Cicero, and the western half of Chicago.
Our pricing is flat-rate, quoted before any work begins.
## Services
- [Residential Plumbing Repair](https://acmeplumbing.com/residential-repair): Leaks, pipe repair, fixture install, toilet repair.
- [Drain Cleaning](https://acmeplumbing.com/drain-cleaning): Sewer rodding, hydro jetting, camera inspection.
- [Water Heater Installation](https://acmeplumbing.com/water-heaters): Tank and tankless. Same-day install in most cases.
- [Emergency Plumbing](https://acmeplumbing.com/emergency): 24/7 dispatch. 60-minute response in core service area.
## About
- [About Acme](https://acmeplumbing.com/about): Company history, team, licensing.
- [Service Area Map](https://acmeplumbing.com/service-area): Specific zip codes covered.
- [Reviews](https://acmeplumbing.com/reviews): 340+ Google reviews, 4.9 average.
## FAQ
- [Pricing FAQ](https://acmeplumbing.com/faq/pricing): Flat-rate explained, what's included.
- [Emergency FAQ](https://acmeplumbing.com/faq/emergency): What counts as emergency, response times.
## Contact
- [Book Online](https://acmeplumbing.com/book): Self-serve scheduling.
- [Contact Page](https://acmeplumbing.com/contact): Phone, email, hours.
Save that as llms.txt and upload it to the root of your site, so it's reachable at https://yoursite.com/llms.txt. That's it. No schema validation, no submission process. The AI models that respect the format will find it.
Where to host it
Same rules as robots.txt. Web root, plain text (markdown is fine, the file extension is .txt regardless), publicly accessible, no auth wall. If you're on WordPress, the easiest path is uploading the file via FTP or your host's file manager, since WP doesn't natively let you create root-level files. On Squarespace and Wix you'll need to use their custom-file features (or, on some plans, accept that you can't and skip llms.txt). On a static site (Webflow, Astro, Next.js), drop it in your public/static folder.
Do NOT put llms.txt inside a subdirectory. It must be at the root, full stop. https://yoursite.com/files/llms.txt does not count.
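Once it's live, confirm it's reachable the way a bot would fetch it. A minimal check with Python's standard library - a 404 raises HTTPError, which is also your answer:

```python
# Minimal reachability check for llms.txt. The URL is a placeholder.
# urlopen raises urllib.error.HTTPError on a 404 - i.e., not shipped yet.
from urllib.request import urlopen

url = "https://yoursite.com/llms.txt"
with urlopen(url, timeout=10) as resp:
    body = resp.read().decode("utf-8", errors="replace")
    print(f"{url} -> HTTP {resp.status}")
    if not body.lstrip().startswith("# "):
        print("warning: file doesn't open with an H1 ('# Business Name')")
```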
What's coming next
The space is moving. Two things to watch:
There's a debate happening right now about whether allowing AI crawlers actually hurts visibility. The argument: if AI answers the question without sending a click, you've traded traffic for a citation that may or may not convert. After running ~200 audits in 2026, my honest take is that this argument is overrated for SMBs. Most small businesses don't have the organic click volume to lose much, and they have a lot to gain from being cited as the answer when a customer asks ChatGPT "best [your category] near me." For an enterprise publisher with a huge editorial inventory, the math is different. For a 12-employee plumbing company, allow the bots.
There's also a wave of derivative formats popping up. identity.txt, me.txt, agents.txt, each trying to solve a slightly different version of "tell AI models who I am." None of these are standards yet, and most won't survive. Stick with robots.txt and llms.txt for now. If the others get real traction later, they're cheap to add.
How to know your current state
Open a new tab. Type https://yoursite.com/robots.txt and hit enter. Read what's there. If you see anything that says Disallow: / under User-agent: * or under any of the six AI bots above, you have a problem to fix. If the file doesn't load at all, that's also a problem - create one.
Then type https://yoursite.com/llms.txt. If it 404s, you don't have one. That's expected. Use the example above as a starting point.
Our SEO audit at https://cleargradeai.com checks your robots.txt for all 6 major AI bots and flags llms.txt presence and quality automatically. The audit is free, takes 24 hours, and tells you exactly which crawlers are blocked and what to ship to fix it. Worth running before you assume your site is set up correctly.
The 3-step checklist
If you do nothing else this week, do these three things in order:
1. Audit your current robots.txt. Open yoursite.com/robots.txt in a browser. Confirm you're not blocking any of the 6 AI bots above. If you're on Cloudflare, check whether "Managed robots.txt" is on. If you're on WP Engine, ask support what their AI-bot rate limits look like.
2. Add Allow rules for the 6 crawlers. Use Option A or Option B from the snippets above. Push it live. Re-check the live URL to confirm the change actually deployed.
3. Ship an llms.txt. Twenty-minute job. Use the Acme Plumbing template as a starting point. Save as llms.txt, upload to your web root, confirm it loads at yoursite.com/llms.txt.
That's the whole list. The whole exercise takes about an hour for most sites, and it changes whether you show up in AI search at all. We've seen the citation-rate lift in real audits after this exact set of fixes. It's not a hypothetical.
If you want a second set of eyes on the result, our free audit will tell you whether the fixes shipped correctly and whether anything else on your site is blocking AI visibility. Run it after you push the changes. The before/after is the cleanest way to see the lift.