llms.txt & robots.txt: Control AI Crawlers (2026)

Your content can be technically strong, up to date and prepared for search engines, and still barely appear in ChatGPT, Perplexity or other search systems powered by artificial intelligence (AI).

The reason is often not the content itself, but the technical setup: AI crawlers are blocked by robots.txt, firewalls, rules in a content delivery network (CDN), a distributed server network for faster delivery, or bot management. Sometimes this is intentional. Often, however, it happens unintentionally.

That is exactly why the two RankScan insights “Blocked AI Crawlers” and “Missing llms.txt” are so important:

“Blocked AI Crawlers” checks whether relevant AI crawlers can reach your website or important content.
“Missing llms.txt” indicates that no compact, voluntary orientation file for language models was found. This is an optional optimization opportunity, not a critical standards error.

The weighting matters:
Blocked AI crawlers are a critical technical issue. A missing llms.txt is more of a strategic optimization opportunity.

The best solution is therefore not “allow everything” or “block everything”, but a deliberate strategy: Which AI bots are allowed to crawl? Which should be blocked for training? Which should remain allowed for live retrieval and AI search? And which content should be especially easy for language models to understand?

robots.txt controls which crawlers may access which areas of your website.
llms.txt is a voluntary Markdown file that explains important website content to language models; it is not an official web standard like robots.txt.
llms.txt is currently a proposal or de facto format, but not a binding official web standard.
GPTBot, OAI-SearchBot and ChatGPT-User are different OpenAI user agents with different purposes.
Anyone who blocks all AI crawlers across the board protects content from certain uses, but may lose visibility in AI search.
If you want to block training but allow live retrieval, you must control user agents separately.
Googlebot and Google-Extended must not be confused.
robots.txt is not a security mechanism. Real blocking requires server, CDN, login or Web Application Firewall (WAF) rules.
llms.txt replaces neither robots.txt nor the sitemap. It is an optional addition and should not be sold as a mandatory signal.

Why AI Crawlers Are a Website Health Topic #

AI search systems use web content in different ways: for training, search, summaries, source links or user-initiated retrieval. For companies, this creates a new technical dependency.

If a relevant crawler cannot reach your website, this can have several consequences:

Content is not considered, or is considered less strongly, in AI search systems.
Your brand is not mentioned in answers.
Competitors appear as sources even though your content would be more authoritative.
Product, service or guide pages are not classified correctly by AI systems.
Monitoring and optimization become harder because it is unclear where the blockage occurs.

The problem is often invisible. A page can be indexed in Google and still be unreachable for specific AI crawlers.

Typical causes include:

restrictive robots.txt,
Web Application Firewall,
Cloudflare or CDN bot management,
rate limiting,
blocked user agents,
blocked IP ranges,
staging rules that accidentally remained live,
security plugins with broad bot blocks.

That is why “Blocked AI Crawlers” should be classified as critical: this is not a cosmetic SEO optimization, but a question of technical accessibility.

What Are AI Crawlers? #

AI crawlers are automated bots that fetch web pages. They are used by AI providers to find content, make it available for search features, retrieve it for answers or — depending on provider and bot — use it for model training.

OpenAI distinguishes several user agents in its own documentation, including GPTBot, OAI-SearchBot and ChatGPT-User. They serve different purposes and can be controlled separately via robots.txt.
Source: OpenAI Platform – Overview of OpenAI Crawlers

For website owners, this distinction is essential. Anyone who blocks all OpenAI bots across the board does not only block training, but may also block search and retrieval functions that are relevant for visibility in ChatGPT.

Training, AI Search and User Retrieval: The Critical Difference #

Not every AI bot does the same thing. For a meaningful robots.txt strategy, you need to distinguish between three use cases.

1. Model Training #

During training, content is used to improve future AI models. Crawlers such as GPTBot or, depending on the provider, other training crawlers are used for this purpose.

If you do not want your content to be used for training, you can block these crawlers in robots.txt.

Example:

text

User-agent: GPTBot
Disallow: /

This can make sense for legal, strategic or publishing-related reasons.

2. AI Search and Source Retrieval #

Some crawlers are used to find and index content for AI search features or to provide it as a source in answers. Examples include OpenAI’s OAI-SearchBot or Perplexity’s PerplexityBot.

OpenAI notes that public websites can appear in ChatGPT Search and that website owners should ensure they do not block OAI-SearchBot if they want content to appear in ChatGPT summaries and sources.
Source: OpenAI Help – Publishers and Developers FAQ

Perplexity describes its own PerplexityBot as a crawler intended to make websites visible and linkable in Perplexity search results. According to Perplexity, this bot is not used to train foundation models.
Source: Perplexity Docs – Perplexity Crawlers

3. User-Initiated Retrieval #

Some bots fetch pages when users enter a specific URL or request. For OpenAI, ChatGPT-User is relevant for this.

This distinction matters because a user may enter your URL in ChatGPT and expect a summary. If retrieval is blocked, ChatGPT cannot directly read the page.

robots.txt: What It Can Do — and What It Cannot Do #

The robots.txt file is located in the root directory of your website:

text

https://example.ch/robots.txt

It gives crawlers instructions on which areas they may access and which they may not. Google describes robots.txt as a tool for managing crawler traffic, not as a mechanism for keeping a web page out of Google.
Source: Google Search Central – robots.txt Introduction and Guide

A simple example:

text

User-agent: *
Disallow: /internal/

Sitemap: https://example.ch/sitemap.xml

This means: cooperative crawlers should not fetch /internal/. The sitemap also shows where important indexable URLs can be found.
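Rules like these can be sanity-checked offline with Python's standard-library `urllib.robotparser`. The following is a minimal sketch, not a definitive test: it parses the example above from a string, and the crawler name `ExampleBot` and the two URLs are made up for illustration. The parser uses simple prefix matching, so it only approximates how an individual crawler evaluates the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/

Sitemap: https://example.ch/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler; with no group of its own it uses the * group.
internal_ok = rules.can_fetch("ExampleBot", "https://example.ch/internal/report.html")
blog_ok = rules.can_fetch("ExampleBot", "https://example.ch/blog/")
sitemaps = rules.site_maps()  # available since Python 3.8

print("internal allowed:", internal_ok)
print("blog allowed:", blog_ok)
print("sitemaps:", sitemaps)
```

With the rules above, the internal URL is reported as disallowed, the blog URL as allowed, and the sitemap URL is returned as a one-element list.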

Important:

robots.txt is based on cooperation.
Reputable crawlers usually respect it.
Not all bots reliably follow it.
Confidential content must not be protected only via robots.txt.
Server, login, CDN and WAF rules are more important for real protection.

A 2025 study examined robots.txt compliance among different scrapers and concluded that not all bot categories reliably respect robots.txt. Especially in AI and scraping contexts, robots.txt should therefore not be understood as the only protection mechanism.
Source: arXiv – Scrapers selectively respect robots.txt directives

The Most Important AI Crawlers and User Agents #

These user agents are important when controlling AI crawlers:

User-Agent	Provider	Typical purpose	Recommendation
`GPTBot`	OpenAI	Model training	Allow or block depending on data strategy
`OAI-SearchBot`	OpenAI	ChatGPT Search / search retrieval	Usually allow for ChatGPT visibility
`ChatGPT-User`	OpenAI	User-initiated retrieval	Usually allow for direct URL retrieval
`ClaudeBot`	Anthropic	Crawling by Anthropic	Review depending on data strategy
`PerplexityBot`	Perplexity	Perplexity search and source links	Usually allow for visibility in Perplexity
`Google-Extended`	Google	Control of certain AI uses by Google	Do not confuse with Googlebot
`Googlebot`	Google	Classic Google Search	Do not block if Google Search visibility should be preserved

Anthropic also describes the option for website owners to block access via robots.txt.
Source: Anthropic Support – Does Anthropic crawl data from the web?

Googlebot vs. Google-Extended: The Most Common Mistake #

A particularly important point is the distinction between Googlebot and Google-Extended.

Googlebot is relevant for classic Google Search.
Google-Extended is a control token for certain AI uses by Google.

Google explains that Google-Extended does not have a separate Hypertext Transfer Protocol (HTTP) user agent string. Crawling happens through existing Google user agents; Google-Extended is used in robots.txt as a product token.
Source: Google Crawling Infrastructure – Google-Extended

Example:

text

User-agent: Google-Extended
Disallow: /

This rule is different from:

text

User-agent: Googlebot
Disallow: /

The second rule would endanger your classic Google visibility. Therefore:

If you want to restrict Google AI usage, do not accidentally block Googlebot.
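The difference between the two rules can be made visible with a small offline check. This is a sketch using Python's `urllib.robotparser`; it only evaluates which rule group applies to which token and says nothing about how Google processes the token internally. The helper name `may_crawl` is an assumption for this example.

```python
from urllib.robotparser import RobotFileParser

AI_ONLY = "User-agent: Google-Extended\nDisallow: /\n"
SEARCH_BLOCKED = "User-agent: Googlebot\nDisallow: /\n"


def may_crawl(robots_txt: str, token: str, url: str = "https://example.ch/") -> bool:
    """Return True if the rules allow the given token to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(token, url)


results = {
    ("AI_ONLY", "Googlebot"): may_crawl(AI_ONLY, "Googlebot"),
    ("AI_ONLY", "Google-Extended"): may_crawl(AI_ONLY, "Google-Extended"),
    ("SEARCH_BLOCKED", "Googlebot"): may_crawl(SEARCH_BLOCKED, "Googlebot"),
    ("SEARCH_BLOCKED", "Google-Extended"): may_crawl(SEARCH_BLOCKED, "Google-Extended"),
}

for (ruleset, token), allowed in results.items():
    print(f"{ruleset:<15} {token:<16} {'allowed' if allowed else 'blocked'}")
```

Only the first rule set leaves Googlebot untouched; the second blocks classic search crawling while doing nothing about the Google-Extended token.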

GPTBot robots.txt: How to Control OpenAI Correctly #

OpenAI crawlers should be handled in a differentiated way.

Block GPTBot #

If you do not want GPTBot to crawl your content:

text

User-agent: GPTBot
Disallow: /

Allow OAI-SearchBot #

If you want your content to remain accessible for ChatGPT Search:

text

User-agent: OAI-SearchBot
Allow: /

Allow ChatGPT-User #

If users should be able to retrieve your pages in ChatGPT:

text

User-agent: ChatGPT-User
Allow: /

Combined Strategy #

Many companies want to restrict training but preserve visibility in AI search:

text

User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

This is often more sensible than a blanket block of all OpenAI crawlers.
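To confirm that a combined rule set behaves as intended, the three OpenAI user agents can be audited in one pass. The sketch below assumes the purposes described in this article (providers may change them); the `audit` helper and the `/pricing` URL are hypothetical.

```python
from urllib.robotparser import RobotFileParser

COMBINED = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /
"""

# Purposes as described in this article; treat them as assumptions, not guarantees.
OPENAI_AGENTS = {
    "GPTBot": "model training",
    "OAI-SearchBot": "ChatGPT Search",
    "ChatGPT-User": "user-initiated retrieval",
}


def audit(robots_txt: str, agents: dict[str, str], url: str) -> dict[str, bool]:
    """Map each user agent to whether robots.txt allows it to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


report = audit(COMBINED, OPENAI_AGENTS, "https://example.ch/pricing")
for agent, allowed in report.items():
    status = "allowed" if allowed else "blocked"
    print(f"{agent:<14} {OPENAI_AGENTS[agent]:<25} {status}")
```

The same helper can be pointed at any other rule set and agent list, for example the provider mix used in a training-blocked, search-allowed setup.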

Three Strategies for robots.txt and AI Crawlers #

There is no universally correct setting. The right strategy depends on how openly your content may be used.

Strategy A: Maximum AI Visibility #

This strategy is suitable for websites that want to be as open as possible to AI search systems.

text

User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

Sitemap: https://example.ch/sitemap.xml

Suitable for:

guide portals,
SaaS websites,
public documentation,
companies with a strong thought leadership focus,
brands that want maximum discoverability.

Risk: Depending on the provider, content may also be used for purposes you do not control.

Strategy B: Block Training, Allow AI Search #

This strategy is suitable for companies that do not want to release content for model training but still want to remain visible in AI search systems.

text

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

Sitemap: https://example.ch/sitemap.xml

Suitable for:

B2B companies,
publishers with a differentiated data strategy,
brands with ,
companies that want to use AI search but limit training.

Risk: The exact separation between training, search and retrieval depends on the respective provider and may change.

Strategy C: Protect Sensitive Areas #

This strategy is suitable for websites where public content should remain visible, but specific areas must not be crawled.

text

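User-agent: *
Disallow: /internal/
Disallow: /staging/
Disallow: /downloads/confidential/

# A crawler that finds its own group ignores the * group, so repeat the blocks.
User-agent: OAI-SearchBot
User-agent: PerplexityBot
Disallow: /internal/
Disallow: /staging/
Disallow: /downloads/confidential/

Sitemap: https://example.ch/sitemap.xml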

Important: Confidential content must not be protected only via robots.txt. If content is truly private, it needs login protection, server-side rules or WAF configurations.
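One detail deserves a check in any layout that combines a `*` group with groups for named bots: under the Robots Exclusion Protocol (RFC 9309), a crawler that finds a group naming it follows only that group and does not inherit the `*` rules. The sketch below uses Python's `urllib.robotparser` with hypothetical paths. It shows how a named group containing only `Allow: /` reopens a directory that the `*` group blocks, and how repeating the `Disallow` line closes it again.

```python
from urllib.robotparser import RobotFileParser

# The named group only says "Allow: /", so it does not inherit the * rules.
NAMED_GROUP_ALLOWS_ALL = """\
User-agent: *
Disallow: /internal/

User-agent: OAI-SearchBot
Allow: /
"""

# The named group repeats the restriction explicitly.
NAMED_GROUP_REPEATS_RULE = """\
User-agent: *
Disallow: /internal/

User-agent: OAI-SearchBot
Disallow: /internal/
"""

URL = "https://example.ch/internal/plan.html"  # hypothetical URL


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


named_bot_open = can_fetch(NAMED_GROUP_ALLOWS_ALL, "OAI-SearchBot", URL)
other_bot_open = can_fetch(NAMED_GROUP_ALLOWS_ALL, "ExampleBot", URL)  # made-up bot, uses *
named_bot_fixed = can_fetch(NAMED_GROUP_REPEATS_RULE, "OAI-SearchBot", URL)

print("named bot, Allow-only group:", named_bot_open)
print("unnamed bot, * group:", other_bot_open)
print("named bot, repeated Disallow:", named_bot_fixed)
```

In the first rule set the named bot may fetch the internal URL even though every other bot may not, which is rarely what a "protect sensitive areas" setup intends.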

What Is an llms.txt? #

The llms.txt is a Markdown file in the root directory of a website:

text

https://example.ch/llms.txt

It is intended to give language models a compact overview of important content. The idea was proposed in 2024 by Jeremy Howard and Answer.AI. The file is designed as human- and machine-readable Markdown and is intended to help large language models (LLMs) understand relevant website content more quickly.
Sources: llms.txt – The /llms.txt file, Answer.AI – /llms.txt proposal

The realistic classification is important:

llms.txt is currently a proposal or voluntary de facto format. It is not a binding official web standard and not a guarantee of better rankings or AI citations.

Nevertheless, an llms.txt can be useful because it explains complex websites to LLMs in a more structured way.

llms.txt vs. robots.txt: The Difference #

Feature	robots.txt	llms.txt
Purpose	Control access	Explain content
Question	May a bot crawl?	Which content is important?
Format	Text directives	Markdown
Status	Established standard	Voluntary proposal
Security effect	Limited, cooperation-based	No protective effect
Typical use	Allow or block crawlers	Give LLMs orientation
Location	`/robots.txt`	`/llms.txt`

In short:

robots.txt decides whether cooperative bots may crawl.
llms.txt explains which content is important.

Why a Missing llms.txt Is an Optimization Opportunity #

The “Missing llms.txt” insight does not mean that your website is technically broken. Classic search engines do not need an llms.txt to crawl your website.

Still, a missing llms.txt can be a disadvantage if your website has many important pieces of content that should be classified cleanly by language models.

An llms.txt is particularly useful for:

SaaS websites,
documentation,
guide portals,
shops with complex categories,
universities and institutions,
B2B companies with services that require explanation,
websites with many similar content areas.

A good llms.txt can highlight important pages:

central service pages,
product and feature pages,
guides,
documentation,
pricing and contact pages,
about pages,
author or expert profiles,
important help pages.

Create llms.txt: Structure and Example #

An llms.txt should not be a full sitemap. It is a curated orientation aid.

A useful structure:

Website name
short description
most important topics
central pages
documentation or guides
optional: classification notes

Example:

text

# Example AG

> Example AG is a Swiss provider of B2B software for project planning, resource management and forecasting.

## Important Pages

- [Homepage](https://example.ch/) - Overview of offering and target audiences.
- [Features](https://example.ch/features) - Description of the most important product features.
- [Pricing](https://example.ch/pricing) - Current packages and conditions.
- [Contact](https://example.ch/contact) - Inquiry and consultation.

## Guides

- [Improve project planning](https://example.ch/blog/project-planning) - Basics and best practices.
- [Resource planning in teams](https://example.ch/blog/resource-planning) - Tips for agencies and service providers.

## Notes

This website is aimed at Swiss SMEs, agencies and service companies. Please preferably use the linked pages as sources for current information.
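Because the file is plain Markdown, its structure is easy to inspect programmatically. The following sketch groups link entries by section and lists links that lack a description. The parser is an illustration written for this article, not an official llms.txt tool; its regular expression accepts either a hyphen, as in the example above, or a colon before the description.

```python
import re

LLMS_TXT = """\
# Example AG

> Example AG is a Swiss provider of B2B software for project planning.

## Important Pages

- [Homepage](https://example.ch/) - Overview of offering and target audiences.
- [Pricing](https://example.ch/pricing) - Current packages and conditions.

## Guides

- [Improve project planning](https://example.ch/blog/project-planning) - Basics and best practices.
"""

# One list item: "- [Title](URL)" followed by an optional "- note" or ": note".
LINK = re.compile(
    r"^- \[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?:\s*[-:]\s*(?P<note>.*))?$"
)


def parse_llms_txt(text: str) -> dict[str, list[dict[str, str]]]:
    """Group the link entries of an llms.txt by their '## ' section heading."""
    sections: dict[str, list[dict[str, str]]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = LINK.match(line)
            if match:
                sections[current].append(
                    {
                        "title": match["title"],
                        "url": match["url"],
                        "note": match["note"] or "",
                    }
                )
    return sections


sections = parse_llms_txt(LLMS_TXT)
missing_notes = [
    entry["url"]
    for entries in sections.values()
    for entry in entries
    if not entry["note"]
]

for name, entries in sections.items():
    print(name, [entry["url"] for entry in entries])
print("links without description:", missing_notes)
```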

Best Practices for a Good llms.txt #

1. Curate Instead of Linking Everything #

The llms.txt is not a second sitemap. Link only to pages that are important for understanding, authority or conversion.

2. Briefly Explain Every URL #

Every link should include a short description. This helps language models understand why the page is relevant.

3. Use Public and Current Content #

Only link content that is publicly accessible, current and intentionally machine-readable.

4. Do Not Mention Confidential Information #

The llms.txt is publicly accessible. It must not contain internal notes, private URLs or sensitive information.

5. Align It With robots.txt and the Sitemap #

The llms.txt should not point to pages that are blocked in robots.txt, excluded via [noindex](/blog/indexing-noindex-robots-txt) or non-canonical.
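The robots.txt part of this alignment can be checked automatically. The sketch below extracts every linked URL from an llms.txt and reports those that robots.txt disallows for a given user agent. The sample files and the helper name `blocked_links` are assumptions for illustration, and the check does not cover noindex or canonical signals, which would require fetching the pages themselves.

```python
import re
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /internal/
"""

LLMS_TXT = """\
## Important Pages

- [Pricing](https://example.ch/pricing) - Current packages and conditions.
- [Roadmap](https://example.ch/internal/roadmap) - Planned features.
"""


def blocked_links(llms_txt: str, robots_txt: str, agent: str) -> list[str]:
    """Return the llms.txt URLs that robots.txt disallows for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pull the target out of every Markdown link of the form [text](https://...).
    urls = re.findall(r"\]\((https?://[^)\s]+)\)", llms_txt)
    return [url for url in urls if not parser.can_fetch(agent, url)]


conflicts = blocked_links(LLMS_TXT, ROBOTS_TXT, "OAI-SearchBot")
print("llms.txt links blocked by robots.txt:", conflicts)
```

Here the roadmap link is flagged because it sits under the disallowed `/internal/` path, while the pricing link passes.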

6. Maintain It Regularly #

An outdated llms.txt can send misleading signals. It should be part of normal website maintenance.

What a Good AI Crawler Check Looks At #

This topic is not only a content topic, but also a technical website health check.

A good check includes:

Is /robots.txt accessible?
Is /llms.txt present?
Are there blanket blocks for all bots?
Are known AI crawlers explicitly blocked?
Is Googlebot accidentally affected by AI rules?
Is Google-Extended used correctly?
Are GPTBot, OAI-SearchBot and ChatGPT-User controlled separately?
Is the sitemap referenced in robots.txt?
Does the llms.txt contain only public, important and current URLs?
Does the llms.txt link to URLs that are blocked or not indexable?
Are there signs of 401, 403, 429 or 5xx issues for AI crawlers?
Does a firewall block relevant bots despite permissive robots.txt rules?

This makes the point clear: “Missing llms.txt” is only one part of the problem. What matters is the combination of crawler access, technical accessibility and structured orientation.

How to Check Whether AI Crawlers Are Blocked #

1. Open robots.txt Directly #

Open this in the browser:

text

https://your-domain.ch/robots.txt

Look for rules such as:

text

User-agent: GPTBot
Disallow: /

or:

text

User-agent: *
Disallow: /

Such rules can restrict or completely block AI crawlers.

2. Check Server Logs #

Server logs show which bots visit your website and which status codes they receive.

Important status codes:

Status code	Meaning
200	Access successful
301/302	Redirect
401	Authentication required
403	Access forbidden
404	URL not found
429	Too many requests
5xx	Server error

If OAI-SearchBot, PerplexityBot or other relevant crawlers repeatedly receive 403 responses, this is a clear warning signal.
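A short script can summarize which status codes each AI crawler receives. This sketch assumes access logs in the common "combined" format; the sample lines, IP addresses and shortened user-agent strings are invented for illustration and do not reproduce the providers' real strings.

```python
import re
from collections import Counter, defaultdict

AI_TOKENS = ("GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot")

# Matches: "METHOD /path HTTP/x" STATUS BYTES "referer" "user agent"
LOG_LINE = re.compile(
    r'"\S+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def status_by_bot(lines):
    """Count HTTP status codes per AI crawler token found in the user agent."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines in another format
        agent = match["agent"].lower()
        for token in AI_TOKENS:
            if token.lower() in agent:
                counts[token][int(match["status"])] += 1
                break
    return counts


def is_problem(status: int) -> bool:
    """401, 403, 429 and 5xx are the codes worth investigating."""
    return status in (401, 403, 429) or status >= 500


SAMPLE_LOG = [
    '192.0.2.10 - - [01/Mar/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.10 - - [01/Mar/2026:10:00:05 +0000] "GET /features HTTP/1.1" 403 512 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '192.0.2.20 - - [01/Mar/2026:10:01:00 +0000] "GET / HTTP/1.1" 200 8123 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '192.0.2.30 - - [01/Mar/2026:10:02:00 +0000] "GET / HTTP/1.1" 200 8123 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

counts = status_by_bot(SAMPLE_LOG)
flagged = sorted(
    bot for bot, codes in counts.items() if any(is_problem(code) for code in codes)
)

for bot, codes in counts.items():
    print(bot, dict(codes))
print("needs a closer look:", flagged)
```

Matching on the user-agent string alone can be fooled by spoofed requests, so the result is a starting point for investigation rather than proof.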

3. Check Firewall, CDN and Bot Management #

Many blocks do not originate in robots.txt, but at infrastructure level.

Typical systems:

Cloudflare,
Akamai,
Fastly,
WordPress security plugins,
bot management rules,
rate limiting,
country blocks or blocks based on Autonomous System Number (ASN), the identifier of a network range.

In connection with bot management and Managed robots.txt, Cloudflare also points out that AI crawlers are now used for training, search answers and other purposes, and that website owners increasingly need technical control options.
Source: Cloudflare Docs – Managed robots.txt

Common Mistakes With robots.txt, AI Crawlers and llms.txt #

Mistake 1: Blocking All Bots #

text

User-agent: *
Disallow: /

This rule makes sense for staging websites. On a live website, it can prevent visibility in both classic search and AI search.

Mistake 2: Blocking Googlebot Instead of Google-Extended #

Anyone who blocks Googlebot endangers classic Google Search. Anyone who wants to control certain Google AI uses must understand and correctly apply Google-Extended.

Mistake 3: Checking Only GPTBot #

Many teams check only GPTBot. But OAI-SearchBot and ChatGPT-User are also relevant for ChatGPT visibility.

Mistake 4: Treating robots.txt as a Security System #

robots.txt is not protection for confidential content. Anyone who wants to protect sensitive data needs server-side access control.

Mistake 5: Selling llms.txt as a Ranking Lever #

An llms.txt is useful for orientation, but it is not a guarantee of better rankings or mentions in AI answers.

Mistake 6: Linking Blocked Pages in llms.txt #

If an llms.txt points to pages that are blocked, not indexable or outdated, contradictory signals are created.

Mistake 7: Skipping Log File Checks #

A robots.txt can look correct while a WAF still blocks access. Without server logs, the problem often remains invisible.

Example: When the Firewall Prevents AI Visibility #

A B2B company ranks well in Google, but is barely mentioned as a source in ChatGPT and Perplexity.

The robots.txt looks clean:

text

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

Nevertheless, the server logs show:

text

OAI-SearchBot    403 Forbidden
PerplexityBot    403 Forbidden

The cause is not in robots.txt, but in the Web Application Firewall. It automatically blocks unknown bots.

The solution:

Identify relevant AI crawlers.
Check the providers’ official documentation.
Adjust WAF rules.
Test bot access again.
Set up monitoring.
Add llms.txt to make central content easier to discover.

The example shows: AI visibility is not only content optimization. It starts with technical accessibility.
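The "test bot access again" step can be approximated from the command line. The sketch below sends a HEAD request with a user agent that contains the crawler token and returns the status code. The user-agent string is a simplified stand-in, not the provider's real string, and firewalls that verify crawler IP ranges may treat such a request differently from the genuine bot. A 403 here points to user-agent-based blocking, but a 200 does not prove that the real crawler gets through.

```python
import urllib.error
import urllib.request


def build_probe(url: str, token: str) -> urllib.request.Request:
    """Build a HEAD request whose User-Agent contains the crawler token."""
    # Simplified stand-in (assumption), not the provider's real user-agent string.
    user_agent = f"Mozilla/5.0 (compatible; {token}/1.0)"
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method="HEAD"
    )


def probe(url: str, token: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the server sends for this user agent."""
    try:
        with urllib.request.urlopen(build_probe(url, token), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx and 5xx responses arrive as exceptions


# Performs real network requests, so it is left commented out here:
# for token in ("OAI-SearchBot", "PerplexityBot"):
#     print(token, probe("https://your-domain.ch/", token))
```

Comparing the result with a request that uses a regular browser user agent helps separate bot-specific rules from general availability problems.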

Checklist: Control AI Crawlers Deliberately #

Check your website:

Is /robots.txt accessible?
Is /llms.txt present?
Is the sitemap linked in robots.txt?
Is there a blanket block via User-agent: *?
Is the regular Googlebot blocked?
Is Google-Extended used deliberately?
Are OpenAI crawlers controlled separately?
Is OAI-SearchBot allowed if ChatGPT visibility is desired?
Is PerplexityBot allowed if Perplexity visibility is desired?
Are there WAF, CDN or security rules that block AI bots?
Do relevant bots receive 200 status codes?
Are there 401, 403, 429 or 5xx issues?
Does the llms.txt point only to public, current pages?
Are robots.txt, llms.txt, sitemap, canonicals and noindex rules aligned?

In addition, the AI Readiness Score and JavaScript SEO help narrow down the cause cleanly and prioritize the next SEO actions.

Frequently Asked Questions (FAQ) About llms.txt, robots.txt and AI Crawlers #

What is an llms.txt?

An llms.txt is a Markdown file in the root directory of a website. It is intended to give language models a short overview of important content, topics and links.

Is llms.txt an official standard?

No. llms.txt is currently a voluntary proposal or de facto format. It can be useful, but it is not a guarantee of better AI visibility.

What is the difference between robots.txt and llms.txt?

robots.txt controls crawler access. llms.txt explains to language models which content is important. One file controls, the other provides orientation.

Can I block AI crawlers with robots.txt?

Yes, cooperative crawlers can be blocked via robots.txt. For real access security, however, robots.txt rules are not enough.

Should I block GPTBot?

That depends on your strategy. If you want to prevent training, you can block GPTBot. If you want maximum openness, you allow it. The important thing is not to confuse GPTBot with OAI-SearchBot or ChatGPT-User.

Will I lose Google rankings if I block Google-Extended?

Not automatically. Google-Extended is not the same as Googlebot. Googlebot remains relevant for classic Google Search. That is why you must not accidentally block Googlebot if you only want to control Google AI usage.

Does every website need an llms.txt?

Technically, no. But for websites with lots of content, guide sections, documentation or a strategic interest in AI visibility, an llms.txt can be useful.

How do I find out whether AI crawlers visit my website?

The most reliable source is server logs. They show user agent, URL, time and status code.

What does “Blocked AI Crawlers” mean?

The insight means that relevant AI crawlers cannot fetch your website or important areas. This can happen because of robots.txt, firewalls, server rules or CDN bot management.

What does “Missing llms.txt” mean?

The insight means that no /llms.txt was found. This is not a classic SEO error, but a sign of unused potential for AI-readable content structure.

Is an llms.txt enough for ChatGPT SEO?

No. For ChatGPT SEO or AI visibility, you primarily need good, accessible, current and trustworthy content. The llms.txt can additionally help make important pages visible in a structured way.

Conclusion: Control AI Crawlers Deliberately Instead of Blocking Them by Accident #

AI visibility does not come only from good content. It requires AI systems to be able to reach and classify your content technically.

The most important first step is therefore the technical check:

Can relevant AI crawlers fetch your website?
Are they blocked by robots.txt?
Are they blocked by firewall, CDN or bot management?
Are training, AI search and user retrieval separated cleanly?
Is there an llms.txt that summarizes central content clearly?

robots.txt is the access control.
llms.txt is the orientation aid.
Server logs and WAF rules show what actually happens.

For many companies, the best strategy is differentiated control:

deliberately allow or block training,
enable live retrieval and AI search where possible,
protect sensitive areas technically,
provide llms.txt as an orientation aid,
regularly check server logs and WAF rules.

This prevents your website from unintentionally becoming invisible to AI systems — and creates better conditions for being considered in AI answers, AI searches and modern search experiences.
