noindex & robots.txt: Control Crawling and Indexing

A single noindex in the wrong place can remove important pages from Google. An overly restrictive robots.txt file can keep Google from crawling important content in the first place. And the most critical point: if a page is blocked via robots.txt, Google cannot reliably see a noindex directive placed on that page.

That is exactly why RankScan bundles the insights “Noindex pages” and “Content blocked by robots.txt” in the Website Health category with High priority.

These two signals affect the technical foundation of your visibility:

“Noindex pages”: Pages are blocked from indexing.
“Content blocked by robots.txt”: Important content or resources are excluded from crawling via robots.txt.

Both can be intentional. But both can also happen accidentally — especially after relaunches, staging releases, updates or plugin changes.

In this article, you will learn how noindex and robots.txt work, how they differ, which mistakes are especially dangerous and how to respond correctly after a RankScan finding.

robots.txt controls which URLs crawlers are allowed to fetch.
noindex controls whether a crawled page may appear in the search index.
Disallow in robots.txt is not a reliable way to prevent indexing.
For Google to see noindex, the page must be crawlable.
noindex can be set as a meta robots tag or an X-Robots-Tag HTTP header.
nofollow is not the same as noindex.
Sensitive content should not be protected with robots.txt, but with login, password protection or server-side access control.
Staging pages should ideally be protected by authentication, not only by noindex.
A good check verifies whether important pages are accidentally set to noindex or blocked by robots.txt.
Particularly critical errors affect indexable pages, the homepage, main navigation, product categories, service pages and guide articles.

Crawling vs. Indexing: The Most Important Difference #

To use noindex and robots.txt correctly, you need to distinguish between two processes.

Crawling #

During crawling, a bot fetches a URL. The bot downloads HTML (Hypertext Markup Language, the markup language for web pages), images, CSS (Cascading Style Sheets, the language used for web page layout), JavaScript or other resources.

The robots.txt file can control which URLs a crawler is allowed to fetch.

Google describes robots.txt as a file that tells search engine crawlers which URLs on a site they may access. It is mainly used to manage crawler traffic and avoid unnecessary server load.
Source: Google Search Central – robots.txt Introduction and Guide
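
As a minimal sketch of how such rules are evaluated, the following uses Python's standard `urllib.robotparser`. The rules, the `example.ch` URLs and the helper name `is_crawlable` are assumptions for illustration. The standard-library parser matches plain path prefixes and applies the first matching rule in file order, so it only approximates Google's matching (which prefers the most specific rule and supports wildcards).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /search/
"""


def is_crawlable(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.ch/blog/post", "https://example.ch/admin/login"):
        print(url, is_crawlable(ROBOTS_TXT, "Googlebot", url))
```

The check answers only the crawling question. Whether a crawlable page ends up in the index is decided separately.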

Indexing #

During indexing, Google processes crawled content and decides whether a URL should be added to the search index.

noindex controls exactly this step: the page may be crawled, but should not appear in the search index.

Google explains that noindex can be used as a meta tag or as an HTTP (Hypertext Transfer Protocol, the web’s transfer protocol) header to prevent indexing. Important: the page must not be blocked by robots.txt, otherwise Google cannot see the noindex instruction.
Source: Google Search Central – Block Search indexing with noindex

What Does noindex Mean? #

noindex is a robots directive that tells search engines:

This page should not appear in search results.

Typical HTML implementation in the <head>:

html

<meta name="robots" content="noindex, follow">

This means:

noindex: Do not index the page.
follow: Links on the page may be followed.

follow is the default behavior, so it is not strictly required, but it is often written out to make the intention clear.

Google documents that robots meta tags and X-Robots-Tags can only be read if crawlers have access to the page.
Source: Google Search Central – Robots Meta Tags

Meta noindex vs. X-Robots-Tag #

There are two important ways to set noindex.

1. Meta Robots Tag in the HTML #

For normal HTML pages:

html

<meta name="robots" content="noindex">

or:

html

<meta name="robots" content="noindex, follow">

The tag belongs in the page’s <head>.
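
To verify what a page actually outputs, the tag can be read from the HTML. The following is a minimal sketch using Python's standard `html.parser`; the class and function names are illustrative assumptions, and it only looks at `name="robots"`, not at crawler-specific names.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").strip().lower() != "robots":
            return
        # The content attribute holds a comma-separated list of directives.
        for part in attributes.get("content", "").split(","):
            directive = part.strip().lower()
            if directive:
                self.directives.add(directive)


def meta_robots_directives(html: str) -> set:
    """Return all robots directives found in the HTML, lowercased."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.directives
```

Calling `meta_robots_directives()` on the raw HTML response shows what is delivered by the server, before any JavaScript runs.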

2. X-Robots-Tag in the HTTP Header #

For PDFs, images, files or server-side rules, noindex can be set in the HTTP header:

http

X-Robots-Tag: noindex

This is especially useful for:

PDFs,
Word documents,
images,
download files,
automatically generated files,
entire file types.

Google points out that the X-Robots-Tag is especially useful for controlling the indexing of non-HTML files.
Source: Google Search Central – Meta tags and attributes that Google supports
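
A header value can be checked in a similar way. This sketch assumes you already have the raw `X-Robots-Tag` header values as strings; the function name and the default crawler name are illustrative. It handles the crawler-scoped form (for example `googlebot: noindex`) and treats `none` as shorthand for `noindex, nofollow`.

```python
# Directives that use "name: value" syntax themselves and are not crawler names.
VALUE_DIRECTIVES = {
    "unavailable_after",
    "max-snippet",
    "max-image-preview",
    "max-video-preview",
}


def header_sets_noindex(header_values, user_agent="googlebot"):
    """Return True if any X-Robots-Tag value applies noindex to user_agent."""
    for value in header_values:
        prefix, separator, rest = value.partition(":")
        prefix = prefix.strip().lower()
        is_agent_prefix = (
            bool(separator) and "," not in prefix and prefix not in VALUE_DIRECTIVES
        )
        if is_agent_prefix:
            # "googlebot: noindex" scopes the directives to one crawler.
            if prefix != user_agent.lower():
                continue
            value = rest
        directives = {part.strip().lower() for part in value.split(",")}
        if directives & {"noindex", "none"}:
            return True
    return False
```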

What Does “Noindex Pages” Mean? #

The RankScan insight “Noindex pages” means: RankScan has found pages that are excluded from indexing via noindex.

That can be correct.

Examples of intentionally set noindex:

internal search results,
cart,
checkout,
login,
thank-you pages,
filter pages without search value,
thin archive pages,
staging or test pages,
internal documentation,
duplicate or low-value pages.

It becomes problematic when important pages are affected:

homepage,
service pages,
product pages,
category pages,
guide articles,
location pages,
landing pages,
pages in the sitemap,
pages with organic rankings,
pages with backlinks,
pages with .

A good check therefore does not only report that noindex exists, but also assesses whether it makes sense for the page type.

What Does “Content Blocked by robots.txt” Mean? #

The RankScan insight “Content blocked by robots.txt” means: The robots.txt file blocks content or resources that could be relevant for crawling, rendering or visibility.

Typical examples:

text

User-agent: *
Disallow: /

or:

text

User-agent: *
Disallow: /blog/

or:

text

User-agent: *
Disallow: /assets/

This can cause problems if important pages, CSS, JavaScript, images or other resources can no longer be crawled.

In its technical requirements, Google explains that a blocked URL can still appear in search results if Google discovers it through links. To prevent a page from being indexed, noindex should be used — and Google must be able to crawl the URL.
Source: Google Search Central – Technical requirements

The Critical Mistake: robots.txt Blocks noindex #

The most common misunderstanding:

text

User-agent: *
Disallow: /old-page/

and at the same time on the page:

html

<meta name="robots" content="noindex">

At first glance, this looks doubly safe. In reality, it is contradictory.

If Google is not allowed to crawl the page because of robots.txt, Google cannot see the noindex tag on that page. The URL may still appear in search results, for example if other pages link to it.

Google explicitly states: For noindex to work, the page must not be blocked by robots.txt and must be accessible to the crawler.
Source: Google Search Central – Block Search indexing with noindex

The rule is:

If a page should leave the index: keep it crawlable and set noindex.
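
The interaction can be expressed as a small check. This is a sketch under assumptions: `assess` and `BLOCKING_RULES` are hypothetical names, the noindex flag is passed in rather than fetched, and Python's `urllib.robotparser` only approximates Google's rule matching.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules matching the example above.
BLOCKING_RULES = """\
User-agent: *
Disallow: /old-page/
"""


def assess(robots_txt, url, has_noindex, user_agent="Googlebot"):
    """Describe how robots.txt and noindex interact for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)

    if has_noindex and not crawlable:
        return "conflict: noindex cannot be read because crawling is blocked"
    if has_noindex:
        return "noindex is readable: the page can leave the index"
    if not crawlable:
        return "blocked: not crawled, but the URL may still be indexed via links"
    return "crawlable and indexable"
```

For the example above, `assess(BLOCKING_RULES, "https://example.ch/old-page/", True)` reports the conflict; removing the Disallow rule resolves it.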

robots.txt Is Not a Security Mechanism #

The robots.txt file is publicly accessible:

text

https://example.ch/robots.txt

It is not password protection and not access control.

Not suitable for:

confidential documents,
customer data,
internal files,
staging content,
private PDFs,
admin areas with sensitive information.

For sensitive content, you need:

login,
password protection,
HTTP Auth,
server-side access control,
VPN (Virtual Private Network),
IP restriction,
secure role and permission management.

Robots.txt is based on cooperation. Reputable crawlers respect it, but it does not protect content from direct access.

noindex, nofollow, index follow: What Means What? #

These terms are often confused.

Directive	Meaning
`index`	Page may be indexed
`noindex`	Page should not be indexed
`follow`	Links on the page may be followed
`nofollow`	Links on the page should not be followed
`noindex, follow`	Do not index the page, still follow links
`noindex, nofollow`	Do not index the page and do not follow links
`index, nofollow`	Index the page, do not follow links

In practice, noindex, follow is often more useful than noindex, nofollow when a page should not be indexed but its internal links are still useful for crawling and orientation.

In an older blog post, Google recommends combining multiple robots values in one meta tag where possible to avoid conflicts.
Source: Google Search Central Blog – Using the robots meta tag
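
The table can be turned into a small interpreter for a combined `content` value. This is a sketch, not a full implementation of any search engine's logic: the function name is an assumption, missing values fall back to the defaults (index, follow), and conflicting values resolve to the more restrictive one.

```python
def interpret_robots(content):
    """Return (may_index, may_follow) for a robots content value."""
    directives = {part.strip().lower() for part in content.split(",") if part.strip()}
    # "none" is shorthand for "noindex, nofollow".
    may_index = not (directives & {"noindex", "none"})
    may_follow = not (directives & {"nofollow", "none"})
    return may_index, may_follow
```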

noindex in robots.txt? Do Not Use It #

Sometimes you see rules like:

text

User-agent: *
Noindex: /internal/

Google does not support this rule.

In 2019, Google announced that it would stop honoring unsupported, unofficial robots.txt rules such as noindex, nofollow and crawl-delay.
Source: Google Search Central Blog – A note on unsupported rules in robots.txt

If you want to prevent indexing, use:

html

<meta name="robots" content="noindex">

or:

http

X-Robots-Tag: noindex

Not:

text

Noindex: /path/
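
A quick way to spot such lines is to scan the file for the rule names mentioned above. The sketch below assumes the robots.txt content is available as a string; the function name is hypothetical.

```python
# Rule names that Google does not support inside robots.txt.
UNSUPPORTED_RULES = {"noindex", "nofollow", "crawl-delay"}


def unsupported_rules(robots_txt):
    """Return (line_number, rule) pairs for unsupported robots.txt rules."""
    findings = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        rule = line.split("#", 1)[0].strip()  # drop trailing comments
        field, separator, _ = rule.partition(":")
        if separator and field.strip().lower() in UNSUPPORTED_RULES:
            findings.append((number, rule))
    return findings
```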

Using robots.txt Correctly #

The robots.txt file is useful when you want to control crawling.

Example:

text

User-agent: *
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /search/

Sitemap: https://example.ch/sitemap.xml

Typical useful applications:

exclude technical areas from crawling,
reduce internal search result crawling,
control low-value parameter areas,
reduce server load,
manage specific bots,
specifically allow or block crawlers for Artificial Intelligence (AI),
reference sitemaps.

Important: Do not generally block CSS or JavaScript files if Google needs them for rendering.

When to Use noindex, robots.txt, Canonical or Redirect? #

These tools serve different purposes.

Goal	Best method
Page should not appear in Google but remain accessible	noindex
Crawlers should not fetch the URL	robots.txt
Similar URLs should consolidate to the main version	Canonical
Old URL should permanently point to a new URL	301 redirect
Content has been permanently deleted	404 or 410
Protect sensitive content	Login / password protection
PDF should not be indexed	X-Robots-Tag: noindex

Prioritisation: Which Cases Are Really Critical? #

Not every noindex or robots.txt finding has the same relevance.

Situation	Priority	Why
Homepage set to noindex	High	entire visibility at risk
important service page set to noindex	High	commercial page disappears from Google
product/category pages set to noindex	High	revenue relevance
blog or guide directory blocked by robots.txt	High	content cannot be crawled
important URLs in sitemap but noindex	High	contradictory signals
robots.txt blocks CSS/JS (JavaScript) for rendering	High	Google may see the page incompletely
staging disallow on live site	High	entire website can become invisible
internal search noindex	Low to medium	often correct
cart/checkout noindex or disallow	Low	usually intentional
irrelevant filters noindex	Medium	depends on search potential
AI crawlers blocked	depends on strategy	relevant for AI Visibility, not necessarily classic search

The most important rule:

First check whether indexable, important, internally linked or sitemap-relevant pages are affected.

Content Error or Template Error? #

A good check distinguishes whether an issue affects individual pages or entire page types.

Individual Issue #

Examples:

a landing page was accidentally set to noindex,
a single article has an incorrect setting,
an old campaign page is blocked,
a single PDF has an incorrect header.

Solution: Fix the page or file.

Template or System Issue #

Examples:

all blog articles are noindex,
category pages inherit a staging setting,
robots.txt blocks an entire directory,
an SEO plugin sets global noindex,
shop filters are broadly blocked incorrectly,
the head template outputs incorrect meta robots,
CDN (Content Delivery Network, a network of distributed servers for faster delivery) or WAF (Web Application Firewall) blocks relevant bots.

Solution: Fix the template, CMS configuration, plugin, deployment process or infrastructure.

System issues usually have higher priority because they affect many URLs.
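
One way to tell the two apart is to group findings by site section. The following sketch assumes a mapping from URL to a noindex flag (for example from a crawl export); the function name and the use of the first path segment as "section" are illustrative choices.

```python
from collections import Counter
from urllib.parse import urlparse


def noindex_share_by_section(pages):
    """Map each top-level section to the share of its pages that carry noindex.

    pages maps a URL to True if that URL was found with noindex.
    """
    total = Counter()
    flagged = Counter()
    for url, has_noindex in pages.items():
        segments = [segment for segment in urlparse(url).path.split("/") if segment]
        section = "/" + segments[0] if segments else "/"
        total[section] += 1
        if has_noindex:
            flagged[section] += 1
    return {section: flagged[section] / total[section] for section in total}
```

A section where the share is close to 1.0 points to a template or configuration cause; an isolated flagged URL in an otherwise clean section points to an individual issue.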

Typical Causes of noindex Errors #

1. Staging Settings Were Carried Over #

During development, noindex is correct. On the live site, it is fatal.

Common causes:

WordPress setting “Discourage search engines”,
environment variable set incorrectly,
SEO plugin copied from staging,
robots rules not adjusted.
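
How an environment variable can drive this is sketched below. `APP_ENV` and the function name are hypothetical; real projects will have their own variable names and template mechanisms.

```python
import os


def robots_meta_tag(environment):
    """Return the robots meta tag for a deployment environment name."""
    if environment.strip().lower() == "production":
        return '<meta name="robots" content="index, follow">'
    # Any non-production environment is kept out of the index.
    return '<meta name="robots" content="noindex, nofollow">'


if __name__ == "__main__":
    # APP_ENV is a hypothetical variable name; use whatever your deployment defines.
    print(robots_meta_tag(os.environ.get("APP_ENV", "production")))
```

The fallback value matters: defaulting to "production" when the variable is missing avoids accidentally setting the live site to noindex, but it also means staging relies on the variable being set correctly, which is one more reason to protect staging with authentication.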

2. CMS or SEO Plugin Misconfigured #

A plugin can set entire content types to noindex:

categories,
tags,
author pages,
product archives,
custom post types,
search pages.

3. Template Outputs Incorrect Robots Meta Tags #

Example:

html

<meta name="robots" content="noindex">

is placed in the global head and therefore output on every page.

4. Canonical/noindex Conflicts #

A page is named as the main version via canonical, but is also noindex. That is contradictory.

5. JavaScript or Tag Manager Sets Meta Robots #

Meta robots should not be generated or changed client-side without control. For search engines, the directive must be reliably detectable.

Typical Causes of robots.txt Errors #

1. Global Disallow from Staging #

text

User-agent: *
Disallow: /

Useful on staging. Critical in production.

2. Wrong Directory Blocked #

text

Disallow: /blog/

when the blog should be indexed.

3. Assets Blocked #

text

Disallow: /assets/
Disallow: /js/
Disallow: /css/

This can make rendering and quality evaluation harder.

4. Parameters Blocked Too Broadly #

Sometimes filter or parameter pages are blocked so broadly that important landing pages are affected.

5. Crawler-Specific Rules Set Incorrectly #

Example:

text

User-agent: Googlebot
Disallow: /

This directly blocks Google and is usually fatal on a live site.

noindex and robots.txt for AI Crawlers #

Robots.txt does not only control search engine crawlers; it can also affect AI crawlers.

Examples of AI user agents:

If you block AI crawlers broadly, this can affect AI Visibility. At the same time, it can be strategically sensible to limit training while allowing live retrieval.

This topic is closely related to the article Controlling AI Crawlers: robots.txt & .

Important for this article:

robots.txt controls crawling.
noindex controls classic search indexing.
AI crawlers can be blocked via robots.txt.
Blocking AI crawlers is not automatically bad, but it should be intentional.
The normal Googlebot should not be accidentally blocked if classic Google visibility should be maintained.
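
To confirm that a rule aimed at one crawler does not affect another, each user agent can be checked separately. This sketch uses Python's `urllib.robotparser`; the rules are hypothetical, and `GPTBot` is used only as an example token, so check each vendor's documentation for the names it actually uses.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one AI crawler fully, everyone else only from /admin/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def access_by_agent(robots_txt, url, agents):
    """Return a mapping of user agent to whether it may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    print(access_by_agent(ROBOTS_TXT, "https://example.ch/blog/post", ["Googlebot", "GPTBot"]))
```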

What to Do After a RankScan Finding #

When RankScan reports “Noindex pages” or “Content blocked by robots.txt”, proceed in a structured way.

Step 1: Group Affected URLs #

Check:

individual URLs,
page types,
directories,
templates,
sitemaps,
product areas,
blog areas,
staging/test paths,
assets.

Step 2: Define the Indexing Goal #

For each group, ask:

Should this URL appear in Google?
Should this URL be crawled?
Should this URL only be accessible to users?
Should this URL be protected?

Only then decide between noindex, robots.txt, canonical, redirect or login.

Step 3: Identify Contradictions #

Watch for combinations such as:

Combination	Problem
noindex + in sitemap	contradictory
noindex + canonical to itself	unclear depending on goal
canonical to noindex target	problematic
robots.txt blocks page with noindex	Google cannot see noindex
robots.txt blocks CSS/JS	rendering can suffer
noindex on important landing page	visibility loss
Disallow on entire blog	content is not crawled
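
Several of these combinations can be detected mechanically once the signals per URL are collected. The sketch below assumes a simple record per page; `PageRecord`, its fields and the issue texts are illustrative, not RankScan's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    url: str
    noindex: bool = False
    in_sitemap: bool = False
    blocked_by_robots: bool = False
    canonical: Optional[str] = None  # canonical target URL, if declared


def find_contradictions(pages):
    """Return (url, issue) pairs for contradictory indexing signals."""
    noindex_urls = {page.url for page in pages if page.noindex}
    issues = []
    for page in pages:
        if page.noindex and page.in_sitemap:
            issues.append((page.url, "noindex but listed in sitemap"))
        if page.noindex and page.blocked_by_robots:
            issues.append((page.url, "robots.txt hides the noindex directive"))
        points_elsewhere = page.canonical and page.canonical != page.url
        if points_elsewhere and page.canonical in noindex_urls:
            issues.append((page.url, "canonical points to a noindex target"))
    return issues
```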

Step 4: Fix the Cause #

Typical actions:

remove noindex from important pages,
make robots.txt rules more precise,
remove staging rules from production,
clean up the sitemap,
remove noindex pages from the sitemap,
make CSS/JS crawlable again,
resolve canonical/noindex conflicts,
check SEO plugin configuration,
correct the template,
check WAF/CDN bot rules.

Step 5: Check in Google Search Console #

Use:

URL Inspection Tool,
page indexing,
crawl stats,
robots.txt report,
live test,
sitemap report.

Google Search Console shows, among other things, whether a URL is excluded by noindex or whether crawling is blocked by robots.txt.

Step 6: Crawl Again After the Fix #

After the correction, check:

Has noindex been removed from important pages?
Have noindex pages been removed from the sitemap?
Is robots.txt no longer too restrictive?
Are CSS/JS accessible?
Are important URLs indexable?
Is any content still blocked?
Has Google been prompted to recheck?

What a Good Indexing Check Looks For #

A good indexing check should do more than simply report “noindex found”.

A good check verifies:

pages with noindex,
important pages with noindex,
noindex pages in the sitemap,
noindex on canonical URLs,
canonical pointing to noindex target,
robots.txt blocking important content,
robots.txt blocking whole directories,
robots.txt blocking CSS/JS,
Disallow: / on the live site,
Googlebot-specific blocks,
AI crawler blocks,
noindex in the meta tag,
noindex in the X-Robots-Tag,
contradictory robots meta tags,
robots.txt + noindex conflicts,
status code and indexability,
mobile/desktop differences,
template or CMS patterns.

This turns “Noindex pages” and “Content blocked by robots.txt” into concrete website-health tasks.

Example: Invisible Relaunch Caused by noindex #

Situation #

A website is prepared on a staging system. The following is correctly set there:

html

<meta name="robots" content="noindex, nofollow">

After launch, this setting remains active.

RankScan reports:

“Noindex pages”
“Content blocked by robots.txt”
“Declining rankings”

Analysis #

important service pages are noindex,
several URLs are still in the sitemap,
robots.txt still contains Disallow: /staging-assets/,
some JS files are blocked.

Solution #

Remove noindex from production pages.
Correct the SEO plugin configuration.
Regenerate the sitemap.
Clean up robots.txt.
Allow CSS/JS.
Protect staging via HTTP Auth in future.
Use Google Search Console URL inspection.
Crawl again with RankScan.

Result #

The pages are crawlable and indexable again. The issue was not a content problem, but a technical deployment error.

Common Mistakes #

Mistake 1: Using robots.txt as Indexing Protection #

Disallow prevents crawling; it does not reliably prevent indexing.

Mistake 2: Combining noindex and robots.txt #

If Google is not allowed to crawl a page, Google cannot see noindex.

Mistake 3: Publishing the Live Site with Staging Rules #

A Disallow: / rule or a global noindex does not belong on the production website.

Mistake 4: Leaving noindex Pages in the Sitemap #

The sitemap should only contain URLs that are intended to be indexed.
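
This can be checked by comparing the sitemap against the set of noindex URLs. The sketch below parses a standard XML sitemap with Python's `xml.etree.ElementTree`; the function names and the sample data are assumptions, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap content, used only for illustration.
SAMPLE_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.ch/</loc></url>"
    "<url><loc>https://example.ch/thank-you/</loc></url>"
    "</urlset>"
)


def sitemap_urls(xml_text):
    """Return all <loc> values from an XML sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def noindex_urls_in_sitemap(xml_text, noindex_urls):
    """Return sitemap URLs that are also known to carry noindex."""
    return [url for url in sitemap_urls(xml_text) if url in noindex_urls]
```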

Mistake 5: Blocking CSS and JavaScript #

Google needs to be able to render pages. Important resources should be crawlable.

Mistake 6: Using nofollow Unnecessarily #

nofollow on internal pages can unnecessarily weaken internal link signals and crawling.

Mistake 7: Using robots.txt for Sensitive Data #

Robots.txt is public. Protection requires authentication or server-side restrictions.

Checklist: Checking noindex and robots.txt #

Use this checklist:

Are there important pages with noindex?
Are there noindex pages in the sitemap?
Is there Disallow: / on the live site?
Does robots.txt block important directories?
Does robots.txt block CSS, JS or images?
Are important pages crawlable?
Are pages that should leave the index crawlable and noindex?
Are there canonical/noindex conflicts?
Are there multiple contradictory robots meta tags?
Are X-Robots-Tags on PDFs set correctly?
Are staging pages protected by authentication?
Have SEO plugin and CMS settings been checked?
Has the setup been tested in Google Search Console?
Has the site been crawled again after the fix?

In addition, checking for 404 errors helps narrow down the cause and prioritize the next SEO actions.

FAQ (Frequently Asked Questions) About noindex and robots.txt #

What does noindex mean?

noindex means that a page should not be added to the search index or should be removed from it.

What is the difference between noindex and robots.txt?

robots.txt controls crawling. noindex controls indexing. A page must be crawlable so Google can see noindex.

Can I remove pages from Google with robots.txt?

Not reliably. robots.txt prevents crawling, but not necessarily indexing. To exclude a page from indexing, use noindex and allow crawling.

What does noindex, follow mean?

The page should not be indexed, but links on the page may be followed.

What does noindex, nofollow mean?

The page should not be indexed, and links on the page should not be followed.

Should I set internal search results to noindex?

Often yes. Internal search results often create thin or redundant pages without search value.

Should I block filter pages via robots.txt?

It depends on the setup. Some filters should be noindex or canonicalised; others can remain crawlable. If Google is supposed to see canonical or noindex, the URL must not be blocked by robots.txt.

How do I protect staging pages?

Ideally with login, HTTP Auth, IP protection or VPN. noindex alone is not robust enough for staging.

Why does Google still show a robots-blocked page?

If other pages link to it, Google can know the URL and may show it even if the content is not allowed to be crawled.

What does “Content blocked by robots.txt” mean in RankScan?

The insight means that robots.txt blocks content or resources that could be relevant for crawling, rendering or visibility.

What does “Noindex pages” mean in RankScan?

The insight means that pages are excluded from indexing via noindex. The next step is to check whether this is intentional or accidental.

Conclusion: Separate Crawling and Indexing Cleanly #

noindex and robots.txt are not interchangeable tools. If you confuse them, you risk removing important pages from Google or preventing Google from reading decisive signals in the first place.

The most important rule is:

robots.txt controls crawling. noindex controls indexing.

For RankScan, the insights “Noindex pages” and “Content blocked by robots.txt” are therefore high-priority signals. Not because every noindex page is wrong, but because accidental blocks can immediately cost visibility.

The best approach is:

group affected URLs,
define the indexing goal,
check noindex and robots.txt rules,
identify contradictions,
unblock important pages,
intentionally control unimportant pages,
align sitemaps, canonicals and internal links,
test in Google Search Console,
crawl again after the fix.

This turns indexing control from a risk into a controlled part of your website-health strategy.
