Robots.txt Guide: Rules, Mistakes & AI Crawlers

Robots.txt is a plain-text file placed at the root of your website that tells crawlers (search engines, AI systems, and other bots) which URLs they are allowed to crawl. It is one of the simplest files on the web, yet a single mistake can be costly: accidentally blocking Google from your entire site can cause your pages to drop out of search results within days.

This guide covers the correct robots.txt syntax, the most common mistakes, how to block AI crawlers like GPTBot and ClaudeBot, and how to test your file before deploying it.

How robots.txt works

Robots.txt is not a security mechanism — it is a courtesy protocol. Compliant crawlers (Google, Bing, and most reputable bots) read the file before crawling your site and follow its rules. Non-compliant crawlers ignore it entirely. You cannot use robots.txt to prevent a determined bad actor from accessing your content; it only governs bots that choose to respect it.

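The file must be served at yourdomain.com/robots.txt: exactly that path at the root of the host, never in a subdirectory. Each subdomain needs its own file. Google does not re-fetch the file on every request; it generally caches the rules for up to 24 hours, so a change can take up to a day to take effect.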
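
To make the location rule concrete, here is a minimal Python sketch that derives the robots.txt URL governing any page. The function name robots_url_for and the example URLs are illustrative assumptions, not part of any standard tool.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port), replace the path
    # with /robots.txt, and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url_for("https://example.com/blog/post?id=7"))
# https://example.com/robots.txt
print(robots_url_for("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```

Note that the two pages resolve to different files: a robots.txt on example.com does not apply to shop.example.com.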

Robots.txt syntax

The file is made up of blocks called “records,” each beginning with a User-agent line that identifies which bot the rules apply to, followed by one or more Disallow or Allow lines.

Allow all crawlers to access everything (most sites)

User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml

An empty Disallow value means no pages are blocked. This is the correct configuration for most public-facing websites. Adding a Sitemap directive helps crawlers discover your content faster.

Block crawlers from a specific directory

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://example.com/sitemap.xml

Block a specific crawler entirely

User-agent: BadBot
Disallow: /
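
One way to see how these three configurations behave is to run them through Python's standard-library parser, urllib.robotparser. This is only a sketch: the bot names and URLs are made up for illustration, and Python's parser is not identical to Google's matcher (for example, it has historically applied rules in file order rather than by longest match), so treat it as a sanity check rather than a definitive answer.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


ALLOW_ALL = """\
User-agent: *
Disallow:
"""

BLOCK_DIRS = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
"""

BLOCK_ONE_BOT = """\
User-agent: BadBot
Disallow: /
"""

# An empty Disallow blocks nothing.
print(parse_rules(ALLOW_ALL).can_fetch("AnyBot", "https://example.com/any/page"))     # True

# Directory rules block only URLs under those directories.
dirs = parse_rules(BLOCK_DIRS)
print(dirs.can_fetch("AnyBot", "https://example.com/admin/users"))   # False
print(dirs.can_fetch("AnyBot", "https://example.com/blog/post"))     # True

# A named group affects only the bot it names.
one_bot = parse_rules(BLOCK_ONE_BOT)
print(one_bot.can_fetch("BadBot", "https://example.com/"))           # False
print(one_bot.can_fetch("OtherBot", "https://example.com/"))         # True
```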

How to block AI crawlers

Several AI companies have deployed their own web crawlers to collect training data. If you want to prevent your content from being used to train AI models, you can block these crawlers by name in robots.txt:

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml

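Important: these rules only affect compliant crawlers. OpenAI’s GPTBot and Anthropic’s ClaudeBot are documented as respecting robots.txt. Google-Extended is not a separate crawler but a control token: Google’s existing crawlers read it to decide whether your content may be used for AI training, and blocking it does not affect Google Search. However, many third-party scrapers that may supply AI training data do not respect robots.txt at all. Robots.txt is a meaningful signal, not a guarantee of enforcement.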
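
Before relying on a file like the one above, it can help to confirm that the named AI crawlers are blocked while ordinary search crawlers are not. The sketch below does this with Python's urllib.robotparser; the helper name access_by_agent and the test URL are assumptions for illustration, and the check only models how a compliant parser reads the rules.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def access_by_agent(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user-agent to whether the rules let it fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


AGENTS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot", "Googlebot", "Bingbot"]
report = access_by_agent(ROBOTS_TXT, AGENTS, "https://example.com/blog/post")
for agent, allowed in report.items():
    print(f"{agent:16} {'allowed' if allowed else 'blocked'}")
# The four AI tokens print "blocked"; Googlebot and Bingbot print "allowed".
```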

Robots.txt vs noindex: what is the difference?

This is one of the most common points of confusion in SEO:

Disallow in robots.txt — prevents the crawler from visiting the page. If the page was already indexed before you added the Disallow rule, it may remain in the index with no title or description because Google can no longer access the content.
noindex meta tag — allows the crawler to visit the page but instructs it not to include the page in search results. This is the correct way to remove a page from Google’s index.

Never use robots.txt to block pages you also want deindexed. Use <meta name="robots" content="noindex"> on those pages instead, and leave robots.txt open so Googlebot can read the noindex instruction.
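
The conflict described above can be detected mechanically: a page that carries a noindex tag but is also disallowed in robots.txt will never have that tag read. The following is a rough sketch using only the Python standard library; the names RobotsMetaFinder and noindex_is_hidden, the sample HTML, and the default "Googlebot" agent are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether the page has <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots" and "noindex" in attr.get("content", "").lower():
            self.noindex = True


def noindex_is_hidden(robots_txt: str, url: str, html: str, agent: str = "Googlebot") -> bool:
    """True when the page asks for noindex but robots.txt stops the crawler from reading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex and not parser.can_fetch(agent, url)


PAGE = '<html><head><meta name="robots" content="noindex"></head><body>Old page</body></html>'
BLOCKED = "User-agent: *\nDisallow: /old/\n"
OPEN = "User-agent: *\nDisallow:\n"

print(noindex_is_hidden(BLOCKED, "https://example.com/old/page", PAGE))  # True: the tag can never be read
print(noindex_is_hidden(OPEN, "https://example.com/old/page", PAGE))     # False: the crawler can see noindex
```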

Common robots.txt mistakes

Disallow: / for all user-agents: This blocks every crawler from every page on your site. If deployed accidentally (common after CMS migrations), it can wipe your entire organic presence within days.
Blocking CSS and JavaScript files: If Googlebot cannot access your stylesheets and scripts, it cannot render your pages accurately, which harms how Google evaluates your content. Never disallow /assets/, /static/, or similar resource directories.
Trailing slash inconsistency: Disallow: /admin and Disallow: /admin/ behave differently. Without a trailing slash, the rule applies to any URL starting with /admin, including /administrator. Use trailing slashes for directories.
No Sitemap directive: Including your sitemap URL in robots.txt is optional but recommended. It gives crawlers a fast path to your full page index without having to discover pages through internal links alone.
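
The four mistakes above lend themselves to a simple automated check. The sketch below is a deliberately naive linter written for this guide, not an established tool: the function name lint_robots, the list of asset directory names, and the trailing-slash heuristic are all assumptions, and it will produce false positives on some legitimate files.

```python
def lint_robots(robots_txt: str) -> list[str]:
    """Return warnings for the common mistakes listed above (heuristic sketch)."""
    warnings = []
    agents = []           # user-agents named by the group currently being read
    in_rules = False      # True once the current group has an Allow/Disallow line
    has_sitemap = False
    asset_dirs = ("/assets", "/static", "/css", "/js")  # assumed directory names

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()

        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field == "sitemap":
            has_sitemap = True
        elif field in ("allow", "disallow"):
            in_rules = True
            if field != "disallow":
                continue
            if value == "/" and "*" in agents:
                warnings.append("Disallow: / under User-agent: * blocks the whole site")
            if value.startswith(asset_dirs):
                warnings.append(f"Disallow: {value} may block CSS or JavaScript needed for rendering")
            last_segment = value.rsplit("/", 1)[-1]
            looks_like_dir = "." not in last_segment and "*" not in value and "$" not in value
            if value not in ("", "/") and not value.endswith("/") and looks_like_dir:
                warnings.append(f"Disallow: {value} has no trailing slash and also matches {value}*")

    if not has_sitemap:
        warnings.append("No Sitemap directive found")
    return warnings


for warning in lint_robots("User-agent: *\nDisallow: /admin\nDisallow: /static/\n"):
    print(warning)
```

The trailing-slash warning reflects prefix matching: a rule of /admin also matches /administrator, because both URLs start with the same characters.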

How to generate and test your robots.txt

The SlugGenius Robots.txt Generator includes preset configurations for the most common scenarios: allow all crawlers, block all crawlers, or block AI crawlers specifically. You can select multiple user-agents, add a crawl delay (note that Googlebot ignores the Crawl-delay directive), and set your sitemap URL. The tool outputs a valid, ready-to-deploy robots.txt file with no syntax errors.

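After deploying your file, check it in Google Search Console’s robots.txt report (found under Settings › robots.txt), which lists the robots.txt files Google found for your site, when each was last fetched, and any warnings or errors. Google retired its legacy standalone robots.txt tester in 2023, so to test individual URLs against your rules, run the file through a robots.txt parser or a third-party testing tool, ideally before you deploy. Either way, confirm that the rules behave exactly as intended before Google re-crawls your site.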
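
Testing individual URLs can also be scripted so that it runs before every deployment. The sketch below is one possible approach, assuming Python's urllib.robotparser is an acceptable stand-in for a real crawler's matcher; the helper name failed_expectations, the candidate file, and the expectation table are hypothetical and should be replaced with your own.

```python
from urllib.robotparser import RobotFileParser


def failed_expectations(robots_txt, expectations):
    """Return the (agent, url, should_allow) rows that the rules do not satisfy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, url, should_allow)
        for agent, url, should_allow in expectations
        if parser.can_fetch(agent, url) != should_allow
    ]


# The file you are about to deploy (illustrative content).
CANDIDATE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

# What you expect the file to do: (user-agent, URL, should be allowed?).
EXPECTATIONS = [
    ("Googlebot", "https://example.com/", True),
    ("Googlebot", "https://example.com/admin/login", False),
    ("GPTBot", "https://example.com/blog/post", False),
]

failures = failed_expectations(CANDIDATE, EXPECTATIONS)
if failures:
    raise SystemExit(f"robots.txt does not behave as expected: {failures}")
print("robots.txt matches all expectations")
```

Keeping the expectation table in version control means an accidental Disallow: / fails the check instead of reaching production.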

A correct robots.txt is invisible — it causes no problems and you never think about it. An incorrect one can silently destroy months of SEO work. Get it right once and revisit it any time you restructure your site.
