How to Block GPTBot, CCBot, and ClaudeBot (And When Not To) | SlugGenius

Since large language models began training on web-scraped data at scale, most major AI companies have published dedicated crawler user-agents and state that those crawlers honor robots.txt directives. If you want to opt your site out of AI training data collection, or out of being summarized by an AI answer engine without attribution, robots.txt is the primary lever available to you.

The major AI crawler user-agents

GPTBot: OpenAI's crawler, used to collect training data for future models.
ChatGPT-User: a separate OpenAI agent used when ChatGPT fetches a page live in response to a user's request, distinct from training crawls.
CCBot: operated by Common Crawl, whose dataset is widely used as training data by many AI labs, not just one company.
Google-Extended: a product token rather than a separate crawler. It controls whether content Google has already crawled can be used to train Gemini models. It is independent of the Googlebot rules that control search indexing, and Search features such as AI Overviews follow those Googlebot rules instead.
ClaudeBot / anthropic-ai: Anthropic's crawler tokens for collecting Claude's training data.
PerplexityBot: used by Perplexity to fetch and summarize pages in response to user queries.

How to block them

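Each crawler is blocked the same way as any other user-agent in robots.txt: a dedicated User-agent group containing Disallow: /. The example below covers three of the tokens listed above; the rest, such as ClaudeBot, follow the same pattern: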

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

The Robots.txt Generator includes a one-click preset that adds these disallow rules for all major AI crawlers, so you don't need to track each user-agent string manually. That is useful because the list keeps growing as new labs launch crawlers.
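Whether the rules are hand-written or generated, it can help to check them before deploying. The sketch below feeds the example rules from this section to Python's standard-library urllib.robotparser and reports which agents would be allowed to fetch a page. The helper name allowed_agents and the example.com URL are placeholders for illustration, and the standard-library parser only approximates how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser

# The same rules shown in this section.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Map each agent name to True if the rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # Hypothetical page URL; any path works because the rules disallow "/".
    page = "https://example.com/some-page"
    agents = ["GPTBot", "CCBot", "Google-Extended", "ChatGPT-User", "Googlebot"]
    for agent, allowed in allowed_agents(ROBOTS_TXT, agents, page).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Agents with no matching group, such as ChatGPT-User and Googlebot here, are reported as allowed because nothing in the file applies to them.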

The important caveat: robots.txt is voluntary

Robots.txt is an honor-system protocol. It works because reputable crawlers choose to respect it; there is no technical enforcement mechanism stopping a bot from ignoring the file entirely, and some smaller or less scrupulous scrapers do exactly that. If you need a harder guarantee, server-level blocking by IP range or user-agent string (via a CDN or web server config) is a stronger control. It requires more maintenance as crawler IP ranges change, and user-agent matching alone only stops bots that identify themselves honestly.
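The server-level idea can be sketched at the application layer. The example below assumes a Python WSGI app, which is an assumption of this sketch; in practice a CDN or web server rule would usually do the same job. The names BLOCKED_AGENT_TOKENS, BLOCKED_NETWORKS, is_blocked, and block_ai_crawlers are hypothetical, and the IP range is a reserved documentation range standing in for real crawler ranges, which would have to come from each operator's published list.

```python
import ipaddress

# Assumption: tokens matched case-insensitively inside the User-Agent header.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot", "claudebot")

# Assumption: placeholder range from the reserved documentation block
# (TEST-NET-1). Replace with ranges published by the crawler operators.
BLOCKED_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)


def is_blocked(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches a blocked token or network."""
    ua = user_agent.lower()
    if any(token in ua for token in BLOCKED_AGENT_TOKENS):
        return True
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # missing or malformed address: let the app handle it
    return any(addr in net for net in BLOCKED_NETWORKS)


def block_ai_crawlers(app):
    """Wrap a WSGI app so blocked requests get a 403 before reaching it."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        remote_addr = environ.get("REMOTE_ADDR", "")
        if is_blocked(user_agent, remote_addr):
            headers = [("Content-Type", "text/plain; charset=utf-8")]
            start_response("403 Forbidden", headers)
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware
```

Wrapping an existing app is then a one-liner, for example application = block_ai_crawlers(application). Behind a proxy or CDN, REMOTE_ADDR is typically the proxy's address, so the IP check only makes sense where the real client address is available.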

Why you might not want to block everything

Blocking training crawlers (GPTBot, CCBot, ClaudeBot) and blocking retrieval crawlers (ChatGPT-User, PerplexityBot) have very different consequences. Blocking training crawlers opts your content out of future model training and has no direct effect on your traffic. Blocking retrieval crawlers means your page can no longer be fetched, summarized, or cited when a user asks an AI assistant a question your content answers. For a lot of sites, that is a referral traffic channel worth keeping open rather than closing off by default.

This is precisely the inverse problem of llms.txt: a well-formed llms.txt file proactively invites AI systems to cite your best content, while a blanket robots.txt block does the opposite. Most sites are better served by a deliberate middle ground — disallow training-only crawlers if you're protective of your content, but leave retrieval crawlers open so you remain citable in AI-generated answers.
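As a minimal sketch of that middle ground, the script below generates a robots.txt that disallows only the training crawlers named in this article and writes no rule for the retrieval crawlers, which leaves them free to fetch pages. The function name build_robots_txt and the two lists are illustrative assumptions; adjust the lists to match your own policy.

```python
# Grouping taken from this article; edit to match your own policy.
TRAINING_CRAWLERS = ["GPTBot", "CCBot", "ClaudeBot"]
RETRIEVAL_CRAWLERS = ["ChatGPT-User", "PerplexityBot"]  # intentionally left open


def build_robots_txt(blocked_agents):
    """Return robots.txt text with one 'Disallow: /' group per blocked agent."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    # Only training crawlers get a group; retrieval crawlers have no rule,
    # so nothing in the file restricts them.
    print(build_robots_txt(TRAINING_CRAWLERS), end="")
```

Keeping the two lists separate makes the policy explicit: moving an agent from one list to the other is the whole change when you revisit the decision.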

There's no universally correct answer here; it depends on whether you view AI visibility as a distribution channel worth optimizing for (the GEO, or generative engine optimization, approach this site is built around) or as something to opt out of entirely. Decide deliberately, rather than defaulting to whatever preset looked safest.
