AI companies train models on crawled web content, and most expose a robots.txt-compatible user-agent so site owners can opt out. Here's how blocking works, and why blocking isn't always the right call.
Since large language models began training on web-scraped data at scale, most major AI companies have published a dedicated crawler user-agent that they say respects robots.txt directives. If you want to opt your site out of AI training data collection, or out of being summarized by an AI answer engine without attribution, robots.txt is the primary lever available to you.
Each crawler is blocked the same way as any other user-agent in robots.txt, with a dedicated User-agent group that contains Disallow: /:
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
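To confirm that rules like these do what you expect before deploying them, you can parse them locally. The following is a minimal sketch using Python's standard-library urllib.robotparser; the example.com URL and the is_allowed helper are illustrative assumptions, and real crawlers may match user-agents slightly differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The same three groups shown above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

# Hypothetical page URL, used only for the check.
PAGE = "https://example.com/articles/some-post"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "CCBot", "Google-Extended", "Googlebot"):
    # The three listed agents are disallowed; Googlebot has no matching
    # group and no wildcard group exists, so it stays allowed.
    print(agent, is_allowed(ROBOTS_TXT, agent, PAGE))
```

Agents that are not named in any group, such as regular search crawlers, are unaffected by these rules.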
The Robots.txt Generator includes a one-click preset that adds these disallow rules for all major AI crawlers, so you don't have to track each user-agent string manually. That is useful because the list keeps growing as new labs launch crawlers.
Robots.txt is an honor-system protocol. It works because reputable crawlers choose to respect it — there is no technical enforcement mechanism stopping a bot from ignoring the file entirely. Some smaller or less scrupulous scrapers do exactly that. If you need a harder guarantee, server-level blocking by IP range or user-agent string (via a CDN or web server config) is a stronger control, though it requires more maintenance as crawler IP ranges change.
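As a rough illustration of user-agent blocking at the server level, here is a minimal sketch written as a WSGI middleware using only the Python standard library. The token list, function names, and demo app are assumptions for illustration, not a recommended production setup; in practice this rule would more often live in a CDN or web server config. Note that a user-agent header is trivially spoofed, and that Google-Extended is a robots.txt control token rather than a string sent in request headers, so it cannot be matched this way.

```python
import re

# Assumed token list for illustration; extend it to match your own policy.
BLOCKED_AGENT_TOKENS = ("GPTBot", "CCBot")

_BLOCKED = re.compile(
    "|".join(re.escape(token) for token in BLOCKED_AGENT_TOKENS),
    re.IGNORECASE,
)


def is_blocked_agent(user_agent: str) -> bool:
    """Return True if the User-Agent header contains a blocked token."""
    return bool(_BLOCKED.search(user_agent or ""))


def block_ai_crawlers(app):
    """Wrap a WSGI app so requests from blocked agents get a 403."""
    def middleware(environ, start_response):
        if is_blocked_agent(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real site."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the wrapped demo app; return its status."""
    seen = []
    wrapped = block_ai_crawlers(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent},
            lambda status, headers: seen.append(status))
    return seen[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))  # 403 Forbidden
print(status_for("Mozilla/5.0 Firefox/120.0"))             # 200 OK
```

Unlike robots.txt, this refuses the request outright, but it only stops bots that identify themselves honestly. Blocking by IP range covers the rest at the cost of the maintenance mentioned above.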
Blocking training crawlers (GPTBot, CCBot, ClaudeBot) and blocking retrieval crawlers (ChatGPT-User, PerplexityBot) have very different consequences. Blocking training crawlers opts your content out of future model training and has no direct effect on your traffic. Blocking retrieval crawlers means your page can no longer be fetched, summarized, or cited when a user asks an AI assistant a question your content answers. For many sites that is a referral traffic channel worth keeping open rather than closing off by default.
This is the inverse of llms.txt: a well-formed llms.txt file proactively invites AI systems to cite your best content, while a blanket robots.txt block does the opposite. Most sites are better served by a deliberate middle ground: disallow training-only crawlers if you're protective of your content, but leave retrieval crawlers open so you remain citable in AI-generated answers.
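One way to keep that middle ground explicit is to generate robots.txt from a categorized list instead of editing it by hand. The sketch below assumes the training and retrieval groupings named in this article; the function names and the example.com URL are hypothetical, and the crawler lists will need updating as vendors add or rename agents.

```python
from urllib.robotparser import RobotFileParser

# Categories as described in this article; treat them as a starting point.
TRAINING_CRAWLERS = ("GPTBot", "CCBot", "ClaudeBot")
RETRIEVAL_CRAWLERS = ("ChatGPT-User", "PerplexityBot")


def build_robots_txt(blocked_agents) -> str:
    """Emit one 'Disallow: /' group per blocked agent, separated by blank lines."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in blocked_agents]
    return "\n\n".join(groups) + "\n"


def can_fetch(robots_txt: str, agent: str, url: str = "https://example.com/") -> bool:
    """Check a generated file with the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Middle-ground policy: block training-only crawlers, leave retrieval open.
policy = build_robots_txt(TRAINING_CRAWLERS)
print(policy)

for agent in TRAINING_CRAWLERS + RETRIEVAL_CRAWLERS:
    print(agent, "allowed" if can_fetch(policy, agent) else "blocked")
```

Because retrieval crawlers are simply never listed, they fall through to the default of being allowed, which makes the policy a matter of choosing which list to pass to build_robots_txt.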
There's no universally correct answer here; it depends on whether you view AI visibility as a distribution channel worth optimizing for (the GEO approach this site is built around) or as something to opt out of entirely. Decide deliberately, rather than defaulting to whatever preset looked safest.