How to Block AI Crawlers with Robots.txt
Learn how to block AI crawlers like GPTBot, CCBot, and Bytespider using robots.txt directives. Includes a full AI bot list and monitoring tips.
Last updated: 2026-02-17
Why AI Crawlers Are Hitting Your Site
Every day, AI companies send bots to crawl the open web. They scrape your content to train large language models, build search indexes, or power AI-generated answers. Unlike traditional search engine crawlers that send you traffic in return, most AI crawlers take your content and give nothing back.
The result: your server resources get consumed, your original content gets absorbed into AI models, and you have zero control over how it gets used downstream. For publishers, SaaS companies, and anyone who creates original content, this is a real problem.
The good news is that you can block most of these bots using the same robots.txt file you already maintain for search engines. The bad news is that not all AI crawlers respect robots.txt, and keeping track of new bots requires ongoing attention.
Which AI Crawlers to Block
The AI crawler landscape changes fast. Here are the major bots you should know about, grouped by the company behind them.
| Bot Name | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for GPT models |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT search results |
| CCBot | Common Crawl | Open dataset used by many AI labs |
| Google-Extended | Google | Training data for Gemini / Bard |
| anthropic-ai | Anthropic | Training data for Claude models |
| ClaudeBot | Anthropic | Web fetching for Claude features |
| Bytespider | ByteDance | Training data for TikTok AI products |
| Amazonbot | Amazon | Alexa / AI assistant training |
| FacebookBot | Meta | AI training and content indexing |
| PerplexityBot | Perplexity | AI-powered search answers |
| Cohere-ai | Cohere | Enterprise LLM training |
| Applebot-Extended | Apple | Apple Intelligence training data |
This list is not exhaustive. New bots appear regularly, and some scrapers do not identify themselves honestly in their user-agent strings.
How to Block AI Crawlers in Robots.txt
The robots.txt file sits at the root of your domain (e.g., https://example.com/robots.txt). Search engine and AI crawlers check this file before crawling your pages. To block a specific bot, you add a User-agent directive followed by a Disallow rule.
Block All AI Crawlers
To block every known AI crawler, add a block for each bot's user-agent string. There is no wildcard that targets only AI bots without also affecting search engines, so you need to list them individually.
```txt
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Cohere-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```
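Maintaining a long list of stanzas by hand invites copy-paste mistakes. One option is to generate the block from a single list, so adding a newly discovered bot is a one-line change. A minimal sketch (the `AI_BOTS` list mirrors the table above; the function name is illustrative):

```python
# Generate the "block everything" section of robots.txt from one list,
# so adding or removing a bot is a one-line change.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "CCBot",
    "Google-Extended", "anthropic-ai", "ClaudeBot", "Bytespider",
    "Amazonbot", "FacebookBot", "PerplexityBot", "Cohere-ai",
    "Applebot-Extended",
]

def render_blocklist(bots):
    """Return robots.txt stanzas disallowing the site root for each bot."""
    stanzas = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "# Block AI training crawlers\n" + "\n\n".join(stanzas) + "\n"

print(render_blocklist(AI_BOTS))
```

Running this script as part of your deploy pipeline also gives you one place to review when auditing which bots are blocked.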
Block AI Crawlers from Specific Directories
If you want to allow AI bots to access some content (like your marketing pages) but protect blog posts or documentation, use targeted disallow rules.
```txt
User-agent: GPTBot
Disallow: /blog/
Disallow: /docs/
Disallow: /resources/

User-agent: CCBot
Disallow: /blog/
Disallow: /docs/
Disallow: /resources/
```
Allow Only Specific AI Bots
Maybe you want to block most AI crawlers but allow one that sends referral traffic. You can mix Allow and Disallow directives for the same bot.
```txt
User-agent: PerplexityBot
Allow: /blog/
Disallow: /

User-agent: GPTBot
Disallow: /
```
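Before deploying mixed Allow/Disallow rules, it is worth checking how they actually evaluate. Python's standard-library `urllib.robotparser` can do this offline. One caveat: Python's parser applies rules in file order (first match wins), while Google documents longest-match precedence, so list the more specific Allow before the broad Disallow, as done here:

```python
from urllib import robotparser

# The mixed Allow/Disallow rules from above, as a string.
RULES = """\
User-agent: PerplexityBot
Allow: /blog/
Disallow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# PerplexityBot may fetch blog posts but nothing else.
print(rp.can_fetch("PerplexityBot", "/blog/some-post"))  # True
print(rp.can_fetch("PerplexityBot", "/pricing"))         # False
# GPTBot is blocked everywhere.
print(rp.can_fetch("GPTBot", "/blog/some-post"))         # False
```

This is a quick sanity check, not a guarantee of how any particular crawler interprets your file.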
Robots.txt is a voluntary protocol. Well-behaved bots from major companies respect it, but rogue scrapers and smaller AI startups may ignore it entirely. Robots.txt is your first line of defense, not your only one.
Monitor Your Robots.txt
Site Watcher tracks your robots.txt for unauthorized changes and alerts you when something shifts. Free for up to 3 targets.
Beyond Robots.txt: Other Blocking Methods
Robots.txt works for compliant bots. For everything else, you need additional layers.
Server-Level Blocking
You can block user-agent strings at the web server level. In Nginx:
```nginx
if ($http_user_agent ~* (GPTBot|CCBot|Bytespider|anthropic-ai|ClaudeBot)) {
    return 403;
}
```
In Apache .htaccess:
```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider|anthropic-ai|ClaudeBot) [NC]
RewriteRule .* - [F,L]
```
This returns a 403 Forbidden response instead of relying on the bot to voluntarily obey robots.txt.
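If you cannot touch the web server configuration, the same check can run in application middleware. A sketch of the equivalent logic in Python (the function name is illustrative; the regex mirrors the Nginx/Apache rules above):

```python
import re

# Same pattern as the Nginx/Apache rules above, case-insensitive.
AI_BOT_PATTERN = re.compile(
    r"GPTBot|CCBot|Bytespider|anthropic-ai|ClaudeBot", re.IGNORECASE
)

def is_blocked(user_agent: str) -> bool:
    """Return True when the User-Agent matches a blocked AI crawler."""
    return bool(AI_BOT_PATTERN.search(user_agent or ""))

# Example: decide whether to return 403 for an incoming request.
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.2)"))   # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # False
```

Wherever you hook this in, return 403 for matches, the same way the server-level rules do.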
Rate Limiting
If you do not want to block AI crawlers entirely, rate limiting can reduce their impact on your server. Tools like Cloudflare Bot Management, AWS WAF, or Nginx's limit_req module can throttle requests from known AI bot user agents or IP ranges.
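As one concrete approach, Nginx can key a rate-limit zone on the user agent via a map. The sketch below is illustrative, not a recommendation: the zone name, size, and rate are assumptions to adjust for your traffic.

```nginx
# Sketch (http context): throttle known AI bots instead of blocking them.
map $http_user_agent $ai_bot_key {
    default "";                              # empty key = not rate limited
    ~*(GPTBot|CCBot|Bytespider|ClaudeBot) $binary_remote_addr;
}

limit_req_zone $ai_bot_key zone=ai_bots:10m rate=1r/s;

server {
    location / {
        limit_req zone=ai_bots burst=5 nodelay;
        # ... your usual proxy_pass / root configuration
    }
}
```

Requests whose key evaluates to an empty string are not counted against the limit, so normal visitors are unaffected.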
AI-Specific Meta Tags
Some AI companies also respect page-level meta tags:
```html
<meta name="robots" content="noai, noimageai">
```
Note that noai and noimageai are non-standard directives. Google supports the nosnippet and max-snippet directives, which limit how much of your content can appear in snippets and AI-generated answers. However, support for these tags varies widely across AI providers.
The Pros and Cons of Blocking AI Crawlers
Before you block everything, consider the tradeoffs.
| Pros of Blocking | Cons of Blocking |
|---|---|
| Protects original content from unauthorized use | Your content won't appear in AI-powered search answers |
| Reduces server load from aggressive crawlers | Competitors who allow crawling may get cited instead |
| Maintains control over content distribution | Some AI search engines (Perplexity, SearchGPT) drive traffic |
| Prevents content from being used without attribution | Blocking CCBot may affect non-AI services that use Common Crawl |
The right decision depends on your business model. If you sell content (media, research, courses), blocking makes sense. If you rely on organic discovery and AI search is becoming a referral channel, selective blocking may be smarter.
Monitoring AI Crawler Activity
Blocking AI crawlers is only useful if you verify that your blocks are working. Here is what to monitor.
- Robots.txt integrity: confirm your AI bot directives are still present and have not been overwritten by a deployment or CMS update.
- Server access logs: watch for requests from AI bot user-agents; a bot that keeps appearing after being disallowed is ignoring robots.txt.
- Crawl budget impact: track whether blocking aggressive crawlers actually reduces server load and bandwidth usage.
- New bot detection: flag unfamiliar user-agent strings that may belong to newly launched AI crawlers.
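Access-log monitoring can be as simple as a script that counts hits per AI bot. A sketch assuming the common combined log format, where the user agent is the final quoted field (the function name and bot list are illustrative):

```python
import re
from collections import Counter

# User-agent substrings worth flagging; extend as new bots appear.
AI_BOT_NAMES = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot"]

# Combined log format: the user agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_ai_bot_hits(log_lines):
    """Count requests per AI bot seen in access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in AI_BOT_NAMES:
            if bot.lower() in ua:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [17/Feb/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '5.6.7.8 - - [17/Feb/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 Chrome/120.0"',
]
print(count_ai_bot_hits(sample))
```

If a bot you disallowed weeks ago still shows nonzero counts, it is ignoring robots.txt and needs a server-level block.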
How to Audit Your Current Robots.txt
If you are not sure what your current robots.txt allows, run through this quick checklist.
1. Fetch your robots.txt. Open https://yourdomain.com/robots.txt in a browser and confirm it loads correctly rather than returning a 404 or an empty response.
2. Check for AI bot directives. Verify that every AI crawler you intend to block has its own User-agent block with a Disallow rule.
3. Validate the syntax. Make sure each User-agent line is followed by at least one Disallow directive. A User-agent block with no rules is treated as allowing everything.
4. Test with Google's tool. The robots.txt report in Search Console shows how Google fetches and parses your file and flags syntax errors.
5. Set up monitoring. Alert on any change to the file so an accidental overwrite does not silently remove your AI bot rules.
Keeping Your AI Crawler Blocks Current
The AI crawler ecosystem is evolving quickly. A robots.txt file that was comprehensive six months ago may be missing five new bots today. Here is a practical approach.
Maintain a master list of AI crawler user-agent strings. Review it monthly. Cross-reference it against published bot directories from OpenAI, Google, Anthropic, and others. When a new major AI product launches, check whether it comes with a new crawler.
Automate what you can. Set up alerts for changes to your robots.txt file so you know immediately if a deployment overwrites your AI bot rules. Monitor your server logs for unfamiliar user-agent strings that might indicate new AI crawlers.
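A minimal version of that alerting is a fingerprint comparison: hash the file on each fetch and compare against the last known-good hash. A sketch (function names are illustrative; the fetch step and alert delivery are left to your stack):

```python
import hashlib

def robots_fingerprint(robots_txt: str) -> str:
    """Stable SHA-256 fingerprint of a robots.txt body."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint: str, current_body: str) -> bool:
    """True when the fetched robots.txt no longer matches the stored hash."""
    return robots_fingerprint(current_body) != previous_fingerprint

# Example: compare yesterday's snapshot against today's fetch.
baseline = robots_fingerprint("User-agent: GPTBot\nDisallow: /\n")
print(has_changed(baseline, "User-agent: GPTBot\nDisallow: /\n"))  # False
print(has_changed(baseline, "User-agent: *\nAllow: /\n"))          # True
```

Run the comparison on a schedule (cron, CI, or a monitoring service) and alert on any mismatch, then store the new hash once the change is confirmed intentional.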
The companies behind these bots update their crawling practices regularly. OpenAI, for example, has introduced multiple user-agent strings over time (GPTBot, then ChatGPT-User, then OAI-SearchBot). Each one needs its own entry in your robots.txt.
Blocking AI crawlers starts with robots.txt, but staying protected requires ongoing monitoring to catch new bots, accidental rule deletions, and crawlers that ignore the protocol entirely.
Keep Your AI Crawler Blocks Working
Site Watcher monitors your robots.txt file for changes and alerts you when directives are added, removed, or overwritten. $39/mo unlimited targets, free for 3.