How to Block AI Crawlers with Robots.txt
Learn how to block AI crawlers like GPTBot, CCBot, and Bytespider using robots.txt directives. Includes a full AI bot list and monitoring tips.
Last updated: 2026-02-17
Why AI Crawlers Are Hitting Your Site
Every day, AI companies send bots to crawl the open web. They scrape your content to train large language models, build search indexes, or power AI-generated answers. Unlike traditional search engine crawlers that send you traffic in return, most AI crawlers take your content and give nothing back.
The result: your server resources get consumed, your original content gets absorbed into AI models, and you have zero control over how it gets used downstream. For publishers, SaaS companies, and anyone who creates original content, this is a real problem.
The good news is that you can block most of these bots using the same robots.txt file you already maintain for search engines. The bad news is that not all AI crawlers respect robots.txt, and keeping track of new bots requires ongoing attention.
Which AI Crawlers to Block
The AI crawler landscape changes fast. Here are the major bots you should know about, grouped by the company behind them.
| Bot Name | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training data for GPT models |
| ChatGPT-User | OpenAI | Real-time browsing for ChatGPT |
| OAI-SearchBot | OpenAI | SearchGPT / ChatGPT search results |
| CCBot | Common Crawl | Open dataset used by many AI labs |
| Google-Extended | Google | Training data for Gemini / Bard |
| anthropic-ai | Anthropic | Training data for Claude models |
| ClaudeBot | Anthropic | Web fetching for Claude features |
| Bytespider | ByteDance | Training data for TikTok AI products |
| Amazonbot | Amazon | Alexa / AI assistant training |
| FacebookBot | Meta | AI training and content indexing |
| PerplexityBot | Perplexity | AI-powered search answers |
| Cohere-ai | Cohere | Enterprise LLM training |
| Applebot-Extended | Apple | Apple Intelligence training data |
This list is not exhaustive. New bots appear regularly, and some scrapers do not identify themselves honestly in their user-agent strings.
How to Block AI Crawlers in Robots.txt
The robots.txt file sits at the root of your domain (e.g., https://example.com/robots.txt). Search engine and AI crawlers check this file before crawling your pages. To block a specific bot, you add a User-agent directive followed by a Disallow rule.
Block All AI Crawlers
To block every known AI crawler, add a block for each bot's user-agent string. There is no wildcard that targets only AI bots without also affecting search engines, so you need to list them individually.
```txt
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Cohere-ai
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```
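Maintaining a long list of stanzas by hand invites copy-paste mistakes. One option is to generate the block from a single list, so adding a newly discovered bot is a one-line change. A minimal sketch (the `AI_BOTS` list mirrors the table above; the function name is illustrative):

```python
# Generate the "block everything" section of robots.txt from one list,
# so adding or removing a bot is a one-line change.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "CCBot",
    "Google-Extended", "anthropic-ai", "ClaudeBot", "Bytespider",
    "Amazonbot", "FacebookBot", "PerplexityBot", "Cohere-ai",
    "Applebot-Extended",
]

def render_blocklist(bots):
    """Return robots.txt stanzas disallowing the site root for each bot."""
    stanzas = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "# Block AI training crawlers\n" + "\n\n".join(stanzas) + "\n"

print(render_blocklist(AI_BOTS))
```

Running this script as part of your deploy pipeline also gives you one place to review when auditing which bots are blocked.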
Block AI Crawlers from Specific Directories
If you want to allow AI bots to access some content (like your marketing pages) but protect blog posts or documentation, use targeted disallow rules.
```txt
User-agent: GPTBot
Disallow: /blog/
Disallow: /docs/
Disallow: /resources/

User-agent: CCBot
Disallow: /blog/
Disallow: /docs/
Disallow: /resources/
```
Allow Only Specific AI Bots
Maybe you want to block most AI crawlers but allow one that sends referral traffic. You can mix Allow and Disallow directives for the same bot.
```txt
User-agent: PerplexityBot
Allow: /blog/
Disallow: /

User-agent: GPTBot
Disallow: /
```
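Before deploying mixed Allow/Disallow rules, it is worth checking how they actually evaluate. Python's standard-library `urllib.robotparser` can do this offline. One caveat: Python's parser applies rules in file order (first match wins), while Google documents longest-match precedence, so list the more specific Allow before the broad Disallow, as done here:

```python
from urllib import robotparser

# The mixed Allow/Disallow rules from above, as a string.
RULES = """\
User-agent: PerplexityBot
Allow: /blog/
Disallow: /

User-agent: GPTBot
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# PerplexityBot may fetch blog posts but nothing else.
print(rp.can_fetch("PerplexityBot", "/blog/some-post"))  # True
print(rp.can_fetch("PerplexityBot", "/pricing"))         # False
# GPTBot is blocked everywhere.
print(rp.can_fetch("GPTBot", "/blog/some-post"))         # False
```

This is a quick sanity check, not a guarantee of how any particular crawler interprets your file.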
Robots.txt is a voluntary protocol. Well-behaved bots from major companies respect it, but rogue scrapers and smaller AI startups may ignore it entirely. Robots.txt is your first line of defense, not your only one.
Monitor Your Robots.txt
Site Watcher tracks your robots.txt for unauthorized changes and alerts you when something shifts. Free for up to 3 targets.
Beyond Robots.txt: Other Blocking Methods
Robots.txt works for compliant bots. For everything else, you need additional layers.
Server-Level Blocking
You can block user-agent strings at the web server level. In Nginx:
```nginx
if ($http_user_agent ~* (GPTBot|CCBot|Bytespider|anthropic-ai|ClaudeBot)) {
    return 403;
}
```
In Apache .htaccess:
```apache
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|Bytespider|anthropic-ai|ClaudeBot) [NC]
RewriteRule .* - [F,L]
```
This returns a 403 Forbidden response instead of relying on the bot to voluntarily obey robots.txt.
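If you cannot touch the web server configuration, the same check can run in application middleware. A sketch of the equivalent logic in Python (the function name is illustrative; the regex mirrors the Nginx/Apache rules above):

```python
import re

# Same pattern as the Nginx/Apache rules above, case-insensitive.
AI_BOT_PATTERN = re.compile(
    r"GPTBot|CCBot|Bytespider|anthropic-ai|ClaudeBot", re.IGNORECASE
)

def is_blocked(user_agent: str) -> bool:
    """Return True when the User-Agent matches a blocked AI crawler."""
    return bool(AI_BOT_PATTERN.search(user_agent or ""))

# Example: decide whether to return 403 for an incoming request.
print(is_blocked("Mozilla/5.0 (compatible; GPTBot/1.2)"))   # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0"))  # False
```

Wherever you hook this in, return 403 for matches, the same way the server-level rules do.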
Rate Limiting
If you do not want to block AI crawlers entirely, rate limiting can reduce their impact on your server. Tools like Cloudflare Bot Management, AWS WAF, or Nginx's limit_req module can throttle requests from known AI bot user agents or IP ranges.
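As one concrete approach, Nginx can key a rate-limit zone on the user agent via a map. The sketch below is illustrative, not a recommendation: the zone name, size, and rate are assumptions to adjust for your traffic.

```nginx
# Sketch (http context): throttle known AI bots instead of blocking them.
map $http_user_agent $ai_bot_key {
    default "";                              # empty key = not rate limited
    ~*(GPTBot|CCBot|Bytespider|ClaudeBot) $binary_remote_addr;
}

limit_req_zone $ai_bot_key zone=ai_bots:10m rate=1r/s;

server {
    location / {
        limit_req zone=ai_bots burst=5 nodelay;
        # ... your usual proxy_pass / root configuration
    }
}
```

Requests whose key evaluates to an empty string are not counted against the limit, so normal visitors are unaffected.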
AI-Specific Meta Tags
Some AI companies also respect page-level meta tags:
```html
<meta name="robots" content="noai, noimageai">
```
Note that noai and noimageai are non-standard directives. Google supports the nosnippet and max-snippet directives, which limit how much of your content can appear in snippets and AI-generated answers. However, support for these tags varies widely across AI providers.
The Pros and Cons of Blocking AI Crawlers
Before you block everything, consider the tradeoffs.
| Pros of Blocking | Cons of Blocking |
|---|---|
| Protects original content from unauthorized use | Your content won't appear in AI-powered search answers |
| Reduces server load from aggressive crawlers | Competitors who allow crawling may get cited instead |
| Maintains control over content distribution | Some AI search engines (Perplexity, SearchGPT) drive traffic |
| Prevents content from being used without attribution | Blocking CCBot may affect non-AI services that use Common Crawl |
The right decision depends on your business model. If you sell content (media, research, courses), blocking makes sense. If you rely on organic discovery and AI search is becoming a referral channel, selective blocking may be smarter.
Monitoring AI Crawler Activity
Blocking AI crawlers is only useful if you verify that your blocks are working. Here is what to monitor.
- Robots.txt integrity: confirm your AI bot directives are still present and have not been overwritten by a deployment or CMS update.
- Server access logs: watch for requests from AI bot user-agents; a bot that keeps appearing after being disallowed is ignoring robots.txt.
- Crawl budget impact: track whether blocking aggressive crawlers actually reduces server load and bandwidth usage.
- New bot detection: flag unfamiliar user-agent strings that may belong to newly launched AI crawlers.
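Access-log monitoring can be as simple as a script that counts hits per AI bot. A sketch assuming the common combined log format, where the user agent is the final quoted field (the function name and bot list are illustrative):

```python
import re
from collections import Counter

# User-agent substrings worth flagging; extend as new bots appear.
AI_BOT_NAMES = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot"]

# Combined log format: the user agent is the final quoted field.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')

def count_ai_bot_hits(log_lines):
    """Count requests per AI bot seen in access-log lines."""
    hits = Counter()
    for line in log_lines:
        match = UA_FIELD.search(line)
        if not match:
            continue
        ua = match.group(1).lower()
        for bot in AI_BOT_NAMES:
            if bot.lower() in ua:
                hits[bot] += 1
    return hits

sample = [
    '1.2.3.4 - - [17/Feb/2026:10:00:00 +0000] "GET /blog/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2)"',
    '5.6.7.8 - - [17/Feb/2026:10:00:01 +0000] "GET / HTTP/1.1" 200 1024 "-" "Mozilla/5.0 Chrome/120.0"',
]
print(count_ai_bot_hits(sample))
```

If a bot you disallowed weeks ago still shows nonzero counts, it is ignoring robots.txt and needs a server-level block.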
How to Audit Your Current Robots.txt
If you are not sure what your current robots.txt allows, run through this quick checklist.
1. Fetch your robots.txt. Open https://yourdomain.com/robots.txt in a browser and confirm it loads correctly rather than returning a 404 or an empty response.
2. Check for AI bot directives. Verify that every AI crawler you intend to block has its own User-agent block with a Disallow rule.
3. Validate the syntax. Make sure each User-agent line is followed by at least one Disallow directive. A User-agent block with no rules is treated as allowing everything.
4. Test with Google's tool. The robots.txt report in Search Console shows how Google fetches and parses your file and flags syntax errors.
5. Set up monitoring. Alert on any change to the file so an accidental overwrite does not silently remove your AI bot rules.
Keeping Your AI Crawler Blocks Current
The AI crawler ecosystem is evolving quickly. A robots.txt file that was comprehensive six months ago may be missing five new bots today. Here is a practical approach.
Maintain a master list of AI crawler user-agent strings. Review it monthly. Cross-reference it against published bot directories from OpenAI, Google, Anthropic, and others. When a new major AI product launches, check whether it comes with a new crawler.
Automate what you can. Set up alerts for changes to your robots.txt file so you know immediately if a deployment overwrites your AI bot rules. Monitor your server logs for unfamiliar user-agent strings that might indicate new AI crawlers.
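A minimal version of that alerting is a fingerprint comparison: hash the file on each fetch and compare against the last known-good hash. A sketch (function names are illustrative; the fetch step and alert delivery are left to your stack):

```python
import hashlib

def robots_fingerprint(robots_txt: str) -> str:
    """Stable SHA-256 fingerprint of a robots.txt body."""
    return hashlib.sha256(robots_txt.encode("utf-8")).hexdigest()

def has_changed(previous_fingerprint: str, current_body: str) -> bool:
    """True when the fetched robots.txt no longer matches the stored hash."""
    return robots_fingerprint(current_body) != previous_fingerprint

# Example: compare yesterday's snapshot against today's fetch.
baseline = robots_fingerprint("User-agent: GPTBot\nDisallow: /\n")
print(has_changed(baseline, "User-agent: GPTBot\nDisallow: /\n"))  # False
print(has_changed(baseline, "User-agent: *\nAllow: /\n"))          # True
```

Run the comparison on a schedule (cron, CI, or a monitoring service) and alert on any mismatch, then store the new hash once the change is confirmed intentional.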
The companies behind these bots update their crawling practices regularly. OpenAI, for example, has introduced multiple user-agent strings over time (GPTBot, then ChatGPT-User, then OAI-SearchBot). Each one needs its own entry in your robots.txt.
Blocking AI crawlers starts with robots.txt, but staying protected requires ongoing monitoring to catch new bots, accidental rule deletions, and crawlers that ignore the protocol entirely.
Keep Your AI Crawler Blocks Working
Site Watcher monitors your robots.txt file for changes and alerts you when directives are added, removed, or overwritten. $39/mo unlimited targets, free for 3.