The Complete robots.txt Guide: Syntax, Directives, and Best Practices
Learn what robots.txt is, how crawler directives work, the correct syntax for User-agent, Disallow, Allow, and Sitemap, and how to test your file.
Last updated: 2026-02-17
What Is robots.txt?
robots.txt is a plain text file that tells web crawlers which parts of your site they are allowed to access. It sits at the root of your domain (https://example.com/robots.txt) and is one of the oldest standards on the web, dating back to 1994.
When a well-behaved crawler arrives at your site, the first thing it does is check for a robots.txt file. The directives inside tell it which URLs are off-limits and which are fair game. The crawler then respects those rules as it crawls your content.
robots.txt is not a security mechanism. It is a set of polite instructions that legitimate crawlers choose to follow. Malicious bots and scrapers will ignore it entirely. But for search engines like Google, Bing, and others, it is the primary way you control crawl behavior.
How Crawlers Use robots.txt
The process is straightforward:
- A crawler arrives at https://example.com.
- Before crawling any pages, it fetches https://example.com/robots.txt.
- It looks for rules that match its User-agent string.
- It follows the Disallow and Allow directives for its matching User-agent.
- It notes any Sitemap declarations for later processing.
If no robots.txt file exists (returns 404), the crawler assumes it can access everything. If the file returns a server error (5xx), most crawlers will temporarily pause crawling until they can fetch the file successfully.
Google treats a missing robots.txt (404 response) as permission to crawl everything. But if your robots.txt returns a 5xx error, Googlebot will limit crawling until it can confirm the rules. A broken robots.txt can effectively pause your site's indexing.
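The matching step in this process can be reproduced with Python's standard-library urllib.robotparser. Here the rules are parsed from an inline string for clarity; a real crawler would first fetch the live file with set_url() and read():

```python
import urllib.robotparser

# Parse rules from a string; a real crawler would call set_url(...) and
# read() to fetch https://example.com/robots.txt over the network first.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/private/report"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about"))           # True
```

can_fetch() answers the same question a polite crawler asks before requesting any URL: do the rules for my User-agent allow this path?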
Location Requirements
Your robots.txt file must be:
- Located at the root of your domain: https://example.com/robots.txt
- Accessible via HTTP or HTTPS (not inside a subdirectory)
- A plain text file (Content-Type: text/plain)
- UTF-8 encoded
Each subdomain needs its own robots.txt. The file at example.com/robots.txt does not apply to blog.example.com -- that subdomain needs its own file at blog.example.com/robots.txt.
Similarly, different protocols are separate. If your site serves both http:// and https:// (which it should not, but some do), each needs its own robots.txt.
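Because scope is tied to the exact scheme and host, the robots.txt URL for any page can be derived mechanically. A small sketch (robots_url is an illustrative helper, not a standard function):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt scope is the exact scheme + host, so derive it from any URL
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/1?x=1"))
# https://blog.example.com/robots.txt
```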
robots.txt Syntax
The syntax is simple but precise. A robots.txt file consists of one or more "groups," each starting with a User-agent: line followed by one or more directives.
User-agent
The User-agent: line specifies which crawler the following rules apply to.
User-agent: Googlebot
Common User-agent values:
| User-agent | Crawler | Owner |
|---|---|---|
| Googlebot | Google web search | Google |
| Bingbot | Bing web search | Microsoft |
| Slurp | Yahoo search | Yahoo |
| DuckDuckBot | DuckDuckGo search | DuckDuckGo |
| GPTBot | OpenAI web crawler | OpenAI |
| Google-Extended | Google AI training | Google |
| CCBot | Common Crawl | Common Crawl |
| * | All crawlers (wildcard) | Universal |
The wildcard * matches any crawler that does not have more specific rules. Most sites only need a User-agent: * block.
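Group selection can be sketched in a few lines: the most specific matching User-agent token wins, and * is the fallback. This is a simplification for illustration (select_group is a hypothetical helper; real crawlers match rules against a product token such as "Googlebot", not the raw User-Agent header):

```python
def select_group(crawler_ua: str, groups: dict[str, list[str]]) -> list[str]:
    """Pick the rule group for a crawler: the most specific matching
    User-agent token wins, with '*' as the fallback. A simplified sketch --
    real crawlers match against a product token, not the raw UA string."""
    ua = crawler_ua.lower()
    best = None
    for token in groups:
        t = token.lower()
        if t != "*" and t in ua:
            if best is None or len(t) > len(best.lower()):
                best = token
    return groups[best] if best is not None else groups.get("*", [])

groups = {"*": ["Disallow: /admin/"], "Googlebot": ["Disallow: /"]}
print(select_group("Googlebot-Image/1.0", groups))  # ['Disallow: /']
print(select_group("Bingbot/2.0", groups))          # ['Disallow: /admin/']
```

Note that a crawler with its own named group ignores the * group entirely; the groups do not combine.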
Disallow
The Disallow: directive blocks access to a URL path. Crawlers matching the User-agent will not access URLs starting with this path.
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Rules for Disallow:
- The path is case-sensitive. /Admin/ and /admin/ are different.
- The path matches from the start of the URL path. Disallow: /private/ blocks /private/, /private/page1, /private/data/file.html, and so on.
- An empty Disallow: means "disallow nothing" (allow everything). Disallow: / blocks the entire site.
Allow
The Allow: directive overrides a Disallow: for a more specific path. It is useful when you want to block a directory but permit specific pages within it.
User-agent: *
Disallow: /api/
Allow: /api/public/
This blocks all /api/ paths except those under /api/public/.
When both Allow and Disallow match a URL, the more specific (longer) rule wins. If they are the same length, Allow takes precedence. This is the Google standard -- other crawlers may handle conflicts differently.
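This longest-match precedence can be expressed in a few lines of Python. The is_allowed function below is an illustrative sketch using plain prefix matching; it ignores * and $ wildcards:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: (directive, pattern) pairs, e.g. ("disallow", "/api/").
    The longest matching pattern wins; on a length tie, allow wins.
    Plain prefix matching only -- no * or $ wildcard support."""
    best = None  # (pattern_length, is_allow); tuple comparison does the work
    for directive, pattern in rules:
        if path.startswith(pattern):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [("disallow", "/api/"), ("allow", "/api/public/")]
print(is_allowed("/api/public/docs", rules))  # True: /api/public/ is longer
print(is_allowed("/api/internal", rules))     # False
```

The tuple comparison encodes both tie-breaking rules: a longer pattern beats a shorter one, and at equal length the allow entry (True) sorts above disallow (False).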
Sitemap
The Sitemap: directive tells crawlers where to find your XML sitemap. Unlike other directives, it is not tied to a specific User-agent block.
Sitemap: https://example.com/sitemap.xml
You can list multiple sitemaps:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
The Sitemap directive must use the full absolute URL including the protocol.
Wildcards and Pattern Matching
Google and Bing support two wildcard characters in robots.txt directives that are not part of the original 1994 standard but are widely used:
Asterisk (*) -- Matches any sequence of characters.
Disallow: /*?sort=
This blocks any URL containing ?sort=, such as /products?sort=price. Note that a trailing * is never needed: rules already match by prefix, so /search?q= and /search?q=* are equivalent.
Dollar sign ($) -- Matches the end of the URL.
Disallow: /*.php$
This blocks all URLs ending in .php while allowing URLs like /page.php/comments.
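One way to reason about these patterns is to translate them into regular expressions: * becomes .*, every other character is escaped literally, and a trailing $ becomes a regex end anchor. A sketch (pattern_to_regex is an illustrative helper, not how any particular crawler is implemented):

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # '*' matches any run of characters; a trailing '$' anchors the end
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(body + ("$" if anchored else ""))

blocks_pdf = pattern_to_regex("/*.pdf$")
print(bool(blocks_pdf.match("/files/report.pdf")))   # True
print(bool(blocks_pdf.match("/report.pdf/preview"))) # False
```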
Common Directives and Examples
Allow all crawlers full access
User-agent: *
Disallow:
An empty Disallow: means nothing is blocked. This is the most permissive configuration.
Block all crawlers from entire site
User-agent: *
Disallow: /
Useful for staging environments or sites under development. Do not use this in production unless you intentionally want to de-index your site.
Block specific directories
User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /internal/
Block specific crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Disallow: /admin/
This blocks GPTBot and CCBot from the entire site while allowing all other crawlers with standard restrictions.
Complex rules with Allow overrides
User-agent: Googlebot
Disallow: /api/
Allow: /api/docs/
Disallow: /internal/
Allow: /internal/press/
User-agent: *
Disallow: /api/
Disallow: /internal/
Googlebot gets access to /api/docs/ and /internal/press/, while all other crawlers are blocked from both directories entirely.
What robots.txt Does Not Do
Understanding the limitations is as important as understanding the functionality.
It Is Not a Security Tool
robots.txt does not prevent access to content. It is a request, not an enforcement mechanism. Anyone (or any bot) can still access URLs listed in Disallow: directives.
If you need to actually protect content, use:
- Authentication (login required)
- Server-side access controls (IP restrictions, firewalls)
- HTTP authentication (Basic or Digest auth)
Never put sensitive URLs in your robots.txt file. Ironically, listing a secret path in robots.txt makes it more discoverable, because attackers routinely scan robots.txt files to find hidden directories.
It Does Not Remove Pages from Google
Blocking a URL with robots.txt does not remove it from Google's index. If Google already knows about the URL (from external links, for example), it may continue to show it in search results -- just without a page description, because it cannot crawl the content.
To remove a page from Google:
- Use the noindex meta tag or X-Robots-Tag header.
- Use Google Search Console's URL removal tool for temporary removals.
- Return a 404 or 410 status code for permanent removals.
It Does Not Control Indexing
robots.txt controls crawling, not indexing. A page can appear in search results even if robots.txt blocks crawling, as long as Google knows the URL exists from other sources.
For indexing control, use:
- <meta name="robots" content="noindex"> to prevent indexing
- <meta name="robots" content="nofollow"> to prevent following links on the page
- X-Robots-Tag HTTP header for non-HTML resources
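For non-HTML resources such as PDFs, the noindex signal has to travel in a response header rather than a meta tag. A minimal sketch using Python's standard-library HTTP server (NoIndexHandler is a hypothetical name, and a production site would set this header in its web server or framework config instead):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # X-Robots-Tag carries indexing directives for non-HTML responses
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Type", "application/pdf")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4 ...")

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port in a background thread
server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

Crucially, the crawler can only see this header if robots.txt allows it to fetch the URL in the first place; a Disallow rule would hide the noindex signal.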
A Complete robots.txt Example
Here is a well-structured robots.txt file for a typical business website:
# Robots.txt for example.com
# Last updated: 2026-02-15
# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
# Allow search engine crawlers to access API docs
User-agent: Googlebot
Allow: /api/docs/
User-agent: Bingbot
Allow: /api/docs/
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Sitemap locations
Sitemap: https://example.com/sitemap.xml
Testing Your robots.txt
Before deploying changes to your robots.txt, test them to ensure they work as intended.
Use Search Console's robots.txt report
Google retired its standalone robots.txt Tester in 2023. Its replacement, the robots.txt report in Google Search Console (under Settings), shows which robots.txt files Google found for your site, when each was last crawled, and any warnings or parse errors.
Use the URL Inspection Tool
In Google Search Console, enter any URL and check whether Googlebot can access it. This tests the real-world effect of your robots.txt rules.
Manual verification
Open your robots.txt file in a browser (https://yourdomain.com/robots.txt) to confirm it is accessible and formatted correctly.
Check for syntax errors
Look for common mistakes: missing colons, incorrect paths, typos in User-agent names. Online robots.txt validators can catch formatting issues.
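A few lines of Python can catch the purely mechanical errors before deployment. This lint_robots function is an illustrative sketch, not a full validator, and the set of known directives is an assumption:

```python
# Directives this sketch accepts; Crawl-delay is nonstandard but common
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text: str) -> list[str]:
    """Flag lines that are not comments, blanks, or known directives.
    An illustrative sketch, not a full validator."""
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # comments start with '#'
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing colon")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {n}: unknown directive {field!r}")
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/"))
# ["line 2: unknown directive 'disalow'"]
```

A check like this catches the "Disalow" typo class of error; it cannot tell you whether the paths you listed are the ones you meant.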
Verify crawler behavior
After deploying, monitor your server logs or Google Search Console to confirm crawlers are respecting your rules. Unexpected crawling of blocked paths indicates a problem.
Common robots.txt Mistakes
Blocking CSS and JavaScript files. Google needs access to your CSS and JS to render pages correctly. Blocking these resources can hurt your rankings because Google cannot see your page as users see it.
Blocking your sitemap. If your robots.txt blocks the path where your sitemap lives, crawlers cannot access it. Make sure the sitemap URL is not caught by any Disallow rules.
Using robots.txt to hide pages from Google. As covered above, this does not work. The page can still appear in search results with a truncated listing. Use noindex instead.
Forgetting the trailing slash. Disallow: /admin blocks /admin, /admin/, and /administrator. Disallow: /admin/ only blocks paths starting with /admin/. Be precise.
Not testing after changes. A single syntax error can change the meaning of your entire file. Always test before and after deployment.
robots.txt is the first thing search engines read when they visit your site -- make sure it says exactly what you intend.
Monitor robots.txt and More
Site Watcher tracks your robots.txt, sitemap, SSL certificates, DNS records, and uptime from one dashboard. Free for 3 targets. Unlimited monitoring for $39/mo.