The Complete robots.txt Guide: Syntax, Directives, and Best Practices
Learn what robots.txt is, how crawler directives work, the correct syntax for User-agent, Disallow, Allow, and Sitemap, and how to test your file.
Last updated: 2026-02-17
What Is robots.txt?
robots.txt is a plain text file that tells web crawlers which parts of your site they are allowed to access. It sits at the root of your domain (https://example.com/robots.txt) and is one of the oldest standards on the web, dating back to 1994.
When a well-behaved crawler arrives at your site, the first thing it does is check for a robots.txt file. The directives inside tell it which URLs are off-limits and which are fair game. The crawler then respects those rules as it crawls your content.
robots.txt is not a security mechanism. It is a set of polite instructions that legitimate crawlers choose to follow. Malicious bots and scrapers will ignore it entirely. But for search engines like Google, Bing, and others, it is the primary way you control crawl behavior.
How Crawlers Use robots.txt
The process is straightforward:
- A crawler arrives at https://example.com.
- Before crawling any pages, it fetches https://example.com/robots.txt.
- It looks for rules that match its User-agent string.
- It follows the Disallow and Allow directives for its matching User-agent.
- It notes any Sitemap declarations for later processing.
If no robots.txt file exists (returns 404), the crawler assumes it can access everything. If the file returns a server error (5xx), most crawlers will temporarily pause crawling until they can fetch the file successfully.
Google treats a missing robots.txt (404 response) as permission to crawl everything. But if your robots.txt returns a 5xx error, Googlebot will limit crawling until it can confirm the rules. A broken robots.txt can effectively pause your site's indexing.
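The matching step in this process can be reproduced with Python's standard-library urllib.robotparser. Here the rules are parsed from an inline string for clarity; a real crawler would first fetch the live file with set_url() and read():

```python
import urllib.robotparser

# Parse rules from a string; a real crawler would call set_url(...) and
# read() to fetch https://example.com/robots.txt over the network first.
rp = urllib.robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("Googlebot", "https://example.com/private/report"))  # False
print(rp.can_fetch("Googlebot", "https://example.com/about"))           # True
```

can_fetch() answers the same question a polite crawler asks before requesting any URL: do the rules for my User-agent allow this path?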
Location Requirements
Your robots.txt file must be:
- Located at the root of your domain: https://example.com/robots.txt
- Accessible via HTTP or HTTPS (not inside a subdirectory)
- A plain text file (Content-Type: text/plain)
- UTF-8 encoded
Each subdomain needs its own robots.txt. The file at example.com/robots.txt does not apply to blog.example.com -- that subdomain needs its own file at blog.example.com/robots.txt.
Similarly, different protocols are separate. If your site serves both http:// and https:// (which it should not, but some do), each needs its own robots.txt.
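Because scope is tied to the exact scheme and host, the robots.txt URL for any page can be derived mechanically. A small sketch (robots_url is an illustrative helper, not a standard function):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt scope is the exact scheme + host, so derive it from any URL
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://blog.example.com/posts/1?x=1"))
# https://blog.example.com/robots.txt
```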
robots.txt Syntax
The syntax is simple but precise. A robots.txt file consists of one or more "groups," each starting with a User-agent: line followed by one or more directives.
User-agent
The User-agent: line specifies which crawler the following rules apply to.
User-agent: Googlebot
Common User-agent values:
| User-agent | Crawler | Owner |
|---|---|---|
| Googlebot | Google web search | Google |
| Bingbot | Bing web search | Microsoft |
| Slurp | Yahoo search | Yahoo |
| DuckDuckBot | DuckDuckGo search | DuckDuckGo |
| GPTBot | OpenAI web crawler | OpenAI |
| Google-Extended | Google AI training | Google |
| CCBot | Common Crawl | Common Crawl |
| * | All crawlers (wildcard) | Universal |
The wildcard * matches any crawler that does not have more specific rules. Most sites only need a User-agent: * block.
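Group selection can be sketched in a few lines: the most specific matching User-agent token wins, and * is the fallback. This is a simplification for illustration (select_group is a hypothetical helper; real crawlers match rules against a product token such as "Googlebot", not the raw User-Agent header):

```python
def select_group(crawler_ua: str, groups: dict[str, list[str]]) -> list[str]:
    """Pick the rule group for a crawler: the most specific matching
    User-agent token wins, with '*' as the fallback. A simplified sketch --
    real crawlers match against a product token, not the raw UA string."""
    ua = crawler_ua.lower()
    best = None
    for token in groups:
        t = token.lower()
        if t != "*" and t in ua:
            if best is None or len(t) > len(best.lower()):
                best = token
    return groups[best] if best is not None else groups.get("*", [])

groups = {"*": ["Disallow: /admin/"], "Googlebot": ["Disallow: /"]}
print(select_group("Googlebot-Image/1.0", groups))  # ['Disallow: /']
print(select_group("Bingbot/2.0", groups))          # ['Disallow: /admin/']
```

Note that a crawler with its own named group ignores the * group entirely; the groups do not combine.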
Disallow
The Disallow: directive blocks access to a URL path. Crawlers matching the User-agent will not access URLs starting with this path.
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/
Rules for Disallow:
- The path is case-sensitive. /Admin/ and /admin/ are different.
- The path matches from the start of the URL path. Disallow: /private/ blocks /private/, /private/page1, /private/data/file.html, and so on.
- An empty Disallow: means "disallow nothing" (allow everything). Disallow: / blocks the entire site.
Allow
The Allow: directive overrides a Disallow: for a more specific path. It is useful when you want to block a directory but permit specific pages within it.
User-agent: *
Disallow: /api/
Allow: /api/public/
This blocks all /api/ paths except those under /api/public/.
When both Allow and Disallow match a URL, the more specific (longer) rule wins. If they are the same length, Allow takes precedence. This is the Google standard -- other crawlers may handle conflicts differently.
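This longest-match precedence can be expressed in a few lines of Python. The is_allowed function below is an illustrative sketch using plain prefix matching; it ignores * and $ wildcards:

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules: (directive, pattern) pairs, e.g. ("disallow", "/api/").
    The longest matching pattern wins; on a length tie, allow wins.
    Plain prefix matching only -- no * or $ wildcard support."""
    best = None  # (pattern_length, is_allow); tuple comparison does the work
    for directive, pattern in rules:
        if path.startswith(pattern):
            candidate = (len(pattern), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    return True if best is None else best[1]

rules = [("disallow", "/api/"), ("allow", "/api/public/")]
print(is_allowed("/api/public/docs", rules))  # True: /api/public/ is longer
print(is_allowed("/api/internal", rules))     # False
```

The tuple comparison encodes both tie-breaking rules: a longer pattern beats a shorter one, and at equal length the allow entry (True) sorts above disallow (False).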
Sitemap
The Sitemap: directive tells crawlers where to find your XML sitemap. Unlike other directives, it is not tied to a specific User-agent block.
Sitemap: https://example.com/sitemap.xml
You can list multiple sitemaps:
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
The Sitemap directive must use the full absolute URL including the protocol.
Wildcards and Pattern Matching
Google and Bing support two wildcard characters in robots.txt directives that are not part of the original 1994 standard but are widely used:
Asterisk (*) -- Matches any sequence of characters.
Disallow: /*?sort=
This blocks any URL containing ?sort=, such as /products?sort=price. Note that a trailing * is never needed: rules already match by prefix, so /search?q= and /search?q=* are equivalent.
Dollar sign ($) -- Matches the end of the URL.
Disallow: /*.php$
This blocks all URLs ending in .php while allowing URLs like /page.php/comments.
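One way to reason about these patterns is to translate them into regular expressions: * becomes .*, every other character is escaped literally, and a trailing $ becomes a regex end anchor. A sketch (pattern_to_regex is an illustrative helper, not how any particular crawler is implemented):

```python
import re

def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    # '*' matches any run of characters; a trailing '$' anchors the end
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    body = "".join(".*" if ch == "*" else re.escape(ch) for ch in core)
    return re.compile(body + ("$" if anchored else ""))

blocks_pdf = pattern_to_regex("/*.pdf$")
print(bool(blocks_pdf.match("/files/report.pdf")))   # True
print(bool(blocks_pdf.match("/report.pdf/preview"))) # False
```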
Common Directives and Examples
Allow all crawlers full access
User-agent: *
Disallow:
An empty Disallow: means nothing is blocked. This is the most permissive configuration.
Block all crawlers from entire site
User-agent: *
Disallow: /
Useful for staging environments or sites under development. Do not use this in production unless you intentionally want to de-index your site.
Block specific directories
User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /internal/
Block specific crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: *
Disallow: /admin/
This blocks GPTBot and CCBot from the entire site while allowing all other crawlers with standard restrictions.
Complex rules with Allow overrides
User-agent: Googlebot
Disallow: /api/
Allow: /api/docs/
Disallow: /internal/
Allow: /internal/press/
User-agent: *
Disallow: /api/
Disallow: /internal/
Googlebot gets access to /api/docs/ and /internal/press/, while all other crawlers are blocked from both directories entirely.
What robots.txt Does Not Do
Understanding the limitations is as important as understanding the functionality.
It Is Not a Security Tool
robots.txt does not prevent access to content. It is a request, not an enforcement mechanism. Anyone (or any bot) can still access URLs listed in Disallow: directives.
If you need to actually protect content, use:
- Authentication (login required)
- Server-side access controls (IP restrictions, firewalls)
- HTTP authentication (Basic or Digest auth)
Never put sensitive URLs in your robots.txt file. Ironically, listing a secret path in robots.txt makes it more discoverable, because attackers routinely scan robots.txt files to find hidden directories.
It Does Not Remove Pages from Google
Blocking a URL with robots.txt does not remove it from Google's index. If Google already knows about the URL (from external links, for example), it may continue to show it in search results -- just without a page description, because it cannot crawl the content.
To remove a page from Google:
- Use the noindex meta tag or X-Robots-Tag header.
- Use Google Search Console's URL removal tool for temporary removals.
- Return a 404 or 410 status code for permanent removals.
It Does Not Control Indexing
robots.txt controls crawling, not indexing. A page can appear in search results even if robots.txt blocks crawling, as long as Google knows the URL exists from other sources.
For indexing control, use:
- <meta name="robots" content="noindex"> to prevent indexing
- <meta name="robots" content="nofollow"> to prevent following links on the page
- X-Robots-Tag HTTP header for non-HTML resources
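For non-HTML resources such as PDFs, the noindex signal has to travel in a response header rather than a meta tag. A minimal sketch using Python's standard-library HTTP server (NoIndexHandler is a hypothetical name, and a production site would set this header in its web server or framework config instead):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        # X-Robots-Tag carries indexing directives for non-HTML responses
        self.send_header("X-Robots-Tag", "noindex, nofollow")
        self.send_header("Content-Type", "application/pdf")
        self.end_headers()
        self.wfile.write(b"%PDF-1.4 ...")

    def log_message(self, *args):  # keep the demo quiet
        pass

# Serve on an ephemeral port in a background thread
server = HTTPServer(("127.0.0.1", 0), NoIndexHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
```

Crucially, the crawler can only see this header if robots.txt allows it to fetch the URL in the first place; a Disallow rule would hide the noindex signal.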
A Complete robots.txt Example
Here is a well-structured robots.txt file for a typical business website:
# Robots.txt for example.com
# Last updated: 2026-02-15
# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /api/
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /search?
Disallow: /*?sort=
Disallow: /*?filter=
# Allow search engine crawlers to access API docs
User-agent: Googlebot
Allow: /api/docs/
User-agent: Bingbot
Allow: /api/docs/
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
# Sitemap locations
Sitemap: https://example.com/sitemap.xml
Testing Your robots.txt
Before deploying changes to your robots.txt, test them to ensure they work as intended.
Use Search Console's robots.txt report
Google retired its standalone robots.txt Tester in 2023. Its replacement, the robots.txt report in Google Search Console (under Settings), shows which robots.txt files Google found for your site, when each was last crawled, and any warnings or parse errors.
Use the URL Inspection Tool
In Google Search Console, enter any URL and check whether Googlebot can access it. This tests the real-world effect of your robots.txt rules.
Manual verification
Open your robots.txt file in a browser (https://yourdomain.com/robots.txt) to confirm it is accessible and formatted correctly.
Check for syntax errors
Look for common mistakes: missing colons, incorrect paths, typos in User-agent names. Online robots.txt validators can catch formatting issues.
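A few lines of Python can catch the purely mechanical errors before deployment. This lint_robots function is an illustrative sketch, not a full validator, and the set of known directives is an assumption:

```python
# Directives this sketch accepts; Crawl-delay is nonstandard but common
KNOWN_FIELDS = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text: str) -> list[str]:
    """Flag lines that are not comments, blanks, or known directives.
    An illustrative sketch, not a full validator."""
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # comments start with '#'
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {n}: missing colon")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {n}: unknown directive {field!r}")
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/"))
# ["line 2: unknown directive 'disalow'"]
```

A check like this catches the "Disalow" typo class of error; it cannot tell you whether the paths you listed are the ones you meant.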
Verify crawler behavior
After deploying, monitor your server logs or Google Search Console to confirm crawlers are respecting your rules. Unexpected crawling of blocked paths indicates a problem.
Common robots.txt Mistakes
Blocking CSS and JavaScript files. Google needs access to your CSS and JS to render pages correctly. Blocking these resources can hurt your rankings because Google cannot see your page as users see it.
Blocking your sitemap. If your robots.txt blocks the path where your sitemap lives, crawlers cannot access it. Make sure the sitemap URL is not caught by any Disallow rules.
Using robots.txt to hide pages from Google. As covered above, this does not work. The page can still appear in search results with a truncated listing. Use noindex instead.
Forgetting the trailing slash. Disallow: /admin blocks /admin, /admin/, and /administrator. Disallow: /admin/ only blocks paths starting with /admin/. Be precise.
Not testing after changes. A single syntax error can change the meaning of your entire file. Always test before and after deployment.
robots.txt is the first thing search engines read when they visit your site -- make sure it says exactly what you intend.
Monitor robots.txt and More
Site Watcher tracks your robots.txt, sitemap, SSL certificates, DNS records, and uptime from one dashboard. Free for 3 targets. Unlimited monitoring for $39/mo.