How to Create a robots.txt File: Format, Rules, and Platform Guides
Step-by-step guide to creating a robots.txt file with correct syntax, common directives, platform-specific instructions, and testing methods.
Last updated: 2026-02-17
Why Every Site Needs a robots.txt File
A robots.txt file gives you control over how search engine crawlers interact with your site. Without one, crawlers will attempt to access every URL they find -- including admin pages, duplicate content, search result pages, and other paths you may not want indexed.
Creating a robots.txt file is a five-minute task that pays dividends for the life of your site. It preserves crawl budget by directing crawlers away from unimportant pages, prevents indexing of internal tools, and provides a standard location to declare your sitemap.
This guide walks through creating a robots.txt file from scratch, configuring it for common platforms, and testing it before deployment.
File Format Basics
A robots.txt file is a plain text file with a specific structure. There is no HTML, no XML, no special encoding -- just plain text following a simple syntax.
Core Rules
- Filename: Must be exactly robots.txt (lowercase).
- Location: Must be in the root directory of your domain. The URL must be https://yourdomain.com/robots.txt.
- Encoding: UTF-8.
- Content-Type: text/plain.
- Line endings: Both \n (Unix) and \r\n (Windows) are acceptable.
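Putting these rules together, the simplest valid robots.txt that permits full crawling is just two lines -- an empty Disallow value blocks nothing:

```
User-agent: *
Disallow:
```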
File Structure
Every robots.txt file consists of one or more "groups." Each group starts with a User-agent: line and contains one or more Disallow: or Allow: directives.
User-agent: [crawler name or * for all]
Disallow: [path to block]
Allow: [path to permit]
Sitemap: [full URL to sitemap]
Comments start with # and extend to the end of the line:
# This is a comment
User-agent: * # This applies to all crawlers
Disallow: /admin/
Creating Your robots.txt Step by Step
Create a new text file
Open any text editor (VS Code, Notepad, Sublime Text, nano) and create a new file. Name it exactly robots.txt. Watch out for editors that silently save it as robots.txt.txt -- your file manager may hide the extension, so double-check.
Add a default User-agent group
Start with a wildcard rule that applies to all crawlers. This is your baseline configuration.
Define your Disallow rules
List the paths you want to block from crawlers. Think about admin areas, internal tools, search results, and duplicate content paths.
Add any Allow overrides
If you blocked a broad path but need to permit a specific subdirectory, add Allow directives.
Add crawler-specific rules
If you need different rules for specific crawlers (blocking AI crawlers, giving Googlebot special access), add additional User-agent groups.
Declare your sitemap
Add one or more Sitemap directives with the full URL to your XML sitemap.
Upload to your site root
Place the file at the root of your web server so it is accessible at https://yourdomain.com/robots.txt.
Building a robots.txt From Scratch
Let's build a complete robots.txt file for a typical business website, explaining each section.
Step 1: Start with the default rules
# robots.txt for yourdomain.com
User-agent: *
The User-agent: * line means these rules apply to every crawler that does not have its own specific section.
Step 2: Block administrative and internal paths
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
These are paths that serve no purpose in search results. Admin panels, API endpoints, and internal tools should not be crawled.
Step 3: Block search and filter URLs
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Search result pages, sorted lists, and filtered views create massive amounts of duplicate content. Blocking these paths prevents crawlers from wasting budget on low-value URL variations.
The * wildcard in /*?sort= matches any sequence of characters before ?sort=. This catches URLs like /products?sort=price and /blog?sort=date. The $ end-of-URL anchor is also available but not needed here.
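To illustrate how these wildcards behave, here is a rough TypeScript model of the matching logic -- an approximation of the documented behavior for teaching purposes, not any crawler's actual implementation:

```typescript
// Sketch of robots.txt pattern matching: '*' matches any run of characters,
// a trailing '$' anchors the end of the URL, and rules otherwise match as
// prefixes of the path. This is an illustrative model, not a real crawler.
function ruleMatches(pattern: string, path: string): boolean {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters, then translate '*' into '.*'.
  const escaped = body
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  // Anchor at the start of the path (robots.txt rules are prefix matches).
  const re = new RegExp('^' + escaped + (anchored ? '$' : ''));
  return re.test(path);
}

console.log(ruleMatches('/*?sort=', '/products?sort=price')); // true
console.log(ruleMatches('/*?sort=', '/blog?sort=date'));      // true
console.log(ruleMatches('/*.pdf$', '/files/report.pdf'));     // true
console.log(ruleMatches('/*.pdf$', '/files/report.pdf?v=2')); // false (the $ anchor)
console.log(ruleMatches('/p', '/pricing'));                   // true (prefix match)
```

The last example shows why overly short rules are dangerous: /p matches /pricing, /products, and every other path starting with /p.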
Step 4: Add Allow overrides where needed
User-agent: Googlebot
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Allow: /api/docs/
Allow: /api/public/
If you have public API documentation or a public API that should be indexed, override the blanket /api/ block for Googlebot. Note that a crawler obeys only the single most specific User-agent group that matches it, so the Googlebot group must repeat the default Disallow rules -- a group containing only Allow lines would let Googlebot crawl everything else unimpeded. Within a group, the most specific (longest) matching rule wins, which is why Allow: /api/docs/ overrides Disallow: /api/.
Step 5: Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
Many site owners now block AI training crawlers. Each crawler needs its own User-agent block with Disallow: / to block it from the entire site.
Step 6: Declare your sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Monitor Your robots.txt Changes
Site Watcher alerts you when your robots.txt file changes unexpectedly -- catching misconfigurations before they affect crawling.
The Complete File
Here is the finished robots.txt:
# robots.txt for yourdomain.com
# Last updated: 2026-02-17
# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Googlebot follows only its own group, so repeat the defaults
# and add access to public API docs
User-agent: Googlebot
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Allow: /api/docs/
Allow: /api/public/
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Platform-Specific Instructions
WordPress
WordPress auto-generates a virtual robots.txt file. To customize it:
Option 1: Physical file. Create a robots.txt file and upload it to your WordPress root directory (the same directory as wp-config.php). A physical file overrides WordPress's virtual one.
Option 2: SEO plugin. Yoast SEO, Rank Math, and All in One SEO all provide robots.txt editors in their settings. This is the easiest approach because it does not require FTP access.
WordPress's default virtual robots.txt blocks /wp-admin/ but allows /wp-admin/admin-ajax.php (needed for front-end functionality). If you create a custom file, replicate this rule.
A typical WordPress robots.txt:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/
Sitemap: https://yourdomain.com/sitemap_index.xml
Shopify
Shopify generates a robots.txt automatically, and you cannot upload a file directly. To customize it:
- Go to Online Store > Themes.
- Click Actions > Edit code.
- In the Templates folder, add a new template called robots.txt.liquid.
- Add your custom robots.txt content using Liquid syntax.
Shopify's default robots.txt blocks admin, cart, checkout, and internal paths. Review it before making changes to ensure you do not accidentally allow paths that should remain blocked.
Next.js
In a Next.js project, you can create a static or dynamic robots.txt.
Static file: Place a robots.txt file in the public/ directory. It will be served at the root path.
Dynamic generation (App Router):
// app/robots.ts
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        disallow: ['/api/', '/admin/', '/internal/'],
      },
      {
        userAgent: 'GPTBot',
        disallow: ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
Static Site Generators (Gatsby, Hugo, Astro)
Place a robots.txt file in the static/public directory:
- Gatsby: static/robots.txt
- Hugo: static/robots.txt
- Astro: public/robots.txt
The file is copied as-is to the build output root.
Common Blocking Rules
Here are ready-to-use directive patterns for common scenarios.
Block all crawlers from entire site
User-agent: *
Disallow: /
Useful for staging sites and sites under development.
Block URL parameters
Disallow: /*? -- blocks all URLs containing query parameters. Be careful: this may block legitimate parameterized pages.
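One way to hedge, sketched here with a hypothetical ?id= parameter: because the most specific (longest) matching rule wins, a longer Allow can carve an exception out of the blanket parameter block:

```
User-agent: *
Disallow: /*?
Allow: /*?id=
```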
Block specific file types
Disallow: /*.pdf$ -- blocks PDF files from being crawled. Useful if PDFs contain duplicate content or sensitive documents.
Block specific sections
Disallow: /category/
Disallow: /tag/
Blocks WordPress taxonomy archive pages that often create thin, duplicate content.
Crawl-delay for aggressive bots
Crawl-delay: 10 -- asks crawlers to wait 10 seconds between requests. Supported by Bing and Yandex, but not Google.
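For example, to throttle Bingbot specifically without changing what it may crawl, give it its own group:

```
User-agent: Bingbot
Crawl-delay: 10
```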
Testing Your robots.txt
Never deploy a new robots.txt without testing. A single mistake can block Googlebot from your entire site.
Validate syntax online
Use an online robots.txt validator to check for syntax errors. These tools parse the file and flag incorrect formatting, missing colons, or invalid directives.
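As a sketch of the kind of check such a tool performs, this minimal TypeScript linter flags missing colons and unknown directives -- illustrative only, since real validators implement the full specification:

```typescript
// Minimal line-level robots.txt lint: flags lines without a ':' separator
// and directives outside the commonly supported set. Illustrative sketch,
// not a complete validator.
const KNOWN_DIRECTIVES = new Set([
  'user-agent', 'disallow', 'allow', 'sitemap', 'crawl-delay',
]);

function lintRobotsTxt(content: string): string[] {
  const problems: string[] = [];
  content.split(/\r?\n/).forEach((rawLine, i) => {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    if (line === '') return;                         // blank lines are fine
    const colon = line.indexOf(':');
    if (colon === -1) {
      problems.push(`line ${i + 1}: missing ':' separator`);
      return;
    }
    const directive = line.slice(0, colon).trim().toLowerCase();
    if (!KNOWN_DIRECTIVES.has(directive)) {
      problems.push(`line ${i + 1}: unknown directive '${directive}'`);
    }
  });
  return problems;
}

console.log(lintRobotsTxt('User-agent: *\nDisallow: /admin/')); // no problems
console.log(lintRobotsTxt('Dissalow: /admin/'));                // flags the typo
```

A check like this catches the most common copy-paste mistakes (misspelled directives, dropped colons) before the file ever reaches production.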
Test specific URLs
Use Google Search Console's URL Inspection tool to test whether specific URLs are blocked. Enter important pages and verify they are not accidentally blocked.
Check in a browser
Navigate to https://yourdomain.com/robots.txt in your browser. Verify it loads as plain text, not as an HTML page or error.
Review the Content-Type header
Use browser developer tools (Network tab) to confirm the response has Content-Type: text/plain. An incorrect Content-Type can cause parsing failures.
Monitor after deployment
After deploying, watch your Google Search Console crawl stats and index coverage for unexpected changes. A misconfigured robots.txt will show effects within days.
The most dangerous robots.txt mistake is accidentally deploying a staging configuration to production. A Disallow: / rule in production will de-index your entire site. Always verify after deployment.
Common Mistakes to Avoid
| Mistake | Problem | Fix |
|---|---|---|
| Disallow: / in production | Blocks entire site from crawling | Remove or scope to specific paths |
| Blocking CSS/JS files | Google cannot render pages correctly | Allow access to static assets |
| Using robots.txt for security | Exposes paths you want hidden | Use authentication and access controls instead |
| No Sitemap directive | Crawlers may not find your sitemap | Add Sitemap: line with full URL |
| Wrong file location | Crawlers cannot find the file | Must be at domain root, not in a subdirectory |
| Conflicting rules | Unpredictable crawler behavior | Test with specific URLs to verify behavior |
Not updating after site changes. When you add new sections to your site or restructure URLs, review your robots.txt to ensure the rules still make sense. A rule that blocked /old-admin/ is useless after a migration to /dashboard/, and the new path may need blocking.
Forgetting subdomains. Each subdomain needs its own robots.txt. Rules at example.com/robots.txt do not apply to blog.example.com or shop.example.com.
Overly broad rules. Disallow: /p blocks /pricing, /products, /press, and every other path starting with /p. Be specific in your rules to avoid unintended blocking.
When to Update Your robots.txt
Review and update your robots.txt when:
- You launch new sections of your site
- You migrate to a new URL structure
- You add or remove admin/internal tools
- New crawler User-agents emerge that you want to block
- You change CMS platforms
- You move from HTTP to HTTPS
- You add or change subdomains
Set a quarterly reminder to review your robots.txt even if nothing obvious has changed. Site evolution can make old rules obsolete or introduce new paths that need blocking.
A well-crafted robots.txt file is just a few lines of text, but those lines determine how every search engine crawler interacts with your site.
Monitor robots.txt and Site Health
Site Watcher tracks your robots.txt, sitemap, SSL certificates, DNS records, and uptime from a single dashboard. Free for 3 targets. $39/mo for unlimited monitoring.