How to Create a robots.txt File: Format, Rules, and Platform Guides
Step-by-step guide to creating a robots.txt file with correct syntax, common directives, platform-specific instructions, and testing methods.
Last updated: 2026-02-17
Why Every Site Needs a robots.txt File
A robots.txt file gives you control over how search engine crawlers interact with your site. Without one, crawlers will attempt to access every URL they find -- including admin pages, duplicate content, search result pages, and other paths you may not want indexed.
Creating a robots.txt file is a five-minute task that pays dividends for the life of your site. It preserves crawl budget by directing crawlers away from unimportant pages, prevents indexing of internal tools, and provides a standard location to declare your sitemap.
This guide walks through creating a robots.txt file from scratch, configuring it for common platforms, and testing it before deployment.
File Format Basics
A robots.txt file is a plain text file with a specific structure. There is no HTML, no XML, no special encoding -- just plain text following a simple syntax.
Core Rules
- Filename: Must be exactly robots.txt (lowercase).
- Location: Must be in the root directory of your domain. The URL must be https://yourdomain.com/robots.txt.
- Encoding: UTF-8.
- Content-Type: text/plain.
- Line endings: Both \n (Unix) and \r\n (Windows) are acceptable.
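Putting these rules together, the simplest valid robots.txt that permits full crawling is just two lines -- an empty Disallow value blocks nothing:

```
User-agent: *
Disallow:
```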
File Structure
Every robots.txt file consists of one or more "groups." Each group starts with a User-agent: line and contains one or more Disallow: or Allow: directives.
User-agent: [crawler name or * for all]
Disallow: [path to block]
Allow: [path to permit]
Sitemap: [full URL to sitemap]
Comments start with # and extend to the end of the line:
# This is a comment
User-agent: * # This applies to all crawlers
Disallow: /admin/
Creating Your robots.txt Step by Step
Create a new text file
Open any text editor (VS Code, Notepad, Sublime Text, nano) and create a new file. Name it exactly robots.txt. Watch out for editors that silently save it as robots.txt.txt -- your file manager may hide the extension, so double-check.
Add a default User-agent group
Start with a wildcard rule that applies to all crawlers. This is your baseline configuration.
Define your Disallow rules
List the paths you want to block from crawlers. Think about admin areas, internal tools, search results, and duplicate content paths.
Add any Allow overrides
If you blocked a broad path but need to permit a specific subdirectory, add Allow directives.
Add crawler-specific rules
If you need different rules for specific crawlers (blocking AI crawlers, giving Googlebot special access), add additional User-agent groups.
Declare your sitemap
Add one or more Sitemap directives with the full URL to your XML sitemap.
Upload to your site root
Place the file at the root of your web server so it is accessible at https://yourdomain.com/robots.txt.
Building a robots.txt From Scratch
Let's build a complete robots.txt file for a typical business website, explaining each section.
Step 1: Start with the default rules
# robots.txt for yourdomain.com
User-agent: *
The User-agent: * line means these rules apply to every crawler that does not have its own specific section.
Step 2: Block administrative and internal paths
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
These are paths that serve no purpose in search results. Admin panels, API endpoints, and internal tools should not be crawled.
Step 3: Block search and filter URLs
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Search result pages, sorted lists, and filtered views create massive amounts of duplicate content. Blocking these paths prevents crawlers from wasting budget on low-value URL variations.
The * wildcard in /*?sort= matches any sequence of characters before ?sort=. This catches URLs like /products?sort=price and /blog?sort=date. The $ end-of-URL anchor is also available but not needed here.
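To illustrate how these wildcards behave, here is a rough TypeScript model of the matching logic -- an approximation of the documented behavior for teaching purposes, not any crawler's actual implementation:

```typescript
// Sketch of robots.txt pattern matching: '*' matches any run of characters,
// a trailing '$' anchors the end of the URL, and rules otherwise match as
// prefixes of the path. This is an illustrative model, not a real crawler.
function ruleMatches(pattern: string, path: string): boolean {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  // Escape regex metacharacters, then translate '*' into '.*'.
  const escaped = body
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  // Anchor at the start of the path (robots.txt rules are prefix matches).
  const re = new RegExp('^' + escaped + (anchored ? '$' : ''));
  return re.test(path);
}

console.log(ruleMatches('/*?sort=', '/products?sort=price')); // true
console.log(ruleMatches('/*?sort=', '/blog?sort=date'));      // true
console.log(ruleMatches('/*.pdf$', '/files/report.pdf'));     // true
console.log(ruleMatches('/*.pdf$', '/files/report.pdf?v=2')); // false (the $ anchor)
console.log(ruleMatches('/p', '/pricing'));                   // true (prefix match)
```

The last example shows why overly short rules are dangerous: /p matches /pricing, /products, and every other path starting with /p.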
Step 4: Add Allow overrides where needed
User-agent: Googlebot
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Allow: /api/docs/
Allow: /api/public/
If you have public API documentation or a public API that should be indexed, override the blanket /api/ block for Googlebot. Note that a crawler obeys only the single most specific User-agent group that matches it, so the Googlebot group must repeat the default Disallow rules -- a group containing only Allow lines would let Googlebot crawl everything else unimpeded. Within a group, the most specific (longest) matching rule wins, which is why Allow: /api/docs/ overrides Disallow: /api/.
Step 5: Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
Many site owners now block AI training crawlers. Each crawler needs its own User-agent block with Disallow: / to block it from the entire site.
Step 6: Declare your sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Monitor Your robots.txt Changes
Site Watcher alerts you when your robots.txt file changes unexpectedly -- catching misconfigurations before they affect crawling.
The Complete File
Here is the finished robots.txt:
# robots.txt for yourdomain.com
# Last updated: 2026-02-17
# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
# Googlebot follows only its own group, so repeat the defaults
# and add access to public API docs
User-agent: Googlebot
Disallow: /admin/
Disallow: /dashboard/
Disallow: /api/
Disallow: /internal/
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*?page=
Allow: /api/docs/
Allow: /api/public/
# Block AI training crawlers
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
# Sitemap
Sitemap: https://yourdomain.com/sitemap.xml
Platform-Specific Instructions
WordPress
WordPress auto-generates a virtual robots.txt file. To customize it:
Option 1: Physical file. Create a robots.txt file and upload it to your WordPress root directory (the same directory as wp-config.php). A physical file overrides WordPress's virtual one.
Option 2: SEO plugin. Yoast SEO, Rank Math, and All in One SEO all provide robots.txt editors in their settings. This is the easiest approach because it does not require FTP access.
WordPress's default virtual robots.txt blocks /wp-admin/ but allows /wp-admin/admin-ajax.php (needed for front-end functionality). If you create a custom file, replicate this rule.
A typical WordPress robots.txt:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /wp-json/
Disallow: /?s=
Disallow: /search/
Sitemap: https://yourdomain.com/sitemap_index.xml
Shopify
Shopify generates a robots.txt automatically, and you cannot upload a file directly. To customize it:
- Go to Online Store > Themes.
- Click Actions > Edit code.
- In the Templates folder, add a new template called robots.txt.liquid.
- Add your custom robots.txt content using Liquid syntax.
Shopify's default robots.txt blocks admin, cart, checkout, and internal paths. Review it before making changes to ensure you do not accidentally allow paths that should remain blocked.
Next.js
In a Next.js project, you can create a static or dynamic robots.txt.
Static file: Place a robots.txt file in the public/ directory. It will be served at the root path.
Dynamic generation (App Router):
// app/robots.ts
import { MetadataRoute } from 'next'

export default function robots(): MetadataRoute.Robots {
  return {
    rules: [
      {
        userAgent: '*',
        disallow: ['/api/', '/admin/', '/internal/'],
      },
      {
        userAgent: 'GPTBot',
        disallow: ['/'],
      },
    ],
    sitemap: 'https://yourdomain.com/sitemap.xml',
  }
}
Static Site Generators (Gatsby, Hugo, Astro)
Place a robots.txt file in the static/public directory:
- Gatsby: static/robots.txt
- Hugo: static/robots.txt
- Astro: public/robots.txt
The file is copied as-is to the build output root.
Common Blocking Rules
Here are ready-to-use directive patterns for common scenarios.
Block all crawlers from entire site
User-agent: *
Disallow: /
Useful for staging sites and sites under development.
Block URL parameters
Disallow: /*? -- blocks all URLs containing query parameters. Be careful: this may block legitimate parameterized pages.
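One way to hedge, sketched here with a hypothetical ?id= parameter: because the most specific (longest) matching rule wins, a longer Allow can carve an exception out of the blanket parameter block:

```
User-agent: *
Disallow: /*?
Allow: /*?id=
```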
Block specific file types
Disallow: /*.pdf$ -- blocks PDF files from being crawled. Useful if PDFs contain duplicate content or sensitive documents.
Block specific sections
Disallow: /category/
Disallow: /tag/
Blocks WordPress taxonomy archive pages that often create thin, duplicate content.
Crawl-delay for aggressive bots
Crawl-delay: 10 -- asks crawlers to wait 10 seconds between requests. Supported by Bing and Yandex, but not Google.
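For example, to throttle Bingbot specifically without changing what it may crawl, give it its own group:

```
User-agent: Bingbot
Crawl-delay: 10
```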
Testing Your robots.txt
Never deploy a new robots.txt without testing. A single mistake can block Googlebot from your entire site.
Validate syntax online
Use an online robots.txt validator to check for syntax errors. These tools parse the file and flag incorrect formatting, missing colons, or invalid directives.
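As a sketch of the kind of check such a tool performs, this minimal TypeScript linter flags missing colons and unknown directives -- illustrative only, since real validators implement the full specification:

```typescript
// Minimal line-level robots.txt lint: flags lines without a ':' separator
// and directives outside the commonly supported set. Illustrative sketch,
// not a complete validator.
const KNOWN_DIRECTIVES = new Set([
  'user-agent', 'disallow', 'allow', 'sitemap', 'crawl-delay',
]);

function lintRobotsTxt(content: string): string[] {
  const problems: string[] = [];
  content.split(/\r?\n/).forEach((rawLine, i) => {
    const line = rawLine.replace(/#.*$/, '').trim(); // strip comments
    if (line === '') return;                         // blank lines are fine
    const colon = line.indexOf(':');
    if (colon === -1) {
      problems.push(`line ${i + 1}: missing ':' separator`);
      return;
    }
    const directive = line.slice(0, colon).trim().toLowerCase();
    if (!KNOWN_DIRECTIVES.has(directive)) {
      problems.push(`line ${i + 1}: unknown directive '${directive}'`);
    }
  });
  return problems;
}

console.log(lintRobotsTxt('User-agent: *\nDisallow: /admin/')); // no problems
console.log(lintRobotsTxt('Dissalow: /admin/'));                // flags the typo
```

A check like this catches the most common copy-paste mistakes (misspelled directives, dropped colons) before the file ever reaches production.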
Test specific URLs
Use Google Search Console's URL Inspection tool to test whether specific URLs are blocked. Enter important pages and verify they are not accidentally blocked.
Check in a browser
Navigate to https://yourdomain.com/robots.txt in your browser. Verify it loads as plain text, not as an HTML page or error.
Review the Content-Type header
Use browser developer tools (Network tab) to confirm the response has Content-Type: text/plain. An incorrect Content-Type can cause parsing failures.
Monitor after deployment
After deploying, watch your Google Search Console crawl stats and index coverage for unexpected changes. A misconfigured robots.txt will show effects within days.
The most dangerous robots.txt mistake is accidentally deploying a staging configuration to production. A Disallow: / rule in production will de-index your entire site. Always verify after deployment.
Common Mistakes to Avoid
| Mistake | Problem | Fix |
|---|---|---|
| Disallow: / in production | Blocks entire site from crawling | Remove or scope to specific paths |
| Blocking CSS/JS files | Google cannot render pages correctly | Allow access to static assets |
| Using robots.txt for security | Exposes paths you want hidden | Use authentication and access controls instead |
| No Sitemap directive | Crawlers may not find your sitemap | Add Sitemap: line with full URL |
| Wrong file location | Crawlers cannot find the file | Must be at domain root, not in a subdirectory |
| Conflicting rules | Unpredictable crawler behavior | Test with specific URLs to verify behavior |
Not updating after site changes. When you add new sections to your site or restructure URLs, review your robots.txt to ensure the rules still make sense. A rule that blocked /old-admin/ is useless after a migration to /dashboard/, and the new path may need blocking.
Forgetting subdomains. Each subdomain needs its own robots.txt. Rules at example.com/robots.txt do not apply to blog.example.com or shop.example.com.
Overly broad rules. Disallow: /p blocks /pricing, /products, /press, and every other path starting with /p. Be specific in your rules to avoid unintended blocking.
When to Update Your robots.txt
Review and update your robots.txt when:
- You launch new sections of your site
- You migrate to a new URL structure
- You add or remove admin/internal tools
- New crawler User-agents emerge that you want to block
- You change CMS platforms
- You move from HTTP to HTTPS
- You add or change subdomains
Set a quarterly reminder to review your robots.txt even if nothing obvious has changed. Site evolution can make old rules obsolete or introduce new paths that need blocking.
A well-crafted robots.txt file is just a few lines of text, but those lines determine how every search engine crawler interacts with your site.
Monitor robots.txt and Site Health
Site Watcher tracks your robots.txt, sitemap, SSL certificates, DNS records, and uptime from a single dashboard. Free for 3 targets. $39/mo for unlimited monitoring.