
Master robots.txt: Control What Google Can Crawl

The robots.txt file is your primary tool for controlling how search engine crawlers interact with your website. Learn how to use it effectively without accidentally blocking important content.

The robots.txt file is a simple text file placed at your website's root that tells search engine crawlers which pages they can and cannot access. While it seems straightforward, incorrect configuration can have serious SEO consequences, from wasting crawl budget to accidentally blocking your entire site.

What is robots.txt?

The robots.txt file (also called the robots exclusion protocol) is a text file that sits at the root of your website (e.g., https://yoursite.com/robots.txt). It provides instructions to web crawlers about which parts of your site they should or shouldn't access.

Key characteristics:

  • Location: Must be at the root of the host (not in subdirectories); each subdomain needs its own file
  • Filename: Must be exactly "robots.txt" (lowercase)
  • Format: Plain text file
  • Protocol: Follows the Robots Exclusion Standard

First stop: crawlers check robots.txt before accessing any other page on your site.

When Googlebot (or any well-behaved crawler) visits your website, it first requests your robots.txt file. Based on the directives it finds, it knows which pages it's allowed to crawl.

How Robots.txt Works

The crawling process follows this sequence:

  1. Crawler wants to visit your site
  2. Crawler requests https://yoursite.com/robots.txt
  3. Crawler reads and interprets the rules
  4. Crawler follows the rules when accessing pages
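Python's standard library includes a parser for this protocol, so the sequence above can be sketched with `urllib.robotparser`. The sample rules and URLs here are illustrative; a real crawler would fetch the live file instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Steps 2-3: a real crawler requests https://yoursite.com/robots.txt;
# here we parse sample rules directly to keep the sketch offline.
sample_rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(sample_rules)

# Step 4: the crawler consults the parsed rules before each request.
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/post"))      # True
print(rp.can_fetch("Googlebot", "https://yoursite.com/admin/settings")) # False
```

For a live check, calling `rp.set_url("https://yoursite.com/robots.txt")` followed by `rp.read()` fetches and parses the real file before you query `can_fetch()`.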

What Happens Without robots.txt?

If no robots.txt file exists, crawlers assume they can access everything. This isn't necessarily bad - many small sites operate fine without one.

What Happens with Errors?

  • 404 (not found): Crawlers assume full access
  • 500 (server error): Crawlers may defer crawling
  • 403 (forbidden): Treatment varies by crawler

Note: robots.txt is a directive, not a security measure. Malicious bots can ignore it, and blocked URLs can still be indexed if other pages link to them. Never use robots.txt to hide sensitive information.
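The status handling above can be sketched as a small policy function. The mapping mirrors the list in this section; the function name is my own:

```python
def robots_fetch_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt request to crawl behavior,
    following the conventions described above."""
    if status_code == 200:
        return "parse rules"         # file exists: obey its directives
    if status_code == 404:
        return "assume full access"  # missing file means no restrictions
    if 500 <= status_code < 600:
        return "defer crawling"      # server trouble: come back later
    return "crawler-specific"        # e.g. 403: treatment varies by crawler

print(robots_fetch_policy(404))  # assume full access
```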

Syntax and Directives

The robots.txt syntax is relatively simple but must be exact:

User-agent

Specifies which crawler the rules apply to:

User-agent: *           # All crawlers
User-agent: Googlebot   # Only Googlebot
User-agent: Bingbot     # Only Bingbot

Disallow

Blocks access to specified paths:

Disallow: /admin/       # Block /admin/ directory
Disallow: /private      # Block URLs starting with /private
Disallow: /             # Block entire site

Allow

Explicitly permits access (useful to override Disallow):

Disallow: /folder/
Allow: /folder/public/  # Allow this subdirectory
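When Allow and Disallow both match a URL, Google applies the most specific (longest) matching rule, with Allow winning ties. A minimal sketch of that precedence for plain path prefixes, with illustrative names throughout:

```python
def decide(path, rules):
    """Resolve Allow/Disallow conflicts by longest-match precedence.

    rules: list of (directive, pattern) pairs using plain path prefixes
    (no wildcards). The longest matching pattern wins; Allow wins ties.
    If nothing matches, crawling is allowed by default.
    """
    verdict, best_len = "allow", -1
    for directive, pattern in rules:
        if path.startswith(pattern):
            longer = len(pattern) > best_len
            tie_allow = len(pattern) == best_len and directive == "allow"
            if longer or tie_allow:
                verdict, best_len = directive, len(pattern)
    return verdict

rules = [("disallow", "/folder/"), ("allow", "/folder/public/")]
print(decide("/folder/public/page.html", rules))  # allow
print(decide("/folder/secret.html", rules))       # disallow
```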

Sitemap

Points to your XML sitemap:

Sitemap: https://yoursite.com/sitemap.xml

Wildcards

Pattern matching for flexible rules:

Disallow: /*.pdf$       # Block all PDFs
Disallow: /*?           # Block URLs with parameters
Disallow: /*/temp       # Block /temp in any directory
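These patterns can be spot-checked offline by translating them to regular expressions. Note that Python's built-in `urllib.robotparser` does simple prefix matching and does not implement `*` and `$`, so a hand-rolled matcher like this sketch is useful for quick checks (the function name is my own):

```python
import re

def pattern_matches(pattern, path):
    """Check a robots.txt pattern against a URL path.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the path; otherwise the pattern matches as a prefix.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # turn the literal '$' back into an anchor
    return re.match(regex, path) is not None

print(pattern_matches("/*.pdf$", "/docs/report.pdf"))   # True
print(pattern_matches("/*.pdf$", "/docs/report.pdfx"))  # False
print(pattern_matches("/*?", "/products?sort=price"))   # True
```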

Directive    Purpose                      Example
User-agent   Target specific crawlers     User-agent: Googlebot
Disallow     Block paths                  Disallow: /admin/
Allow        Permit paths (override)      Allow: /admin/public/
Sitemap      Reference sitemap            Sitemap: https://yoursite.com/sitemap.xml
*            Wildcard (any characters)    Disallow: /*?sort=
$            End of URL                   Disallow: /*.pdf$

Common Examples

Allow Everything (Default)

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

Sitemap: https://yoursite.com/sitemap.xml

Block Parameters (Filter Pages)

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=

Sitemap: https://yoursite.com/sitemap.xml

Block Specific Bots

User-agent: *
Allow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

WordPress Example

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /readme.html
Disallow: /license.txt

Sitemap: https://yoursite.com/sitemap_index.xml


Testing Your Robots.txt

Always test your robots.txt before deploying changes:

Google Search Console

  1. Go to Search Console
  2. Navigate to Settings > robots.txt
  3. Use the testing tool to check specific URLs

Manual Testing

  1. Visit https://yoursite.com/robots.txt in a browser
  2. Verify the file loads correctly
  3. Check syntax carefully

Online Validators

Use tools like:

  • Google's robots.txt Tester
  • Merkle's robots.txt Test Tool
  • Screaming Frog's robots.txt validation

Note: test every URL pattern you're blocking. A single typo can block important pages or fail to block what you intended.
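One class of typo, a misspelled directive, can be caught automatically before deployment. A minimal linter sketch; the directive list and function name are my own, so extend the set for any extra directives you rely on:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Flag lines that are neither blank, comments, nor known directives."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {lineno}: missing ':' separator")
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append(f"line {lineno}: unknown directive '{directive}'")
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/"))
# ["line 2: unknown directive 'disalow'"]
```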

Best Practices

  1. Keep it simple: Use the minimum necessary rules
  2. Test thoroughly: Verify all rules before deploying
  3. Document your rules: Add comments explaining why each rule exists
  4. Include sitemap: Always reference your sitemap(s)
  5. Be specific: Target exact paths rather than broad patterns
  6. Monitor changes: Track modifications in version control
  7. Regular audits: Review quarterly for outdated rules

Commenting Your Rules

Add comments for future reference:

# Block admin area
User-agent: *
Disallow: /admin/

# Allow Googlebot access to AJAX
Allow: /wp-admin/admin-ajax.php

# Block search results pages (duplicate content)
Disallow: /search/

# Block staging subdirectory
Disallow: /staging/

Common Mistakes to Avoid

1. Blocking Important Resources

Don't block CSS, JavaScript, or images that Google needs to render pages:

# BAD - blocks rendering resources
Disallow: /css/
Disallow: /js/
Disallow: /images/

2. Blocking Your Entire Site

A single slash blocks everything:

# This blocks EVERYTHING!
User-agent: *
Disallow: /

3. Wrong File Location

robots.txt must be at the root. These won't work:

  • https://yoursite.com/pages/robots.txt
  • https://yoursite.com/blog/robots.txt
  • https://subdomain.yoursite.com/robots.txt (valid only for that subdomain, not the root domain)

4. Case Sensitivity

URLs are case-sensitive. /Admin/ and /admin/ are different.

5. Using for Security

robots.txt doesn't hide content. Anyone can read it, and search engines can still index blocked URLs via links.

6. Forgetting Trailing Slashes

# These are different:
Disallow: /admin   # Blocks /admin, /administrator, /admin-panel
Disallow: /admin/  # Blocks only /admin/ directory

Important Limitations

Understand what robots.txt cannot do:

Not a Security Tool

Anyone can read your robots.txt file. Don't use it to hide sensitive content.

Doesn't Prevent Indexing

If other sites link to a blocked page, Google may still index the URL (without the content). Use noindex meta tags instead for pages you don't want indexed at all.
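To keep a page out of the index entirely, let crawlers fetch it (do not Disallow it, or they will never see the tag) and mark it in the page's head:

```html
<!-- In the page's <head>; do NOT also block the URL in robots.txt -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the equivalent is the `X-Robots-Tag: noindex` HTTP response header.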

Doesn't Remove Already-Indexed Pages

Blocking a previously indexed page doesn't remove it from search results. Use the Removals tool in Google Search Console for that.

Not All Bots Obey

Malicious bots and scrapers ignore robots.txt. Only use it for legitimate search engine crawlers.

Conclusion

The robots.txt file is a powerful tool for controlling how search engines crawl your website. Used correctly, it helps optimize your crawl budget and keeps search engines focused on your important content.

Key takeaways:

  • Place robots.txt at your domain root
  • Use specific paths rather than broad patterns
  • Always test before deploying changes
  • Include your sitemap reference
  • Don't block resources needed for rendering
  • Remember it's not a security measure
  • Use noindex for pages you don't want indexed (not just disallow)

Optimize Your Crawling and Indexing

While robots.txt controls what gets crawled, RSS AutoIndex ensures your new content gets noticed by Google immediately after publication.

Start Free Trial