
Master robots.txt: Control What Google Can Crawl

The robots.txt file is your primary tool for controlling how search engine crawlers interact with your website. Learn how to use it effectively without accidentally blocking important content.

The robots.txt file is a simple text file placed at your website's root that tells search engine crawlers which pages they can and cannot access. While it seems straightforward, incorrect configuration can have serious SEO consequences, from wasting crawl budget to accidentally blocking your entire site.

What is robots.txt?

The robots.txt file (also called the robots exclusion protocol) is a text file that sits at the root of your website (e.g., https://yoursite.com/robots.txt). It provides instructions to web crawlers about which parts of your site they should or shouldn't access.

Key characteristics:

  • Location: Must be at the root of the host (not in subdirectories); each subdomain needs its own file
  • Filename: Must be exactly "robots.txt" (lowercase)
  • Format: Plain text file
  • Protocol: Follows the Robots Exclusion Standard

First stop: crawlers check robots.txt before accessing any other page on your site.

When Googlebot (or any well-behaved crawler) visits your website, it first requests your robots.txt file. Based on the directives it finds, it knows which pages it's allowed to crawl.

How Robots.txt Works

The crawling process follows this sequence:

  1. Crawler wants to visit your site
  2. Crawler requests https://yoursite.com/robots.txt
  3. Crawler reads and interprets the rules
  4. Crawler follows the rules when accessing pages
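Python's standard library includes a parser for this protocol, so the sequence above can be sketched with `urllib.robotparser`. The sample rules and URLs here are illustrative; a real crawler would fetch the live file instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Steps 2-3: a real crawler requests https://yoursite.com/robots.txt;
# here we parse sample rules directly to keep the sketch offline.
sample_rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(sample_rules)

# Step 4: the crawler consults the parsed rules before each request.
print(rp.can_fetch("Googlebot", "https://yoursite.com/blog/post"))      # True
print(rp.can_fetch("Googlebot", "https://yoursite.com/admin/settings")) # False
```

For a live check, calling `rp.set_url("https://yoursite.com/robots.txt")` followed by `rp.read()` fetches and parses the real file before you query `can_fetch()`.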

What Happens Without robots.txt?

If no robots.txt file exists, crawlers assume they can access everything. This isn't necessarily bad - many small sites operate fine without one.

What Happens with Errors?

  • 404 (not found): Crawlers assume full access
  • 500 (server error): Crawlers may defer crawling
  • 403 (forbidden): Treatment varies by crawler

Note: robots.txt is a directive, not a security measure. Malicious bots can ignore it, and blocked URLs can still be indexed if other pages link to them. Never use robots.txt to hide sensitive information.
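The status handling above can be sketched as a small policy function. The mapping mirrors the list in this section; the function name is my own:

```python
def robots_fetch_policy(status_code: int) -> str:
    """Map the HTTP status of a robots.txt request to crawl behavior,
    following the conventions described above."""
    if status_code == 200:
        return "parse rules"         # file exists: obey its directives
    if status_code == 404:
        return "assume full access"  # missing file means no restrictions
    if 500 <= status_code < 600:
        return "defer crawling"      # server trouble: come back later
    return "crawler-specific"        # e.g. 403: treatment varies by crawler

print(robots_fetch_policy(404))  # assume full access
```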

Syntax and Directives

The robots.txt syntax is relatively simple but must be exact:

User-agent

Specifies which crawler the rules apply to:

User-agent: *           # All crawlers
User-agent: Googlebot   # Only Googlebot
User-agent: Bingbot     # Only Bingbot

Disallow

Blocks access to specified paths:

Disallow: /admin/       # Block /admin/ directory
Disallow: /private      # Block URLs starting with /private
Disallow: /             # Block entire site

Allow

Explicitly permits access (useful to override Disallow):

Disallow: /folder/
Allow: /folder/public/  # Allow this subdirectory
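When Allow and Disallow both match a URL, Google applies the most specific (longest) matching rule, with Allow winning ties. A minimal sketch of that precedence for plain path prefixes, with illustrative names throughout:

```python
def decide(path, rules):
    """Resolve Allow/Disallow conflicts by longest-match precedence.

    rules: list of (directive, pattern) pairs using plain path prefixes
    (no wildcards). The longest matching pattern wins; Allow wins ties.
    If nothing matches, crawling is allowed by default.
    """
    verdict, best_len = "allow", -1
    for directive, pattern in rules:
        if path.startswith(pattern):
            longer = len(pattern) > best_len
            tie_allow = len(pattern) == best_len and directive == "allow"
            if longer or tie_allow:
                verdict, best_len = directive, len(pattern)
    return verdict

rules = [("disallow", "/folder/"), ("allow", "/folder/public/")]
print(decide("/folder/public/page.html", rules))  # allow
print(decide("/folder/secret.html", rules))       # disallow
```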

Sitemap

Points to your XML sitemap:

Sitemap: https://yoursite.com/sitemap.xml

Wildcards

Pattern matching for flexible rules:

Disallow: /*.pdf$       # Block all PDFs
Disallow: /*?           # Block URLs with parameters
Disallow: /*/temp       # Block /temp in any directory
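These patterns can be spot-checked offline by translating them to regular expressions. Note that Python's built-in `urllib.robotparser` does simple prefix matching and does not implement `*` and `$`, so a hand-rolled matcher like this sketch is useful for quick checks (the function name is my own):

```python
import re

def pattern_matches(pattern, path):
    """Check a robots.txt pattern against a URL path.

    '*' matches any sequence of characters; a trailing '$' anchors the
    end of the path; otherwise the pattern matches as a prefix.
    """
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"   # turn the literal '$' back into an anchor
    return re.match(regex, path) is not None

print(pattern_matches("/*.pdf$", "/docs/report.pdf"))   # True
print(pattern_matches("/*.pdf$", "/docs/report.pdfx"))  # False
print(pattern_matches("/*?", "/products?sort=price"))   # True
```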

Directive    Purpose                      Example
User-agent   Target specific crawlers     User-agent: Googlebot
Disallow     Block paths                  Disallow: /admin/
Allow        Permit paths (override)      Allow: /admin/public/
Sitemap      Reference sitemap            Sitemap: https://yoursite.com/sitemap.xml
*            Wildcard (any characters)    Disallow: /*?sort=
$            End of URL                   Disallow: /*.pdf$

Common Examples

Allow Everything (Default)

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

Sitemap: https://yoursite.com/sitemap.xml

Block Parameters (Filter Pages)

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=

Sitemap: https://yoursite.com/sitemap.xml

Block Specific Bots

User-agent: *
Allow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /

WordPress Example

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /readme.html
Disallow: /license.txt

Sitemap: https://yoursite.com/sitemap_index.xml


Testing Your Robots.txt

Always test your robots.txt before deploying changes:

Google Search Console

  1. Go to Search Console
  2. Navigate to Settings > robots.txt
  3. Use the testing tool to check specific URLs

Manual Testing

  1. Visit https://yoursite.com/robots.txt in a browser
  2. Verify the file loads correctly
  3. Check syntax carefully

Online Validators

Use tools like:

  • Google's robots.txt Tester
  • Merkle's robots.txt Test Tool
  • Screaming Frog's robots.txt validation

Note: test every URL pattern you're blocking. A single typo can block important pages or fail to block what you intended.
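One class of typo, a misspelled directive, can be caught automatically before deployment. A minimal linter sketch; the directive list and function name are my own, so extend the set for any extra directives you rely on:

```python
KNOWN_DIRECTIVES = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text):
    """Flag lines that are neither blank, comments, nor known directives."""
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {lineno}: missing ':' separator")
            continue
        directive = line.split(":", 1)[0].strip().lower()
        if directive not in KNOWN_DIRECTIVES:
            problems.append(f"line {lineno}: unknown directive '{directive}'")
    return problems

print(lint_robots("User-agent: *\nDisalow: /admin/"))
# ["line 2: unknown directive 'disalow'"]
```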

Best Practices

  1. Keep it simple: Use the minimum necessary rules
  2. Test thoroughly: Verify all rules before deploying
  3. Document your rules: Add comments explaining why each rule exists
  4. Include sitemap: Always reference your sitemap(s)
  5. Be specific: Target exact paths rather than broad patterns
  6. Monitor changes: Track modifications in version control
  7. Regular audits: Review quarterly for outdated rules

Commenting Your Rules

Add comments for future reference:

# Block admin area
User-agent: *
Disallow: /admin/

# Allow Googlebot access to AJAX
Allow: /wp-admin/admin-ajax.php

# Block search results pages (duplicate content)
Disallow: /search/

# Block staging subdirectory
Disallow: /staging/

Common Mistakes to Avoid

1. Blocking Important Resources

Don't block CSS, JavaScript, or images that Google needs to render pages:

# BAD - blocks rendering resources
Disallow: /css/
Disallow: /js/
Disallow: /images/

2. Blocking Your Entire Site

A single slash blocks everything:

# This blocks EVERYTHING!
User-agent: *
Disallow: /

3. Wrong File Location

robots.txt must be at the root. These won't work:

  • https://yoursite.com/pages/robots.txt
  • https://yoursite.com/blog/robots.txt
  • https://subdomain.yoursite.com/robots.txt (valid only for that subdomain, not the root domain)

4. Case Sensitivity

URLs are case-sensitive. /Admin/ and /admin/ are different.

5. Using for Security

robots.txt doesn't hide content. Anyone can read it, and search engines can still index blocked URLs via links.

6. Forgetting Trailing Slashes

# These are different:
Disallow: /admin   # Blocks /admin, /administrator, /admin-panel
Disallow: /admin/  # Blocks only /admin/ directory

Important Limitations

Understand what robots.txt cannot do:

Not a Security Tool

Anyone can read your robots.txt file. Don't use it to hide sensitive content.

Doesn't Prevent Indexing

If other sites link to a blocked page, Google may still index the URL (without the content). Use noindex meta tags instead for pages you don't want indexed at all.
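To keep a page out of the index entirely, let crawlers fetch it (do not Disallow it, or they will never see the tag) and mark it in the page's head:

```html
<!-- In the page's <head>; do NOT also block the URL in robots.txt -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the equivalent is the `X-Robots-Tag: noindex` HTTP response header.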

Doesn't Remove Already-Indexed Pages

Blocking a previously indexed page doesn't remove it from search results. Use the Removals tool in Google Search Console for that.

Not All Bots Obey

Malicious bots and scrapers ignore robots.txt. Only use it for legitimate search engine crawlers.

Conclusion

The robots.txt file is a powerful tool for controlling how search engines crawl your website. Used correctly, it helps optimize your crawl budget and keeps search engines focused on your important content.

Key takeaways:

  • Place robots.txt at your domain root
  • Use specific paths rather than broad patterns
  • Always test before deploying changes
  • Include your sitemap reference
  • Don't block resources needed for rendering
  • Remember it's not a security measure
  • Use noindex for pages you don't want indexed (not just disallow)

Optimize Your Crawling and Indexing

While robots.txt controls what gets crawled, RSS AutoIndex ensures your new content gets noticed by Google immediately after publication.

Start Free Trial