The robots.txt file is a simple text file placed at your website's root that tells search engine crawlers which pages they can and cannot access. While it seems straightforward, incorrect configuration can have serious SEO consequences - from wasting crawl budget to accidentally blocking your entire site.
What is robots.txt?
The robots.txt file (the file defined by the Robots Exclusion Protocol) is a text file that sits at the root of your website (e.g., https://yoursite.com/robots.txt). It provides instructions to web crawlers about which parts of your site they should or shouldn't access.
Key characteristics:
- Location: Must be at the root domain (not in subdirectories)
- Filename: Must be exactly "robots.txt" (lowercase)
- Format: Plain text file
- Protocol: Follows the Robots Exclusion Standard
When Googlebot (or any well-behaved crawler) visits your website, it first requests your robots.txt file. Based on the directives it finds, it knows which pages it's allowed to crawl.
How Robots.txt Works
The crawling process follows this sequence:
- Crawler wants to visit your site
- Crawler requests https://yoursite.com/robots.txt
- Crawler reads and interprets the rules
- Crawler follows the rules when accessing pages
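The sequence above can be sketched with Python's standard-library `urllib.robotparser`. In a real crawler you would fetch the live file with `set_url()` and `read()`; here the rules are parsed inline so the example runs offline:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Step 2 in a real crawler:
#   rp.set_url("https://yoursite.com/robots.txt"); rp.read()
# Here we parse a sample file inline instead of fetching it.
rp.parse("""\
User-agent: *
Disallow: /admin/
""".splitlines())

# Steps 3-4: consult the rules before requesting each page.
print(rp.can_fetch("MyCrawler", "https://yoursite.com/blog/post"))   # True
print(rp.can_fetch("MyCrawler", "https://yoursite.com/admin/users")) # False
```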
What Happens Without robots.txt?
If no robots.txt file exists, crawlers assume they can access everything. This isn't necessarily bad - many small sites operate fine without one.
What Happens with Errors?
- 404 (not found): Crawlers assume full access
- 5xx (server error): Google temporarily treats the site as fully disallowed and may defer crawling
- 403 (forbidden): Google treats it like other 4xx errors (full access); other crawlers vary
Syntax and Directives
The robots.txt syntax is relatively simple but must be exact:
User-agent
Specifies which crawler the rules apply to:
```
User-agent: *         # All crawlers
User-agent: Googlebot # Only Googlebot
User-agent: Bingbot   # Only Bingbot
```
Disallow
Blocks access to specified paths:
```
Disallow: /admin/  # Block /admin/ directory
Disallow: /private # Block URLs starting with /private
Disallow: /        # Block entire site
```
Allow
Explicitly permits access (useful to override Disallow):
```
Disallow: /folder/
Allow: /folder/public/ # Allow this subdirectory
```
Sitemap
Points to your XML sitemap:
```
Sitemap: https://yoursite.com/sitemap.xml
```
Wildcards
Pattern matching for flexible rules:
```
Disallow: /*.pdf$ # Block all PDFs
Disallow: /*?     # Block URLs with parameters
Disallow: /*/temp # Block /temp in any subdirectory
```
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Target specific crawlers | User-agent: Googlebot |
| Disallow | Block paths | Disallow: /admin/ |
| Allow | Permit paths (override) | Allow: /admin/public/ |
| Sitemap | Reference sitemap | Sitemap: https://yoursite.com/sitemap.xml |
| * | Wildcard (any characters) | Disallow: /*?sort= |
| $ | End of URL | Disallow: /*.pdf$ |
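Note that Python's standard `urllib.robotparser` follows the original standard and does not implement the `*` and `$` wildcards. As a rough sketch of how Google-style patterns behave, they can be translated to regular expressions (`pattern_to_regex` is a hypothetical helper written for this article, not part of any library):

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any sequence of characters; a trailing '$' anchors
    the pattern to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    core = pattern[:-1] if anchored else pattern
    regex = ".*".join(re.escape(part) for part in core.split("*"))
    return re.compile(regex + ("$" if anchored else ""))

blocked_pdfs = pattern_to_regex("/*.pdf$")
print(bool(blocked_pdfs.match("/files/report.pdf")))     # True
print(bool(blocked_pdfs.match("/files/report.pdf?v=2"))) # False

params = pattern_to_regex("/*?")
print(bool(params.match("/products?sort=price")))        # True
```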
Common Examples
Allow Everything (Default)
```
User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
```
Block Specific Directories
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/

Sitemap: https://yoursite.com/sitemap.xml
```
Block Parameters (Filter Pages)
```
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=

Sitemap: https://yoursite.com/sitemap.xml
```
Block Specific Bots
```
User-agent: *
Allow: /

User-agent: AhrefsBot
Disallow: /

User-agent: SemrushBot
Disallow: /
```
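You can verify this kind of multi-group file with the standard-library parser. CPython's `urllib.robotparser` keeps the `User-agent: *` group as a fallback, so each bot is matched against its own group first:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-agent: *
Allow: /

User-agent: AhrefsBot
Disallow: /
""".splitlines())

# AhrefsBot matches its own group and is blocked everywhere;
# any other crawler falls back to the wildcard group.
print(rp.can_fetch("AhrefsBot", "https://yoursite.com/page"))  # False
print(rp.can_fetch("Googlebot", "https://yoursite.com/page"))  # True
```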
WordPress Example
```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /readme.html
Disallow: /license.txt

Sitemap: https://yoursite.com/sitemap_index.xml
```
Ensure Your New Content Gets Indexed
A well-configured robots.txt is just the start. RSS AutoIndex actively notifies Google about your new content, ensuring it gets crawled and indexed quickly.
Try RSS AutoIndex Free
Testing Your Robots.txt
Always test your robots.txt before deploying changes:
Google Search Console
- Go to Search Console
- Navigate to Settings > robots.txt
- Use the testing tool to check specific URLs
Manual Testing
- Visit https://yoursite.com/robots.txt in a browser
- Verify the file loads correctly
- Check syntax carefully
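Beyond eyeballing the file in a browser, you can script a quick offline spot-check with Python's standard-library parser (the rules below are a sample; swap in your own, or use `rp.set_url(...)` followed by `rp.read()` to fetch a live file):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Spot-check a handful of representative URLs.
for url in ("/blog/hello", "/wp-admin/options.php", "/wp-admin/admin-ajax.php"):
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(f"{url}: {verdict}")
```

One caveat: CPython applies rules in file order (first match wins), which is why the `Allow` line precedes the broader `Disallow` here; Google instead picks the longest matching rule, so results can differ on files that rely on that behavior.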
Online Validators
Use tools like:
- Google Search Console's robots.txt report (the standalone robots.txt Tester has been retired)
- Merkle's robots.txt Test Tool
- Screaming Frog's robots.txt validation
Best Practices
- Keep it simple: Use the minimum necessary rules
- Test thoroughly: Verify all rules before deploying
- Document your rules: Add comments explaining why each rule exists
- Include sitemap: Always reference your sitemap(s)
- Be specific: Target exact paths rather than broad patterns
- Monitor changes: Track modifications in version control
- Regular audits: Review quarterly for outdated rules
Commenting Your Rules
Add comments for future reference:
```
# Block admin area
User-agent: *
Disallow: /admin/

# Allow Googlebot access to AJAX
Allow: /wp-admin/admin-ajax.php

# Block search results pages (duplicate content)
Disallow: /search/

# Block staging subdirectory
Disallow: /staging/
```
Common Mistakes to Avoid
1. Blocking Important Resources
Don't block CSS, JavaScript, or images that Google needs to render pages:
```
# BAD - blocks rendering resources
Disallow: /css/
Disallow: /js/
Disallow: /images/
```
2. Blocking Your Entire Site
A single slash blocks everything:
```
# This blocks EVERYTHING!
User-agent: *
Disallow: /
```
3. Wrong File Location
robots.txt must be at the root. These won't work:
- https://yoursite.com/pages/robots.txt
- https://yoursite.com/blog/robots.txt
- https://subdomain.yoursite.com/robots.txt (valid, but it controls only that subdomain, not the main site)
4. Case Sensitivity
URLs are case-sensitive. /Admin/ and /admin/ are different.
5. Using for Security
robots.txt doesn't hide content. Anyone can read it, and search engines can still index blocked URLs via links.
6. Forgetting Trailing Slashes
```
# These are different:
Disallow: /admin  # Blocks /admin, /administrator, /admin-panel
Disallow: /admin/ # Blocks only the /admin/ directory
```
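Because Disallow matching is prefix-based, the trailing slash decides which URLs are caught. A minimal illustration of the two rules above:

```python
paths = ["/admin", "/administrator", "/admin-panel", "/admin/users"]

results = {}
for path in paths:
    hit_no_slash = path.startswith("/admin")   # Disallow: /admin
    hit_slash = path.startswith("/admin/")     # Disallow: /admin/
    results[path] = (hit_no_slash, hit_slash)
    print(f"{path}: /admin -> {hit_no_slash}, /admin/ -> {hit_slash}")
```

`/admin` catches all four paths, while `/admin/` catches only URLs inside the directory.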
Important Limitations
Understand what robots.txt cannot do:
Not a Security Tool
Anyone can read your robots.txt file. Don't use it to hide sensitive content.
Doesn't Prevent Indexing
If other sites link to a blocked page, Google may still index the URL (without the content). Use noindex meta tags instead for pages you don't want indexed at all.
Doesn't Remove Already-Indexed Pages
Blocking a previously indexed page doesn't remove it from search results. Use the URL Removal tool for that.
Not All Bots Obey
Malicious bots and scrapers ignore robots.txt. Only use it for legitimate search engine crawlers.
Conclusion
The robots.txt file is a powerful tool for controlling how search engines crawl your website. Used correctly, it helps optimize your crawl budget and keeps search engines focused on your important content.
Key takeaways:
- Place robots.txt at your domain root
- Use specific paths rather than broad patterns
- Always test before deploying changes
- Include your sitemap reference
- Don't block resources needed for rendering
- Remember it's not a security measure
- Use noindex for pages you don't want indexed (not just disallow)
Optimize Your Crawling and Indexing
While robots.txt controls what gets crawled, RSS AutoIndex ensures your new content gets noticed by Google immediately after publication.
Start Free Trial