Robots.txt Guide: Controlling How Search Engines Crawl Your Site
Robots.txt is a plain text file at the root of your website that tells search engine crawlers which areas of your site they are allowed or not allowed to access. It is the first file crawlers check before accessing your site, and it gives you control over crawl behavior without requiring changes to your pages.
How Robots.txt Works
When a search engine crawler (like Googlebot) arrives at your site, it first looks for yoursite.com/robots.txt. The file contains rules specifying which user agents (crawlers) can access which paths. If no robots.txt exists, crawlers assume they can access everything.
The file uses a simple syntax with three main directives. User-agent specifies which crawler the rules apply to (use * for all crawlers). Disallow specifies paths the crawler should not access. Allow explicitly permits access to paths within a disallowed directory.
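For example, a minimal robots.txt using all three directives might look like this (the paths are placeholders):

```
# Rules for all crawlers
User-agent: *
# Block everything under /admin/
Disallow: /admin/
# ...except the public help pages inside that directory
Allow: /admin/help/
```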
A robots.txt that disallows /admin/ tells crawlers not to access any URL starting with /admin/. This keeps crawlers out of pages that offer no search value and conserves crawl budget, although, as covered below, blocking a URL from crawling does not guarantee it stays out of the index.
Common Robots.txt Rules
Block admin areas to keep login pages, dashboard pages, and administrative tools from being crawled. These pages have no value in search results and waste crawl budget.
Block duplicate content paths like URL parameters, print versions, and filtered views. If your e-commerce site generates thousands of filtered URLs (/products?color=red&size=large), blocking the parameter paths prevents Google from crawling duplicate content.
Block internal search results pages. Your site’s search results pages are generated dynamically and create infinite URL combinations that waste crawl budget.
Allow CSS and JavaScript files. Google needs to render your pages to evaluate their content. Blocking CSS and JS files prevents rendering and can hurt your rankings.
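Putting these rules together, a robots.txt for a typical e-commerce site might look roughly like the sketch below. The paths are illustrative, so adjust them to your own URL structure; the Allow rule for assets is only strictly needed if a broader Disallow would otherwise catch those files.

```
User-agent: *
# Admin and account areas: no search value, wasted crawl budget
Disallow: /admin/
Disallow: /login/
# Parameterized/filtered product views (duplicate content)
Disallow: /products?
# Internal search results pages
Disallow: /search
# Keep rendering assets crawlable even if a broader rule blocks their parent path
Allow: /assets/
```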
Robots.txt vs Noindex
Robots.txt prevents crawling. The noindex meta tag prevents indexing. They serve different purposes.
If you disallow a page in robots.txt, crawlers will not visit it, but if other sites link to that page, Google might still index the URL (displaying it with limited information). To truly remove a page from search results, use the noindex meta tag on the page itself — but the crawler must be able to access the page to read the noindex directive.
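For reference, noindex is a meta tag placed in the page's head (it can also be sent as an X-Robots-Tag HTTP header):

```
<!-- Removes the page from search results, but only if crawlers can reach it -->
<meta name="robots" content="noindex">
```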
Do not use robots.txt to hide pages from search results. Use noindex instead. Robots.txt is for managing crawl efficiency, not for removing pages from the index.
Sitemap Reference
Add a Sitemap directive at the bottom of your robots.txt file pointing to your XML sitemap URL. This helps search engines find your sitemap without relying solely on Search Console submission.
The syntax is simply: Sitemap: https://yoursite.com/sitemap.xml
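In a complete file it sits on its own line, conventionally after the crawl rules; the Sitemap directive is independent of any User-agent group, so it applies to all crawlers. The URL below is a placeholder:

```
User-agent: *
Disallow: /admin/

Sitemap: https://yoursite.com/sitemap.xml
```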
Testing Your Robots.txt
Test your robots.txt in Google Search Console. The robots.txt report (which replaced the older standalone Robots.txt Tester) shows whether Google can fetch and parse your file and flags any rule errors, and the URL Inspection tool reports whether a specific URL is blocked or allowed under your current rules.
Test before deploying changes. An overly broad disallow rule can accidentally block important content from being crawled, which can be devastating for your search traffic.
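For instance, because Disallow values are prefix matches, a rule like the hypothetical one below blocks far more than the private section it was meant to cover:

```
User-agent: *
# Intended to block /private/, but as a prefix this also matches
# /products/, /press/, /pricing/, and anything else starting with /pr
Disallow: /pr
```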
Common Mistakes
Blocking your entire site with a Disallow: / rule during development and forgetting to remove it at launch is a surprisingly common and devastating mistake.
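The rule takes only two lines, which is part of why it slips through to production so easily:

```
User-agent: *
# Blocks the entire site from all compliant crawlers
Disallow: /
```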
Blocking CSS and JavaScript prevents Google from rendering your pages properly, potentially hurting rankings.
Using robots.txt to hide sensitive information does not work because the file itself is publicly accessible. Anyone can read your robots.txt and discover the paths you are trying to hide.
Key Takeaways
- Robots.txt controls which areas of your site search engine crawlers can access
- Block admin areas, duplicate content paths, and internal search results to manage crawl budget
- Use noindex for removing pages from search results, not robots.txt
- Include a Sitemap directive pointing to your XML sitemap
- Test changes with Search Console's robots.txt report and URL Inspection tool before deploying
- Never accidentally block your entire site or your CSS and JavaScript files