Minimize Spider Overutilization of Your Site with Robots.txt
Written by Ben McInturff   
Wednesday, 29 April 2009 18:30

 

It has been a few years since I last brushed up on robots.txt files, and in that time I missed a useful feature that Yahoo and MSN have implemented.

You can now target these two crawlers and limit how heavily they crawl your site with the Crawl-delay directive, whose value is generally read as the number of seconds a crawler should wait between requests:

 

Crawl-delay: 5
 

A complete robots.txt could then look like:

User-Agent: *
Disallow: /images/
Allow: /
Crawl-delay: 5
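
To check how a well-behaved crawler would read the file above, here is a minimal sketch using Python's standard urllib.robotparser module. The bot name "ExampleBot" and the example.com URLs are placeholders I made up for illustration, and this only shows how one parser interprets the rules, not how Yahoo or MSN implement them.

```python
from urllib.robotparser import RobotFileParser

# The complete robots.txt from the article, inlined so the sketch is self-contained.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /images/
Allow: /
Crawl-delay: 5
"""


def load_rules(text):
    """Parse robots.txt content from a string instead of fetching it over HTTP."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" entry.
print("Crawl delay:", rules.crawl_delay("ExampleBot"))
print("May fetch /images/logo.png:",
      rules.can_fetch("ExampleBot", "http://example.com/images/logo.png"))
print("May fetch /index.html:",
      rules.can_fetch("ExampleBot", "http://example.com/index.html"))
```

A crawler that respects the file would skip anything under /images/ and pause for the reported delay between requests.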
 

Google's crawler does not honor Crawl-delay; to adjust how intensively it crawls your site, you will have to use the crawl rate setting in Google's Webmaster Tools.

Adding a Crawl-delay line to your robots.txt lets you decrease the intensity of search engine crawler traffic to your site. Note, however, that not all crawlers are well behaved, and some will ignore your robots.txt entirely. When one of those overutilizes your site's resources, you will need a hardware or software firewall to either block it outright or limit its ability to connect.
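
As a rough illustration of what "limiting their ability to connect" can mean in software, here is a small sketch of a per-client sliding-window limiter in Python. The class name, the limits, and the idea of keying on an IP address are all my own assumptions for the example; a real firewall or web server module would do this at a lower level, so treat it as a sketch of the idea rather than a drop-in solution.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds (hypothetical helper)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.hits = defaultdict(deque)  # client key -> timestamps of allowed requests

    def allow(self, client):
        """Return True if this client may make a request now, False if it should be refused."""
        now = self.clock()
        recent = self.hits[client]
        # Forget requests that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True


# Example: at most 2 requests per 5 seconds for each client address.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=5)
for attempt in range(3):
    print(attempt, limiter.allow("203.0.113.7"))
```

The third request in the loop is refused because it arrives inside the same 5 second window as the first two.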

 
