A Look At robots.txt Files

18th May 2009 - 5 minutes read time

A robots.txt file is a simple, static file that you can add to your site to stop search engines from crawling certain pages or directories. You can also prevent specific user agents from crawling particular areas of your site.

Let's take a real-world example and look at what you would do if you decided to set up a Feedburner feed in place of your normal RSS feed. I won't go into the reasons in much detail, other than to say that you get some nice usage statistics and it can save some processing power on your server, as your feed is only requested when Feedburner updates. Once you have set up your blog to issue the Feedburner feed instead of your normal feed, you then need to stop search engines from crawling the old feed. This helps keep it out of search results, so that your users grab the Feedburner feed and not your local feed. To do this you would put a robots.txt file in place with the following content.

User-agent: *
Disallow: /feed
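
One way to see what this rule blocks is to feed it to the robots.txt parser in Python's standard library. This is only a rough sketch: the user agent name ExampleBot and the paths under www.example.com are made up for illustration, and real crawlers may interpret rules slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rule from the example above, held in a string so nothing is fetched.
FEED_RULES = """\
User-agent: *
Disallow: /feed
"""

parser = RobotFileParser()
parser.parse(FEED_RULES.splitlines())

# "ExampleBot" and these paths are hypothetical, chosen only to probe the rule.
for path in ("/feed", "/feed/atom", "/feedback", "/blog/a-post"):
    url = "http://www.example.com" + path
    print(path, parser.can_fetch("ExampleBot", url))
```

Disallow rules are matched as path prefixes, so this parser also reports /feedback as blocked. If that is not what you want, a more specific rule such as /feed/ may be a better fit.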

In another scenario you might want to stop a certain bot from crawling the content of your site. In the following example we are stopping a user agent called ia_archiver, which is used to create a copy of your site at archive.org. There are a couple of reasons why you might want to stop this from happening, but here is the rule you would need.

User-agent: ia_archiver
Disallow: /
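
As a quick check that this rule only affects the named bot, the same standard-library parser can be asked about two different user agents. This is a sketch rather than a definitive test: the page URL is a placeholder, and Googlebot is used here simply as an example of some other crawler.

```python
from urllib.robotparser import RobotFileParser

# The rule from the example above: block ia_archiver from the whole site.
ARCHIVER_RULES = """\
User-agent: ia_archiver
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ARCHIVER_RULES.splitlines())

# Placeholder URL used only for illustration.
page = "http://www.example.com/blog/a-post"

archiver_allowed = parser.can_fetch("ia_archiver", page)
other_allowed = parser.can_fetch("Googlebot", page)

print("ia_archiver allowed:", archiver_allowed)
print("Googlebot allowed:", other_allowed)
```

With no rule group for other agents, the parser treats them as allowed, which matches the default behaviour of a site that has no robots.txt file at all.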

So is a robots.txt file beneficial, or even useful, in terms of SEO? Well, there is one use of the file that can have beneficial results, although how beneficial depends on how big or complicated the site is. Let me explain.

Google, Yahoo, MSN and other search engines have adopted the sitemap.xml format. Along with this format comes the ability to link to your sitemap.xml file from your robots.txt file. You might add the following to your robots.txt file.

Sitemap: http://www.example.com/sitemap.xml

This line can be a little redundant if your sitemap is a file called sitemap.xml at the root of your domain, as search engines may look there anyway, although the sitemap protocol itself does not guarantee this. You would normally use this option if your sitemap.xml file is created by your CMS and has a non-standard name. In that case a rewrite rule might be better, but it is still possible to do this with your robots.txt file.

Sitemap: http://www.example.com/cms/feeds/sitemap/format/xml/

The power of this option becomes apparent when you have a large site and need to split your sitemap across several files. You can link to a sitemap index file (which contains references to other sitemap files and might not itself be called sitemap.xml) or simply list multiple sitemap files in a single robots.txt file. Here is a robots.txt file that references two sitemap files that each contain half of the site.

Sitemap: http://www.example.com/sitemapa-m.xml
Sitemap: http://www.example.com/sitemapn-z.xml
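
To illustrate how a crawler might pick up these entries, here is a small sketch using Python's standard-library parser. It assumes Python 3.8 or later, which is when the site_maps() method was added; whether a given search engine reads the lines in exactly this way is not something this example can confirm.

```python
from urllib.robotparser import RobotFileParser

# The two Sitemap lines from the example above.
SITEMAP_RULES = """\
Sitemap: http://www.example.com/sitemapa-m.xml
Sitemap: http://www.example.com/sitemapn-z.xml
"""

parser = RobotFileParser()
parser.parse(SITEMAP_RULES.splitlines())

# site_maps() returns the listed URLs in file order, or None if there are none.
sitemaps = parser.site_maps() or []

for sitemap_url in sitemaps:
    print(sitemap_url)
```

Sitemap lines sit outside any User-agent group, so they apply to every crawler that reads the file.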

I had a quick look at lots of sites that are run by people in the SEO industry and found that many used the robots.txt file simply to disallow certain files or directories. A few used the file to point at their sitemap.xml file, which was usually called something else. What was more interesting was that many sites simply didn't have a robots.txt file at all, which leads me to think that these files aren't all that useful.

In my personal opinion, you should only use a robots.txt file if you have something that you want to prevent being crawled or you want to add a link to your XML sitemap. Take a robots.txt file that looks like this:

User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml

This allows all user agents to crawl all pages on your site, but as this is the default behaviour anyway it might be better just to leave the User-agent and Disallow lines out.
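
The point that an empty Disallow line permits everything can be checked with Python's standard-library parser. This is a sketch under a couple of assumptions: the user agent names and page URL are invented for the example, and site_maps() needs Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# The "allow everything" file from the example above.
ALLOW_ALL_RULES = """\
User-agent: *
Disallow:

Sitemap: http://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ALLOW_ALL_RULES.splitlines())

# Hypothetical agents and URL, used only to probe the parsed rules.
page = "http://www.example.com/private/report.html"
results = {agent: parser.can_fetch(agent, page) for agent in ("ExampleBot", "ia_archiver")}

print(results)
print(parser.site_maps())
```

Every agent is allowed, so the only line in this file that changes anything is the Sitemap entry.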
