For most people, the web is a collection of sites that are posted online for hobbies, work, amusement or business. When we want to find a site on a particular topic, we open up our favorite search engine and, with a few keywords, find what we need. Few of us give much thought to how those pages get linked to those keywords. How does the search engine know that this page is about this topic? How does it know two pages are related through links? Part of the answer is the robots.txt file, which tells search engine spiders how to interact with and index your content.
When you use your site to do business or generate income, it’s a scary idea that one file on your server could make or break your site, but it’s very true. A misconfigured or missing robots.txt file could do a lot of damage to your search engine listing and rank, as well as expose areas of your site you may not want publicly indexed. For these reasons, we are going to get inside our robots.txt file and figure out exactly what these robots want.
What is a Search Engine Bot?
A search engine bot is an automated software agent often called a spider, crawler, robot or bot. These software programs search the internet and perform several functions:
- Find new content that can be added to the index from textual content on the site
- Determine what the site is about so it can be classified using Meta tags, alt tags, titles or other information.
- Navigate the links on the site to find new content
The idea behind this content search is to provide a higher-quality search experience for the users of the bot’s search engine. Search engines that have a good, valuable database of content will be able to deliver better results for their users’ queries.
What is a Robots.txt File?
The Robots.txt file is an instructional list of commands that web bots can use to determine which content they are allowed to index and which areas of your site you want them to stay out of. You can also instruct bots that they can follow any, some or none of the links found on your site. If you do not have a robots.txt file when a bot comes to your site, it will get a 404 error since there is no file for it to access.
By default, a missing robots.txt file is treated as permission rather than denial of access. Robots can be greedy in their search for information, and they will assume your entire site is fair game for indexing if there are no instructions otherwise. If this is your intent, you can upload a blank robots.txt file (just a regular file with nothing in it) to your server, which will stop the 404 errors from showing up in your logs and give the robots permission to index without any further instructions.
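The "no rules means everything is allowed" behavior can be checked with a small sketch. It uses Python's standard urllib.robotparser module as a stand-in for a search engine bot; the bot name ExampleBot and the path are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# An empty robots.txt: a file with no rules at all.
empty_rules = RobotFileParser()
empty_rules.parse("".splitlines())

# With no rules to consult, the parser treats every path as allowed.
# "ExampleBot" and the path are hypothetical.
print(empty_rules.can_fetch("ExampleBot", "/any/page.html"))  # True
```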
Creating Rules for Robots
Very rarely do we want these bots to crawl our site and index all of the folders, files and information that are found there. We often have secure areas, script source folders, admin areas, etc… that do not belong in the public search index. In this instance we need to create rules in our robots.txt to instruct the bot to ignore these areas, like so:
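As a minimal sketch of such rules, the snippet below holds a small robots.txt in a string and checks it with Python's standard urllib.robotparser module. The folder names (/admin/, /scripts/, /private/) and the bot name ExampleBot are assumptions chosen for illustration, not required names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the folder names are examples only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /scripts/
Disallow: /private/
"""

site_rules = RobotFileParser()
site_rules.parse(ROBOTS_TXT.splitlines())

# Any bot is kept out of the listed folders but may crawl the rest.
print(site_rules.can_fetch("ExampleBot", "/admin/login.html"))  # False
print(site_rules.can_fetch("ExampleBot", "/index.html"))        # True
```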
The first line of this script, User-agent: *, indicates that we want all bots to follow the instructions below. In the following section we will look at how to give instructions for specific search bots. The Disallow commands above block folders from being indexed in the search engine. To prevent pages from showing up in the listing by URL, it is also good practice to add the "noindex" value to the robots meta tag on pages that you do not want indexed:
<meta name="robots" content="noindex">
This meta tag should be included on any HTML page that search crawlers should not index, to prevent it from showing up in the public listing under its URL. If you add the Disallow command to the robots file but not the meta tag on the page, it could still appear in search results, for example when other sites link to it. Keep in mind that a crawler can only read the meta tag on pages it is allowed to fetch.
Targeting Specific Search Bots
In the previous section we saw the User-agent: * command, which indicates instructions for all search engine bots. You can supply specific instructions to bots based on their name by using the User-agent: botname command. For example:
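The sketch below shows one way this could look, again checked with Python's standard urllib.robotparser. The Slurp section targets one bot by name; the folder names and the bot name OtherBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one section for every bot, one for Slurp only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

User-agent: Slurp
Disallow: /admin/
Disallow: /drafts/
"""

bot_rules = RobotFileParser()
bot_rules.parse(ROBOTS_TXT.splitlines())

# Slurp follows its own section, so /drafts/ is off limits to it.
print(bot_rules.can_fetch("Slurp", "/drafts/post.html"))     # False
# Other bots fall back to the "*" section, which does not mention /drafts/.
print(bot_rules.can_fetch("OtherBot", "/drafts/post.html"))  # True
# /admin/ is listed in both sections, so it is blocked for everyone.
print(bot_rules.can_fetch("Slurp", "/admin/"))               # False
```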
If you have “global” commands for all bots that also apply to the more specific bot, the commands need to be in both the “global” command section as well as the section for that bot. This ensures the command is read and interpreted properly. There is a rather long list of search engines available online and you would need to look up the robot identifier name for a particular one if you wanted to provide specific instructions for it. Here is a list of the most popular ones:
|Search Engine||Robot Identifier|
|Yahoo||Yahoobot, Slurp, yahoobot-slurp|
You can use your server logs to see which search engine robots are visiting your site and what areas they are visiting based on the requests made to your pages.
Many of the larger search engines, such as Yahoo, Google and MSN, allow wildcards in the robots.txt file that can be used to convey a more specific command.
The asterisk (*) matches any sequence of characters, so pairing it with a question mark (?) can be used to block access to all URLs that contain a query string, such as:
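Python's standard robots.txt parser does plain prefix matching and does not expand wildcards, so the sketch below uses a small hand-written matcher instead. It assumes a rule of the form Disallow: /*? and treats the asterisk as "any run of characters"; the function name wildcard_match and the sample URLs are made up for illustration.

```python
import re

def wildcard_match(rule_path: str, url: str) -> bool:
    """Return True if url starts with a match for rule_path.

    Only '*' is special here (any run of characters); every other
    character, including '?', is matched literally.
    """
    pattern = ".*".join(re.escape(piece) for piece in rule_path.split("*"))
    return re.match(pattern, url) is not None

RULE = "/*?"  # as in: Disallow: /*?

print(wildcard_match(RULE, "/search?q=robots"))  # True: URL contains "?"
print(wildcard_match(RULE, "/about.html"))       # False: no "?" in the URL
```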
The dollar sign ($) can be used to specify matching at the end of the URL. This is helpful if you want to block URLs that end with a particular file extension, such as:
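As a rough sketch of end-of-URL matching, the hand-written matcher below assumes a rule of the form Disallow: /*.pdf$ and treats a trailing dollar sign as "the URL must end here". The function name rule_matches, the .pdf extension and the sample URLs are assumptions for illustration only.

```python
import re

def rule_matches(rule_path: str, url: str) -> bool:
    """Match url against rule_path with '*' and a trailing '$' supported."""
    anchored = rule_path.endswith("$")
    body = rule_path[:-1] if anchored else rule_path
    # '*' becomes "any run of characters"; everything else is literal.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    if anchored:
        pattern += "$"  # the rule must match up to the end of the URL
    return re.match(pattern, url) is not None

RULE = "/*.pdf$"  # as in: Disallow: /*.pdf$

print(rule_matches(RULE, "/files/report.pdf"))             # True
print(rule_matches(RULE, "/files/report.pdf?page=2"))      # False
# Without the '$', the same pattern also catches the longer URL.
print(rule_matches("/*.pdf", "/files/report.pdf?page=2"))  # True
```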
Using a SiteMap
For sites that use dynamic forms of navigation, Flash or other techniques instead of links, you may be cutting off these robots from parts of your site. One way around this is to point the robots to your sitemap:
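A sitemap entry is a single Sitemap: line with the full URL of the sitemap file. The sketch below assumes a hypothetical sitemap at https://www.example.com/sitemap.xml and reads it back with Python's standard urllib.robotparser (the site_maps() method needs Python 3.8 or later).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the sitemap URL is an example only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""

sitemap_rules = RobotFileParser()
sitemap_rules.parse(ROBOTS_TXT.splitlines())

# site_maps() returns the listed sitemap URLs, or None if there are none.
print(sitemap_rules.site_maps())  # ['https://www.example.com/sitemap.xml']
```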
This will allow the robot to crawl the links listed in your sitemap and index pages it may not have been able to access previously.
Setting Search Engine Crawl Delay
Some search engines let you set a crawl delay, which controls how often they request pages from you. This is the number of seconds to wait between successive requests to the same server. If your site provides a large amount of real-time updates, frequent successive visits can overload your server. To prevent this, set the time delay between requests:
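A minimal sketch, assuming a delay of 10 seconds (an arbitrary example value) applied to all bots. Python's standard urllib.robotparser can read the value back, which is roughly what a well-behaved crawler would do before pacing its requests.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: 10 seconds is an arbitrary example value.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
"""

delay_rules = RobotFileParser()
delay_rules.parse(ROBOTS_TXT.splitlines())

# crawl_delay() returns the delay in seconds for the given bot,
# or None when no delay applies to it.
print(delay_rules.crawl_delay("ExampleBot"))  # 10
```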
Google does not pay attention to this command in the robots.txt file but you can set your crawl rate in the Google Webmaster Central settings.
There are a few other settings or characters that can be used in the robots.txt:
- The pound sign (#) is used to make comment marks in the file for later reference.
- Allow: is the opposite of Disallow. You can use this with Disallow to deny access to a folder but allow access to a subdirectory.
- Visit-time is supported by some bots but ignored by others. It allows you to restrict the time in which web bots can visit and index your content. It is given in military (24-hour) time format.
- Request-rate is given as a fraction of pages per time period, for example 1/10m for 1 page every 10 minutes.
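The comment and Allow settings above can be sketched together as follows, using Python's standard urllib.robotparser. The folder names are hypothetical. One assumption worth flagging: Python's parser applies the first rule that matches, so the Allow line is placed before the broader Disallow line; other crawlers may resolve such overlaps differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: folder names are examples only.
ROBOTS_TXT = """\
# Keep bots out of downloads, except the public subfolder.
User-agent: *
Allow: /downloads/public/
Disallow: /downloads/
"""

allow_rules = RobotFileParser()
allow_rules.parse(ROBOTS_TXT.splitlines())

# The comment line is ignored; the Allow rule opens up one subdirectory.
print(allow_rules.can_fetch("ExampleBot", "/downloads/public/guide.pdf"))    # True
print(allow_rules.can_fetch("ExampleBot", "/downloads/internal/notes.txt"))  # False
```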
While most reputable bots will follow the instructions found in your robots.txt file, not all will. You should take proper precautions to protect secure areas by creating directory or .htaccess permissions that keep out people (or bots) that shouldn’t be there. Robots.txt is available to the public and can be found by any casual browser, so don’t use it as a privacy mechanism. With a well-configured robots.txt file in place, you may see some improvement in your search engine ranking, as well as the removal of parts of your site that shouldn’t be listed in the index.