The robots.txt file sits at the root level of your web site and asks spiders and bots to behave themselves when they're on your site. You can take a look at it by pointing your browser to http://www.yourDrupalsite.com/robots.txt. Think of it as an electronic No Trespassing sign that tells the search engines not to crawl a certain directory or page of your site. Using wildcards, you can even tell the engines not to crawl certain file types like .jpg or .pdf. This means none of your JPEG images or PDF files will be crawled, so they generally won't show up in the search engines. (I'm not recommending that you do that…but you could.)
On December 1, 2008, John Mueller, a Google analyst, said that if the Googlebot can't access the robots.txt file (say the server is unreachable or returns a 5xx error result code) then it won't crawl the web site at all. In other words, requests for the robots.txt file must not time out or fail with a server error if you want the web site to be crawled and indexed by Google. Read his full comment at the following link: http://budURL.com/robotstxt.
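A quick way to confirm that robots.txt is reachable is to request it and look at the status code. The following is a minimal sketch, not Google's own logic: the function names robots_status and blocks_crawling are made up for illustration, and blocks_crawling covers only the two cases mentioned in the comment above (an unreachable server or a 5xx response).

```python
import urllib.error
import urllib.request


def robots_status(site_url, timeout=10):
    """Return the HTTP status code for <site_url>/robots.txt, or None if unreachable."""
    url = site_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 503.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def blocks_crawling(status):
    """True for the two cases described above: unreachable, or a 5xx response."""
    return status is None or 500 <= status <= 599
```

To try it against your own site, call blocks_crawling(robots_status("http://www.yourDrupalsite.com")), replacing the placeholder host with your real one. A result of True means the file is in one of the two trouble states described above.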
Drupal 6 provides a standard robots.txt file that does an OK job.
The Drupal 6 robots.txt file carries instructions for robots and spiders that may crawl your site.
Let's take a deeper look at each directive used in the Drupal robots.txt file. This is a bit tedious, but it's truly worth it to understand exactly what you're telling the search engines to do.
# - Hides text from the robot. This is a good way to put in notes or comments.
User-agent: *
Tells which robot should read the following instructions. * means all robots.
Crawl-delay: X
The delay, in seconds, between page requests from a bot. Replace X with a whole number between 1 and 20. Note that this directive is ignored by Google. You can adjust the crawl delay by using Google's Webmaster Tools.
Disallow: /path/
Disallow: /file.txt
Says to the robots, 'Don't crawl this!' In the case of a path, a robot won't crawl anything in that directory or below it.
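To see how a plain Disallow rule behaves, here is a small sketch using Python's standard urllib.robotparser module. It is only an approximation of a real crawler: this parser does simple prefix matching, and the rules and URLs below are illustrative placeholders rather than lines taken from the Drupal file.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; /path/ and /file.txt are placeholders.
rules = """\
User-agent: *
Disallow: /path/
Disallow: /file.txt
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

site = "http://www.yourDrupalsite.com"
for url_path in ("/path/", "/path/sub/page.html", "/file.txt", "/other/page.html"):
    allowed = parser.can_fetch("*", site + url_path)
    print(url_path, "->", "allowed" if allowed else "blocked")
```

The first three paths are reported as blocked and the last one as allowed, because only the first three begin with one of the disallowed prefixes.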
Google (but not all search engines) understands some wildcard characters. The following list explains the usage of a few wildcard characters:
1. To match a sequence of characters, use *.
For example:
Disallow: /dev*/
This will exclude any subdirectory that begins with the letters "dev".
2. To block access to all URLs that include a particular character X (a question mark in this example), use the same *.
For example:
Disallow: /*?
which is the same as:
Disallow: /*?*
and both will disallow any path with a ? in it.
3. To specify a three-letter extension at the end of any file, use * and $.
For example, this:
Disallow: /*pdf$
will exclude any files that end with pdf across your entire site. However, it allows any file with pdf in the middle of the filename like pdfdocslist.php.
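The three wildcard cases above can be checked with a few lines of Python. This is a simplified model of the matching described in this section, not Google's actual matcher; rule_matches is a hypothetical helper, and the sample URLs are invented for illustration.

```python
import re


def rule_matches(pattern, url):
    """Return True if a wildcard Disallow pattern matches the URL path (plus query)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then let every * stand for any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # Rules are matched from the start of the path, so re.match is enough.
    return re.match(regex, url) is not None


examples = [
    ("/dev*/", "/development/index.html"),
    ("/*?", "/node/1?page=2"),
    ("/*?*", "/node/1?page=2"),
    ("/*pdf$", "/files/report.pdf"),
    ("/*pdf$", "/pdfdocslist.php"),
]
for pattern, url in examples:
    print(pattern, url, rule_matches(pattern, url))
```

The first four examples match, so those URLs would be excluded. The last one does not match, because pdfdocslist.php has pdf in the middle of the filename rather than at the end.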
Come back to this section to walk through the steps when you want to make each change.
1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt
2. Using your FTP program or command line editor, navigate to the top level of your Drupal web site and locate the robots.txt file. If, for some reason, the robots.txt file is missing, you can easily create one using any plain text editor like Notepad or TextEdit. Avoid using a word processor, though, as it adds formatting that will make the file unreadable to the search engines.
3. Make a backup of the file.
4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.
5. Most directives in the robots.txt file apply to the User-agent: line that precedes them. If you are going to give different instructions to different engines, be sure to place them above the User-agent: * section, as some search engines will read only the directives for * if their specific instructions come after that section.
6. Add the lines you want. Later in this chapter, you'll learn several changes which will help you with your SEO.
7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).
8. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to do a refresh on your browser to see the changes.
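If you have shell access to the server, the backup in step 3 can be scripted so that it is never skipped. This is only a sketch: backup_robots is a hypothetical helper, the timestamped file name is an arbitrary choice, and the path you pass in must be the top level of your own Drupal installation.

```python
import shutil
import time
from pathlib import Path


def backup_robots(drupal_root):
    """Copy robots.txt to a timestamped backup in the same directory and return its path."""
    source = Path(drupal_root) / "robots.txt"
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"robots.txt.{stamp}.bak")
    # copy2 keeps the file's modification time along with its contents.
    shutil.copy2(source, target)
    return target
```

Run backup_robots with the path to your Drupal root before you open the file for editing. If a change goes wrong, copy the .bak file back over robots.txt.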
There are several problems with the default Drupal robots.txt file. If you use the robots.txt testing utility in Google Webmaster Tools (detailed instructions on this utility later in this chapter) to test each line of the file, you'll find that a lot of paths which look like they're being blocked will actually be crawled. The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. Because of the way robots.txt files are parsed, Googlebot will avoid the page with the slash but crawl the page without the slash.
Google what? Googlebot! Google and other search engines use server systems (sometimes called spiders, crawlers, or robots) to go around the Internet and find each web site. We sometimes refer to Google's system as the Googlebot to distinguish it from other search engine robots. While Google doesn't report this number anymore, it is estimated that the Googlebot crawls 10 billion web sites each week! That is a fast little robot.
For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But, put in http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster! Fortunately, this is relatively easy to fix.
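The trailing-slash behaviour can also be reproduced offline. The sketch below uses Python's standard urllib.robotparser module, which is not the Googlebot but applies the same kind of prefix matching, against a one-rule file containing only the /admin/ line discussed above.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /admin/",
])

site = "http://www.yourDrupalsite.com"
with_slash = parser.can_fetch("Googlebot", site + "/admin/")
without_slash = parser.can_fetch("Googlebot", site + "/admin")

print("/admin/ allowed?", with_slash)      # False
print("/admin  allowed?", without_slash)   # True
```

The URL /admin does not start with the disallowed prefix /admin/, so the rule never applies to it.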
This article was sent to us by:
Gregory F. at
01172010
© 2010 WebWorldarticles.com - All Rights Reserved.