Robots.txt file for Drupal: examples and editing


Optimizing the robots.txt file

The robots.txt file is a file that sits at the root level of your web site and asks spiders and bots to behave themselves when they're on your site. You can take a look at it by pointing your browser to http://www.yourDrupalsite.com/robots.txt. Think of it like an electronic No Trespassing sign that can easily tell the search engines not to crawl a certain directory or page of your site. Using wildcards, you can even tell the engines not to crawl certain file types like .jpg or .pdf. This means none of your JPEG images or PDF files will show up in the search engines. (I'm not recommending that you do that…but you could.)

The robots.txt file must be reachable by Google

On December 1, 2008, John Mueller, a Google analyst, said that if the Googlebot can't access the robots.txt file (say the server is unreachable or returns a 5xx error result code) then it won't crawl the web site at all. In other words, the robots.txt URL must respond properly if you want the web site to be crawled and indexed by Google. A file that is simply missing (a 404 response) is a different case: Google treats it as meaning there are no crawl restrictions. Read his full comment at the following link: http://budURL.com/robotstxt.
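One way to check for the situation described above is to request the robots.txt URL and look at the HTTP status code. The following is a minimal sketch, not an official Google tool: the function names robots_status and blocks_googlebot are made up for this illustration, and the rule it encodes (unreachable or 5xx means no crawling) is simply the behaviour described in the comment above.

```python
import urllib.error
import urllib.request


def robots_status(site_url, timeout=10):
    """Return the HTTP status code of <site_url>/robots.txt, or None if unreachable."""
    url = site_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # The server answered, but with an error status such as 404 or 503.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar problems.
        return None


def blocks_googlebot(status):
    """Assumption from the text above: unreachable or 5xx stops crawling."""
    return status is None or 500 <= status <= 599
```

For example, blocks_googlebot(robots_status("http://www.yourDrupalsite.com")) would report whether the site is at risk of not being crawled, with yourDrupalsite.com standing in for your own domain.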

Drupal 6 provides a standard robots.txt file that does an OK job.

The Drupal 6 robots.txt file carries instructions for robots and spiders that may crawl your site.

robots.txt directives

Let's take a deeper look at each directive used in the Drupal robots.txt file. This is a bit tedious, but it's truly worth it to understand exactly what you're telling the search engines to do.

# comment text

Hides the rest of the line from the robot. This is a good way to put in notes or comments.

User-agent: *

Tells which robot should read the following instructions. * means all robots.

Crawl-delay: X

The delay, in seconds, between page requests from a bot. Replace X with a whole number between 1 and 20. Note that this directive is ignored by Google. You can adjust the crawl delay by using Google's Webmaster Tools.

Disallow: /path/
Disallow: /file.txt

Says to the robots, 'Don't crawl this!'. In the case of paths, a robot won't crawl anything in that directory or below it.
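To see how a parser reads these directives, here is a small sketch using Python's standard urllib.robotparser module. The sample rules, the bot name ExampleBot, and the yourDrupalsite.com URLs are placeholders invented for illustration; they are not the contents of the real Drupal file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that use each directive discussed above.
SAMPLE_RULES = """\
# This comment is ignored by robots
User-agent: *
Crawl-delay: 10
Disallow: /path/
Disallow: /file.txt
"""

parser = RobotFileParser()
parser.parse(SAMPLE_RULES.splitlines())

SITE = "http://www.yourDrupalsite.com"

# Anything inside /path/ is off limits, and so is the single file.
path_allowed = parser.can_fetch("ExampleBot", SITE + "/path/page.html")
file_allowed = parser.can_fetch("ExampleBot", SITE + "/file.txt")

# A URL that no Disallow line covers stays crawlable.
other_allowed = parser.can_fetch("ExampleBot", SITE + "/about")

# The delay the parser read for bots that fall under "User-agent: *".
delay = parser.crawl_delay("ExampleBot")

print(path_allowed, file_allowed, other_allowed, delay)
```

Keep in mind that this only shows how one parser interprets the file. Each search engine has its own parser, and as noted above, Google ignores Crawl-delay.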

Pattern matching

Google (but not all search engines) understands some wildcard characters. The following examples explain the usage of a few wildcard characters:

1. To match a sequence of characters use: *.

Eg:

Disallow: /dev*/

This will exclude any subdirectory that begins with the letters "dev".

2. To block access to all URLs that include a particular character, such as a question mark (?), use the same * wildcard.

Eg:

Disallow: /*?

which is the same as:

Disallow: /*?*

and both will disallow any path with a ? in it.

3. To match a file extension (for example, a three-letter one) at the end of any URL, use * and $ together.

For example, this:

Disallow: /*pdf$

will exclude any files that end with pdf across your entire site. However, it still allows any file with pdf elsewhere in the filename, like pdfdocslist.php.
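The three patterns above can be tried out with a short sketch that translates a Google-style pattern into a regular expression. This is an illustration of the matching rules as described in this section, not Google's actual implementation, and the helper names pattern_to_regex and is_blocked are invented here. (Python's built-in urllib.robotparser only does plain prefix matching, so it cannot be used for wildcard tests.)

```python
import re


def pattern_to_regex(pattern):
    """Translate a Disallow pattern using * and a trailing $ into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything literally, then let each * match any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"  # the rule must match up to the end of the URL
    return re.compile(regex)


def is_blocked(pattern, path):
    """Return True if the URL path starts with something the pattern matches."""
    return pattern_to_regex(pattern).match(path) is not None


# 1. Subdirectories beginning with "dev"
print(is_blocked("/dev*/", "/devel/settings"))
# 2. Any path containing a question mark
print(is_blocked("/*?", "/node?page=2"))
# 3. Paths ending in pdf, but not pdf elsewhere in the name
print(is_blocked("/*pdf$", "/files/guide.pdf"))
print(is_blocked("/*pdf$", "/pdfdocslist.php"))
```

Without the $ sign, a pattern is treated as a prefix, which is why /*? and /*?* behave the same way.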

Editing your robots.txt file

Come back to this section to walk through the steps when you want to make each change.

1. Check to see if your robots.txt file is there and available to visiting search bots. Open your browser and visit the following link: http://www.yourDrupalsite.com/robots.txt

2. Using your FTP program or command line editor, navigate to the top level of your Drupal web site and locate the robots.txt file. If, for some reason, the robots.txt file is missing, you can easily create one using any plain text editor like Notepad or TextEdit. Avoid using a word processor, though, as word processors add formatting content which will make the file unreadable to the search engines.

3. Make a backup of the file.

4. Open the robots.txt file for editing. If necessary, download the file and open it in a local text editor tool.

5. Most directives in the robots.txt file apply to the User-agent: line that precedes them. If you are going to give different instructions to different engines, be sure to place those sections above the User-agent: * section, as some search engines will only read the directives for * if you place their specific instructions after that section.

6. Add the lines you want. Later in this chapter, you'll learn several changes which will help you with your SEO.

7. Save your robots.txt file, uploading it if necessary, replacing the existing file (you backed it up, didn't you?).

8. Point your browser to http://www.yourDrupalsite.com/robots.txt and double-check that your changes are in effect. You may need to do a refresh on your browser to see the changes.
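Two of the steps above, the backup in step 3 and the section ordering in step 5, are easy to script if you edit the file often. The following is only a sketch under the assumption that you have a local copy of robots.txt to work on; the function names backup_file and wildcard_group_is_last are made up for this example.

```python
import shutil
import time
from pathlib import Path


def backup_file(path):
    """Copy the file to a timestamped sibling (step 3) and return the new path."""
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.{stamp}.bak")
    shutil.copy2(source, target)
    return target


def wildcard_group_is_last(robots_text):
    """Return True if no specific User-agent line follows 'User-agent: *' (step 5)."""
    seen_wildcard = False
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line.lower().startswith("user-agent:"):
            continue
        agent = line.split(":", 1)[1].strip()
        if agent == "*":
            seen_wildcard = True
        elif seen_wildcard:
            return False
    return True
```

A typical use would be to call backup_file("robots.txt") before editing, then run wildcard_group_is_last on the edited text before uploading it.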

Problems with the default Drupal robots.txt file

There are several problems with the default Drupal robots.txt file. If you use the robots.txt testing utility in Google Webmaster Tools (detailed instructions on this utility later in this chapter) to test each line of the file, you'll find that a lot of paths which look like they're being blocked will actually be crawled. The reason is that Drupal does not require the trailing slash (/) after the path to show you the content. A Disallow rule matches any URL path that begins with the listed text, so a rule ending in a slash does not match the same path without the slash. As a result, Googlebot will avoid the page with the slash but crawl the page without the slash.

Google what? Googlebot! Google and other search engines use server systems (sometimes called spiders, crawlers, or robots) to go around the Internet and find each web site. We sometimes refer to Google's system as the Googlebot to distinguish it from other search engine robots. While Google doesn't report this number anymore, it is estimated that the Googlebot crawls 10 billion web sites each week! That is a fast little robot.

For example, /admin/ is listed as disallowed. As you would expect, the testing utility shows that http://www.yourDrupalsite.com/admin/ is disallowed. But, put in http://www.yourDrupalsite.com/admin (without the trailing slash) and you'll see that it is allowed. Disaster! Fortunately, this is relatively easy to fix.
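The same behaviour can be reproduced locally with Python's standard urllib.robotparser module, which also matches rules by prefix. This is a sketch for illustration only: the two-line rule set stands in for the relevant part of the Drupal file, and Google's testing utility remains the authority on what Googlebot will actually do.

```python
from urllib.robotparser import RobotFileParser

# Only the rule discussed above; the real Drupal file contains many more lines.
RULES = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

SITE = "http://www.yourDrupalsite.com"

# With the trailing slash, the URL begins with "/admin/" and is blocked.
with_slash = parser.can_fetch("Googlebot", SITE + "/admin/")

# Without it, the URL path is "/admin", which does not begin with "/admin/".
without_slash = parser.can_fetch("Googlebot", SITE + "/admin")

print("with slash allowed:", with_slash)
print("without slash allowed:", without_slash)
```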



This article was sent to us by: Gregory F. at 01172010
