A crawler - also known as a spider, robot, or worm (not to be confused with a worm virus) - is an automated tool that visits a web page, finds the information on the page, and then follows links to other pages within that site. The job of the crawler is to find the information and hand it off to the search engine's indexers. Web crawlers do not actually search the Web at all.
They work much the way your browser does, sending a request to a web server for a web page, downloading everything on that page and giving it to the indexer.
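As a rough illustration of that browser-like step, the sketch below downloads a page and picks out the links a crawler would follow next. It is a minimal sketch, not how any particular search engine does it: the names `fetch`, `LinkExtractor`, and `extract_links` and the example.com addresses are assumptions made up for this example, and only Python's standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch(url):
    """Request a page the way a browser would and return its HTML as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


class LinkExtractor(HTMLParser):
    """Collects the target of every <a href="..."> tag on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content, used here instead of a live download.
page = '<p><a href="/about.html">About</a> <a href="http://example.org/">Partner</a></p>'
links = extract_links("http://example.com/index.html", page)
print(links)
```

In a real crawler the HTML passed to `extract_links` would come from `fetch`, and the downloaded page itself would be handed to the indexer.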
Crawlers find information in two ways. Early on, you could send the search engine your information and it would be added to the database. The crawler would take the information you sent and go retrieve the web pages. Unfortunately, people overwhelmed the "add URL" pages on the search engines with bogus posts and the search engine companies started to phase out that voluntary way of notifying them.
Now, crawlers look at the links on the web pages they find and follow each of those links in turn. This cuts down on the bogus URLs and helps the crawler be thorough. The crawler finds the pages and the indexer indexes them, but search engines do further refinement before the information is available to the public. The companies perform spam detection and removal, duplicate detection and removal, and some database quality testing. As a result, information found on a website and indexed may not be available in your search for several weeks.
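The link-following approach described above can be sketched as a simple queue of pages still to visit. This is only an illustration under stated assumptions: `TOY_WEB` is a made-up, in-memory stand-in for real web pages, and `crawl` is a hypothetical name, not any search engine's actual code.

```python
from collections import deque

# Hypothetical miniature "web": each URL maps to the links found on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seed, get_links):
    """Visit pages breadth-first, starting from one known URL."""
    frontier = deque([seed])  # pages discovered but not yet visited
    seen = {seed}             # every URL ever queued, so none is visited twice
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)     # a real crawler would hand the page to the indexer here
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl("http://example.com/", lambda url: TOY_WEB.get(url, []))
print(visited)
```

Because the crawler only ever visits URLs that some already-crawled page links to, a bogus address that nobody links to never enters the queue.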
All this comes at a cost. Crawling is extremely expensive for the search engine companies, so most search companies limit the number of pages that will be crawled on one website. That means that search crawlers may look at an entire website, but may only crawl a part of it, leaving a lot of valuable information not indexed.
These uncrawled pages can be located, but are intentionally not included in the search engine indices. They are what Gary Price and Chris Sherman like to call the "opaque web." These are not part of the "invisible web." They are simply pages that have not been indexed.
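To make the per-site limit concrete, here is a small sketch of a crawl that stops fetching pages from a site once a page budget is used up. The function name `crawl_with_budget`, the budget of two pages, and the example.com addresses are all assumptions for illustration; real search engines do not publish their limits.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_with_budget(seed, get_links, max_pages_per_site):
    """Breadth-first crawl that fetches at most max_pages_per_site pages per host."""
    frontier = deque([seed])
    seen = {seed}
    crawled, skipped = [], []
    pages_per_site = {}
    while frontier:
        url = frontier.popleft()
        site = urlparse(url).netloc
        if pages_per_site.get(site, 0) >= max_pages_per_site:
            skipped.append(url)  # located through a link, but never fetched
            continue
        pages_per_site[site] = pages_per_site.get(site, 0) + 1
        crawled.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return crawled, skipped


# Hypothetical five-page site; p4 is only reachable through p2.
SITE = {
    "http://example.com/": [
        "http://example.com/p1",
        "http://example.com/p2",
        "http://example.com/p3",
    ],
    "http://example.com/p1": [],
    "http://example.com/p2": ["http://example.com/p4"],
    "http://example.com/p3": [],
    "http://example.com/p4": [],
}

crawled, skipped = crawl_with_budget(
    "http://example.com/", lambda url: SITE.get(url, []), max_pages_per_site=2
)
print(crawled)
print(skipped)
```

In this toy run, p2 and p3 are located but left out, and p4 is never even discovered because the only page linking to it was skipped.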
Besides cost, the other major issue with crawlers is the time it takes to log all this information. While some crawlers can index millions of pages in a day, there is sometimes a significant amount of time between when the information is put on the Web, when it is found by the crawler, and when the crawler returns to recrawl, looking for new material. These time lag issues lead to inaccuracy in your results.
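A little date arithmetic shows how these delays add up. The sketch below assumes a crawler that returns on a fixed schedule and a fixed processing delay before new material becomes searchable; both figures, the dates, and the name `first_searchable` are invented for illustration and do not describe any real search engine.

```python
from datetime import date, timedelta


def first_searchable(published, last_crawl, recrawl_every, processing_delay):
    """Estimate the first day a newly published page could show up in results."""
    crawl_day = last_crawl
    # Step forward one recrawl at a time until a visit falls on or after publication.
    while crawl_day < published:
        crawl_day += recrawl_every
    return crawl_day + processing_delay


published = date(2010, 3, 10)             # hypothetical publication date
visible = first_searchable(
    published,
    last_crawl=date(2010, 3, 1),          # hypothetical previous visit
    recrawl_every=timedelta(days=14),     # assumed recrawl schedule
    processing_delay=timedelta(days=7),   # assumed spam/duplicate checking time
)
lag = visible - published
print(visible, lag.days)
```

Under these assumed numbers the page is picked up on the next visit five days after it is published and becomes searchable a week after that.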
There is an ongoing debate about the "freshness" of the search engines. Most search engine companies claim they constantly crawl and have only the freshest of information, but analysis by Gary Price and Greg Notess found that the search engine companies tend to be weeks behind on a regular basis, and many search tools are months behind in their efforts to recrawl and index material. How far behind and how much information they cover is a matter of some debate.
This article was sent to us by:
Damian Lissle at
08272010
© 2010 WebWorldarticles.com - All Rights Reserved.