Search engines miss a lot of material. A vast amount of content is either never found by search tools or is intentionally left out by search engines because of size or other limits. For example, when you use a search engine to search for The New York Times, its content never shows up because it has not been indexed by a search engine crawler. The New York Times requires you to fill out a registration form, and the search engine cannot crawl past the registration barrier. So while the content of The New York Times is available on the Internet, it falls into an area of content that has become known as the "invisible web."
Much of what search engines cannot reach is information tucked away in databases or behind "gated" sites that search crawlers cannot enter. Consider this: if you go to The New York Times on the Web, the newspaper's material is free, but you must register before you can look at the website. That barrier to entry, registration, stops the search engine crawler from ever reaching the material, and it causes an extraordinary amount of great material to be "invisible." None of the gated content is indexed, simply because the crawler cannot get past the registration form.
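To make the registration barrier concrete, here is a minimal sketch of a crawler walking a tiny in-memory site. The `Page` class, the `requires_registration` flag, and the example URLs are all hypothetical, invented for illustration; a real crawler would fetch pages over HTTP and would only discover the barrier when it was redirected to a form.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A hypothetical page in a toy, in-memory website."""
    text: str
    links: list = field(default_factory=list)
    requires_registration: bool = False


def crawl(site, start_url):
    """Return the set of URLs an anonymous crawler manages to index."""
    indexed = set()
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in indexed or url not in site:
            continue
        page = site[url]
        if page.requires_registration:
            # The crawler cannot fill out the form, so it never sees
            # this page or any of the links on it.
            continue
        indexed.add(url)
        queue.extend(page.links)
    return indexed


site = {
    "/": Page("Welcome", links=["/register", "/articles/1"]),
    "/register": Page("Sign-up form"),
    "/articles/1": Page("Full story", links=["/articles/2"],
                        requires_registration=True),
    "/articles/2": Page("Another story", requires_registration=True),
}

indexed = crawl(site, "/")
print(sorted(indexed))
```

In this toy site only the home page and the registration form end up in the index. The articles exist and are free to read, but the crawler never gets to them.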
Another problem is simple economics. It costs search engine companies money every time a page is fetched, indexed, and stored in the search engine's database. Since all of these companies must someday show a profit, many have figured out ways to cut costs. Many search engine crawlers limit the number of pages they crawl on a website, limit the number of pages they keep in the index by dumping older ones and replacing them with new ones, or restrict the kinds of pages they crawl by cataloging only certain types of domain names.
Because crawling is expensive and the size of the Web continues to skyrocket, most search engines intentionally limit the number of pages that will be crawled and indexed from any one site. Many limit the total number of pages in their index, dumping older pages as newer ones are found. Others limit how often they crawl, leaving some indexed pages stale or out of date. Still others crawl only the parts of pages or the domains they think will contain the most useful and reliable material. As many as 500 pages on one site may be crawled while thousands more are never looked at. Sometimes pages are crawled and then simply omitted or forgotten.
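The two limits described above, a cap on pages per site and a cap on the total index with older pages dumped first, can be sketched as follows. This is an illustration only: the class name `BoundedIndex` and the tiny limits are assumptions chosen to keep the example readable, and they do not describe how any particular search engine works.

```python
from collections import OrderedDict

MAX_PAGES_PER_SITE = 3   # hypothetical per-site cap
MAX_INDEX_SIZE = 5       # hypothetical cap on the whole index


class BoundedIndex:
    """Toy index that enforces a per-site cap and a total-size cap."""

    def __init__(self, per_site=MAX_PAGES_PER_SITE, total=MAX_INDEX_SIZE):
        self.per_site = per_site
        self.total = total
        self.pages = OrderedDict()   # (site, path) -> text, oldest first
        self.site_counts = {}        # site -> number of indexed pages

    def add(self, site, path, text):
        """Try to index a page; return False if the per-site cap rejects it."""
        key = (site, path)
        if key in self.pages:
            # Re-crawled page: refresh it and mark it as newest.
            self.pages[key] = text
            self.pages.move_to_end(key)
            return True
        if self.site_counts.get(site, 0) >= self.per_site:
            return False             # this site already used its allowance
        if len(self.pages) >= self.total:
            # Index is full: dump the oldest page to make room.
            old_key, _ = self.pages.popitem(last=False)
            self.site_counts[old_key[0]] -= 1
        self.pages[key] = text
        self.site_counts[site] = self.site_counts.get(site, 0) + 1
        return True


index = BoundedIndex()
results = [index.add("a.example", f"/p{i}", "text") for i in range(5)]
for i in range(3):
    index.add("b.example", f"/p{i}", "text")

print(results)
print(list(index.pages))
```

With these limits, the fourth and fifth pages of the first site are never indexed, and the oldest page of the first site is dumped once the second site fills the index.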
What makes this so important is that studies show the "invisible web" dwarfs all the material you can see on the World Wide Web. A BrightPlanet study in 2000 found the "invisible web" to be as much as 500 times larger than the visible web.
The invisible web falls into four groups. The first, the "opaque web," consists of pages and files that can be found by search tools but, for one reason or another, are not included in search engine indices. This includes pages that are "hidden" behind dynamic navigation codes. For the most part, the data you find in an opaque database tends to be subject-focused and cannot be easily found with a general-purpose search engine. Opaque web data tends to be more precise, more current, and more authoritative than what you would find using more general tools.
A second group of files consists of technically indexable pages that have deliberately been excluded from search engines by web page designers. These "private web" pages are ones where a password has been set up to protect the page from crawlers.
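Passwords are the mechanism named above. Another common way for a site owner to deliberately keep crawlers out, which this article does not discuss, is a robots.txt file. The sketch below uses Python's standard `urllib.robotparser` module to show how a well-behaved crawler would check such a file. The crawler name `ExampleBot`, the rules, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that closes one directory to all crawlers.
robots_txt = """\
User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" is an invented crawler name.
public_ok = parser.can_fetch(
    "ExampleBot", "https://example.com/index.html")
private_ok = parser.can_fetch(
    "ExampleBot", "https://example.com/members/profile.html")

print(public_ok, private_ok)
```

The page under /members/ is perfectly indexable from a technical standpoint. It stays out of the index only because the site owner asked crawlers to stay away.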
A third group, the "proprietary web," consists of pages that have been roped off and are accessible only to people who have agreed to special terms in exchange for seeing the content. One example is agreeing to fill out a registration form to get access to The New York Times pages. Perhaps the biggest difference between the private web and the proprietary web is money. Material accessible on the invisible web is not the same as that found on the proprietary web. In most cases "invisible web" material is free or inexpensive, whereas proprietary websites can be very expensive.
The fourth category is the "truly invisible web," which includes pages that use file formats that current-generation web crawlers are not programmed to handle. Most search engines were originally designed to focus on text, so pages that have only photos, graphics files, sound, or video and no text are often missed altogether, because they have no words that can be indexed easily. For such files, most search engines look at the coding on a page and record the file name and location details, but not a whole lot else. A page that consists of images, sound, and video with no text is something the crawler cannot handle, and it falls into the "truly invisible web" category.
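The point about text can be illustrated with a small sketch of what a text-focused indexer might pull out of a page. It uses Python's standard `html.parser` module. The `TextExtractor` class, the `indexable_words` helper, and the two sample pages are hypothetical, and real indexers are far more elaborate.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the visible words on a page, ignoring scripts and styles."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def indexable_words(html):
    """Return the list of words a text-only indexer could store."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


text_page = ("<html><body><h1>Deep Web</h1>"
             "<p>Databases hold hidden pages.</p></body></html>")
media_page = ('<html><body><img src="photo1.jpg">'
              '<video src="clip.mp4"></video></body></html>')

print(indexable_words(text_page))
print(indexable_words(media_page))
```

The first page yields six words to index. The second yields none, so a text-only index has nothing to match a query against even though the page was fetched.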
This article was sent to us by:
Matt Richards at
08282010