Selected Readings in Computer Science: Deep Web

Deep Web

Most writers these days do a significant part of their research using the World Wide Web, with the help of powerful search engines such as Google and Yahoo. There is so much information available that one could be forgiven for thinking that “everything” is accessible this way, but nothing could be further from the truth. For example, as of August 2005, Google claimed to have indexed 8.2 billion Web pages and 2.1 billion images. That sounds impressive, but it’s just the tip of the iceberg. Behold the deep Web.

According to Mike Bergman, chief technology officer at BrightPlanet Corp., more than 500 times as much information as traditional search engines “know about” is available in the deep Web. This massive store of information is locked up inside databases from which Web pages are generated in response to specific queries. Although these dynamic pages have a unique URL address with which they can be retrieved again, they are not persistent or stored as static pages, nor are there links to them from other pages.
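To make the idea concrete, the short Python sketch below builds such a dynamic address; the host name and query parameters are hypothetical, and the point is only that the URL can be typed again to re-run the query even though no static page sits behind it.

```python
# Sketch of how a dynamic page is addressed: its URL is unique and can be
# typed again to re-run the query, but the page itself is generated from a
# database on demand, is not stored statically, and nothing links to it.
# The host name and parameter names are hypothetical.
from urllib.parse import urlencode

def results_url(keyword: str, page: int = 1) -> str:
    query = urlencode({"q": keyword, "page": page})
    return f"https://catalog.example.com/search?{query}"

print(results_url("deep web"))  # https://catalog.example.com/search?q=deep+web&page=1
```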

The deep Web also includes sites that require registration or otherwise restrict access to their pages, prohibiting search engines from browsing them and creating cached copies.
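One typical mechanism behind such restrictions is a robots.txt exclusion rule; the brief Python sketch below, with a hypothetical host and paths, shows how a well-behaved spider is turned away before it can index or cache anything.

```python
# One typical access restriction: a robots.txt rule that keeps crawlers out,
# so the pages behind it are never browsed or cached. The host and paths
# here are hypothetical; a <meta name="robots" content="noarchive"> tag
# similarly blocks cached copies of individual pages.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /members/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

for url in ("https://example.com/public/faq.html",
            "https://example.com/members/reports.html"):
    allowed = parser.can_fetch("ExampleSpider", url)
    print(url, "->", "crawl" if allowed else "skip: stays in the deep Web")
```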

Let’s recap how conventional search engines create their databases. Programs called spiders or Web crawlers start by reading pages from a starting list of Web sites. A spider first reads each page on a site, indexes its content and adds the words it finds to the search engine’s growing database. When it finds a hyperlink to another page, it adds that new link to the list of pages to be indexed. In time, the program reaches all linked pages, presuming that the search engine doesn’t run out of time or storage space. These linked pages constitute what most of us use and refer to as the Internet or the Web. In fact, we have only scratched the surface, which is why this realm of information is often called the surface Web.
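A rough sketch of that crawl-and-index loop, with hypothetical fetch_page() and extract_links() helpers standing in for the real networking and HTML parsing, might look like this in Python:

```python
# A rough sketch of the crawl-and-index loop. fetch_page() and extract_links()
# are hypothetical helpers; a production spider would also honor robots.txt
# and deduplicate URLs more carefully.
from collections import deque

def crawl(seed_urls, fetch_page, extract_links, max_pages=1000):
    index = {}                     # word -> set of URLs: the engine's growing database
    queue = deque(seed_urls)       # pages still waiting to be read
    queued = set(seed_urls)
    fetched = 0
    while queue and fetched < max_pages:        # stop when the time/storage budget runs out
        url = queue.popleft()
        text = fetch_page(url)                  # read the page
        fetched += 1
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)    # add its words to the index
        for link in extract_links(text, url):         # follow static hyperlinks only
            if link not in queued:
                queued.add(link)
                queue.append(link)
    return index           # everything reachable this way makes up the "surface Web"
```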

Why don’t our search engines find the deeper information? For starters, let’s consider a typical data store that an individual or enterprise has collected, containing books, texts, articles, images, laboratory results and various other kinds of data in diverse formats. Typically, we access such database information by means of a query or search: we type in the subject or keyword we’re looking for, the database retrieves the appropriate content, and we are shown a page of results to our query.
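In database terms, that interaction might look roughly like the following Python sketch, where the table and its rows are invented for illustration:

```python
# A minimal sketch of the query-and-results flow, using an in-memory SQLite
# table; the table name and rows are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE documents (title TEXT, body TEXT)")
db.executemany("INSERT INTO documents VALUES (?, ?)", [
    ("Lab results, batch A", "Spectrometry readings for the 2004 samples."),
    ("Deep Web overview", "Content reachable only through query forms."),
])

keyword = "deep"                         # what the user types into the search box
rows = db.execute(
    "SELECT title FROM documents WHERE title LIKE ? OR body LIKE ?",
    (f"%{keyword}%", f"%{keyword}%"),
).fetchall()

for (title,) in rows:                    # the page of results shown back to the user
    print(title)
```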

If we can do this easily, why can’t a search engine? We assume that the search engine can reach the query input (or search) page, and that it will capture the text on that page and in any pages that have static hyperlinks to it. But unlike the typical human user, the spider can’t know what words it should type into the query field. Clearly, it can’t type in every word it knows about, and it doesn’t know what’s relevant to that particular site or database. If there’s no easy way to query, the underlying data remains invisible to the search engine. Indeed, any pages that are not eventually connected by links from pages in a spider’s initial list will be invisible and thus are not part of the surface Web as that spider defines it.
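The sketch below gives a spider’s-eye view of such a page, using a hypothetical HTML snippet: the static hyperlink can be followed, but the query form offers nothing a crawler knows how to type.

```python
# A spider can follow the static hyperlink it finds on a query page, but it
# has no way to know what to type into the form's text field, so the database
# behind the form stays invisible to it. The HTML snippet is hypothetical.
from html.parser import HTMLParser

PAGE = """
<a href="/about.html">About this archive</a>
<form action="/search" method="get">
  <input type="text" name="q">
  <input type="submit" value="Search">
</form>
"""

class SpiderView(HTMLParser):
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            print("follow link:", attrs["href"])
        elif tag == "form":
            print("form found at", attrs.get("action"), "- no keywords to submit, skipping")

SpiderView().feed(PAGE)
```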

How Deep? How Big?

According to a 2001 BrightPlanet study, the deep Web is very big indeed: The company found that the 60 largest deep Web sources contained 84 billion pages of content with about 750TB of information. These 60 sources constituted a resource 40 times larger than the surface Web. Today, BrightPlanet reckons the deep Web totals 7500TB, with more than 250,000 sites and 500 billion individual documents. And that’s just for Web sites in English or European character sets. (For comparison, remember that Google, the largest crawler-based search engine, now indexes some 8 billion pages.)
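For a rough sense of scale, those cited figures can be combined in a little back-of-the-envelope arithmetic:

```python
# Back-of-the-envelope arithmetic with the figures cited above, simply to put
# the BrightPlanet and Google numbers side by side.
deep_web_docs = 500e9        # individual documents in the deep Web (BrightPlanet)
deep_web_sites = 250_000     # deep Web sites (BrightPlanet)
google_pages = 8e9           # pages indexed by the largest crawler-based engine

print(f"Average documents per deep Web site: {deep_web_docs / deep_web_sites:,.0f}")
print(f"Deep Web documents per Google-indexed page: {deep_web_docs / google_pages:.1f}")
```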

The deep Web is getting deeper and bigger all the time. Two factors seem to account for this. First, newer data sources (especially those not in English) tend to be of the dynamic-query/searchable type, which are generally more useful than static pages. Second, governments at all levels around the world have made commitments to making their official documents and records available on the Web.

Interestingly, deep Web sites appear to receive 50% more monthly traffic than surface sites do, and they have more sites linked to them, even though they are not really known to the public. They are typically narrower in scope but likely to have deeper, more detailed content.
