Doug Cutting Interview -- On Search Engine Development


Doug Cutting is the primary developer of the Lucene and Nutch open source search projects. He has worked in the search technology field for nearly two decades, including five years at Xerox PARC, three years at Apple, and four years at Excite.

 

What do you do for a living, and how are you involved in search engine development?

I work from home on Lucene & Nutch, two open source search projects. I get paid by various contracts related to these projects. I have an ongoing relationship with Yahoo! Labs that funds me to work part-time on Nutch. I take other short-term contract work related to both projects.

Could you briefly tell us about Nutch, and where you are trying to take it?

First, let me say what Lucene is, to provide context. Lucene is a software library for full-text search. It's not an application, but rather technology that can be incorporated into applications. It's an Apache project and is widely used. A small subset of the folks using Lucene is listed at wiki.apache.org/jakarta-lucene/PoweredBy.
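To make the library-versus-application distinction concrete, here is a minimal sketch of embedding Lucene in a Java program: indexing a document and running a query over it. It loosely follows the classic Lucene 1.x-era API that was current around the time of this interview; class names and signatures differ in later releases, so treat it as illustrative rather than definitive.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            // Build an index in the "index" directory (true = create it from scratch).
            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);
            Document doc = new Document();
            doc.add(Field.Text("title", "Nutch: open source web search"));
            doc.add(Field.Text("contents", "Nutch adds a crawler and web search on top of Lucene."));
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            // Search the index with a free-text query and print matching titles.
            IndexSearcher searcher = new IndexSearcher("index");
            Query query = QueryParser.parse("web search", "contents", new StandardAnalyzer());
            Hits hits = searcher.search(query);
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.score(i) + "  " + hits.doc(i).get("title"));
            }
            searcher.close();
        }
    }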

Nutch builds on Lucene to implement web search. Nutch is an application: you can download it and run it. It adds a crawler and other web-specific stuff to Lucene. Nutch aims to scale from simple intranet searching to search of the entire web, like Google and Yahoo!. To rival these guys you need a lot of tricks. We've demoed it on over 100M pages, and it's designed to scale to over 1B pages. But it also works well on a single machine, searching just a few servers.

From your perspective, what are the core principles of search engine architecture? What are the main things to consider and the big modules search engine software can be broken up into?

Let's see, the major bits are:

  • fetching – downloading lists of pages that have been referenced.

  • database – keeping track of what pages you've fetched, when you fetched them, what they've linked to, etc.

  • link analysis – analyzing the database to assign a priori scores to pages (e.g., PageRank & WebRank) and to prioritize fetching. The value of this is somewhat overrated. Indexing anchor text is probably more important (that's what makes, e.g., Google Bombing so effective).

  • indexing – combines content from the fetcher, incoming links from the database, and link analysis scores into a data structure that's quickly searchable.

  • searching – ranks pages against a query using an index.

To scale to billions of pages, all of these must be distributable, i.e., each must be able to run in parallel on multiple machines.
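To make the first two pieces above (fetching and the web database) concrete, here is a small single-machine sketch in Java using only the JDK. It is not Nutch code; a real crawler would persist its database, respect robots.txt, normalize URLs, prioritize fetches using link analysis, and distribute every stage across machines.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.*;
    import java.util.regex.*;

    // Toy fetch-and-record loop: download pages, record what was fetched and when,
    // extract outgoing links, and feed them back into the fetch queue.
    public class TinyCrawler {
        static final Pattern HREF =
            Pattern.compile("href=[\"'](http[^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

        public static void main(String[] args) throws Exception {
            Map<String, Long> fetchedAt = new HashMap<String, Long>();      // the "web database"
            Map<String, List<String>> outlinks = new HashMap<String, List<String>>();
            Deque<String> frontier = new ArrayDeque<String>();
            frontier.add("http://example.com/");                            // seed URL

            while (!frontier.isEmpty() && fetchedAt.size() < 10) {          // small page budget
                String url = frontier.poll();
                if (fetchedAt.containsKey(url)) continue;                   // already fetched

                String html = fetch(url);
                fetchedAt.put(url, System.currentTimeMillis());

                List<String> links = extractLinks(html);
                outlinks.put(url, links);
                frontier.addAll(links);                                     // naive FIFO prioritization
                System.out.println(url + " -> " + links.size() + " links");
            }
        }

        static String fetch(String url) throws Exception {
            StringBuilder sb = new StringBuilder();
            BufferedReader in = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
            for (String line; (line = in.readLine()) != null; ) sb.append(line).append('\n');
            in.close();
            return sb.toString();
        }

        static List<String> extractLinks(String html) {
            List<String> links = new ArrayList<String>();
            Matcher m = HREF.matcher(html);
            while (m.find()) links.add(m.group(1));
            return links;
        }
    }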

You are saying people can download Nutch to run on their machines. Is there a possibility for small-time webmasters who don't have full control over their Apache servers to make use of Nutch?

Unfortunately, most of them probably won't. Nutch requires a Java servlet container, which some ISPs support, but most do not.

Can I combine Lucene and the Google Web API, or Lucene and some other application I wrote?

A couple of folks have contributed Google-like APIs for Nutch, but none has yet made it into the system. We should have something like this soon, however.

What do you think is the biggest hurdle to overcome when implementing a search engine – is it the hardware and storage barrier, or the ranking algorithms? Also, how much space do you need to ensure the search engine makes some sense, say, for an engine restricted to searching a million RSS feeds?

Nutch requires around 10 KB per web page in total. RSS feeds tend to point to small pages, so you'd probably do better than that. Nutch doesn't yet have specific support for RSS.
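For a rough sense of scale, assuming that 10 KB-per-page figure holds: one million feed pages works out to about 10 GB of crawl and index data (1,000,000 × 10 KB ≈ 10 GB) before any replication, which fits comfortably on a single commodity disk; since feed pages tend to be smaller than average, the real number would likely be lower.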

Is it easy to get funded by Yahoo! Labs? Who can apply, and what do you need to give back in return?

I was invited, I didn't apply. I have no idea what the process is.

Did Google Inc. ever show interest in Nutch?

I've talked with folks there, including Larry Page. They'd like to help, but they can't see a way to do that without also helping their competitors.

In Nutch, do you implement your own PageRank or WebRank system? What considerations go into ranking?

Yes, Nutch has a link analysis module. Use of it is optional. For intranet search we find that it's really not needed.
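For readers who have not seen link analysis before, the sketch below shows the basic PageRank power iteration over a tiny hard-coded link graph. It illustrates the general idea only; it is not Nutch's link-analysis module, and real systems iterate over billions of edges across many machines.

    import java.util.*;

    // Minimal PageRank power iteration over a three-page link graph.
    public class TinyPageRank {
        public static void main(String[] args) {
            // Adjacency list: page -> pages it links to.
            Map<String, List<String>> links = new HashMap<String, List<String>>();
            links.put("A", Arrays.asList("B", "C"));
            links.put("B", Arrays.asList("C"));
            links.put("C", Arrays.asList("A"));

            double damping = 0.85;
            int n = links.size();

            // Start with a uniform score for every page.
            Map<String, Double> rank = new HashMap<String, Double>();
            for (String page : links.keySet()) rank.put(page, 1.0 / n);

            for (int iter = 0; iter < 20; iter++) {
                Map<String, Double> next = new HashMap<String, Double>();
                for (String page : links.keySet()) next.put(page, (1 - damping) / n);
                // Each page distributes its current score evenly over its outlinks.
                for (Map.Entry<String, List<String>> e : links.entrySet()) {
                    double share = rank.get(e.getKey()) / e.getValue().size();
                    for (String target : e.getValue())
                        next.put(target, next.get(target) + damping * share);
                }
                rank = next;
            }
            System.out.println(rank);   // a priori importance scores for A, B, C
        }
    }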

I guess you've heard this before, but doesn't an open-source search engine open itself up to black-hat search engine optimization?

Potentially.

Let's say it takes spammers six weeks to reverse engineer a closed-source search engine's latest spam-detection algorithm. With an open source engine, this can be done much faster. But in either case, the spammers will eventually figure out how it works; the only difference is how quickly. So the best anti-spam techniques, open or closed source, are those that continue to work even when their mechanism is known.

Also, if you, e.g., remove detected spammers from the index for six months, then there's not much they can do, once detected, to change their sites to elude detection. And if your spam detectors are based on statistical analyses of good and bad example sites, then you can, overnight, notice new patterns and remove the spammers before they have a chance to respond.

So open source can make it a little harder to stop spam, but it doesn't make it impossible. And closed-source search engines have not been able to use secrecy to solve this problem. I think the closed-source advantage here is not as great as folks imagine it to be.

How does Nutch relate to the distributed web crawler Grub, and what do you think of it?

As far as I can tell, Grub is a project that lets folks donate their hardware and bandwidth to LookSmart's crawling effort. Only the client is open source, not the server, so folks can neither deploy their own version of Grub, nor can they access the data that Grub gathers.

What about distributed crawling more generally? When a search engine gets big, crawl-related expenses are dwarfed by search-related expenses. So a distributed crawler doesn't significantly improve costs, rather it makes more complicated something that is already relatively inexpensive. That's not a good tradeoff.

Widely distributed search is interesting, but I'm not sure it can yet be done while keeping things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries, they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even ones connected by DSL and cable modems, the latencies would be much higher and searches would probably take several seconds or longer. And hence it wouldn't be as good a search engine.

You are emphasizing the importance of speed in a search engine. I'm often puzzled by how fast Google returns results. Do you have an idea how they do it, and what is your experience with Nutch?

I believe Google does roughly what Nutch does: they broadcast queries to a number of nodes, each of which returns the top results over a set of pages. With a couple of million pages per node, disk accesses can be avoided for most queries, and each node can process tens to hundreds of queries per second. If you want to search billions of pages, then you have to broadcast each query to thousands of nodes. That's a lot of network traffic.

Some of this is described in www.computer.org/micro/mi2003/m2022.pdf
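The broadcast-and-merge step described above can be sketched in a few lines. The SearchNode class below is a hypothetical in-memory stand-in for a remote index partition — it is not a Nutch or Google API — and the merge simply keeps the k highest-scoring hits across all partitions.

    import java.util.*;

    // Scatter/gather sketch: send a query to every partition, merge the per-partition
    // top hits into one global top-k list by score.
    public class ScatterGather {
        static class Hit {
            final String url; final float score;
            Hit(String url, float score) { this.url = url; this.score = score; }
            public String toString() { return score + " " + url; }
        }

        // Stand-in for a remote node holding a slice of the index.
        static class SearchNode {
            final Map<String, Float> docs;   // url -> score for the (fixed) query
            SearchNode(Map<String, Float> docs) { this.docs = docs; }
            List<Hit> topHits(String query, int k) {
                List<Hit> hits = new ArrayList<Hit>();
                for (Map.Entry<String, Float> e : docs.entrySet())
                    hits.add(new Hit(e.getKey(), e.getValue()));
                hits.sort((a, b) -> Float.compare(b.score, a.score));   // highest score first
                return hits.subList(0, Math.min(k, hits.size()));
            }
        }

        public static void main(String[] args) {
            List<SearchNode> nodes = Arrays.asList(
                new SearchNode(Map.of("http://a.example/1", 0.9f, "http://a.example/2", 0.4f)),
                new SearchNode(Map.of("http://b.example/1", 0.7f, "http://b.example/2", 0.6f)));

            int k = 3;
            // Gather: keep a global top-k across all partitions with a priority queue.
            PriorityQueue<Hit> best = new PriorityQueue<Hit>((a, b) -> Float.compare(a.score, b.score));
            for (SearchNode node : nodes) {                 // "broadcast" the query
                for (Hit h : node.topHits("web search", k)) {
                    best.add(h);
                    if (best.size() > k) best.poll();       // drop the lowest-scoring hit
                }
            }
            List<Hit> results = new ArrayList<Hit>(best);
            results.sort((a, b) -> Float.compare(b.score, a.score));
            System.out.println(results);
        }
    }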

When you mention spam, do you have any spam-fighting algorithms in Nutch? How can one differentiate between spam patterns like link farms and sites which just happen to be very popular?

We haven't yet had time to start working on this, but it's obviously an important area. Before we get to link farms we need to do the simple stuff: look for word stuffing, white-on-white text, etc.
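As one concrete example of that "simple stuff", here is a tiny word-stuffing heuristic: flag a page when a single term accounts for an implausibly large share of its text. The 10% threshold is an arbitrary illustration, not a value taken from Nutch.

    import java.util.HashMap;
    import java.util.Map;

    // Naive word-stuffing heuristic: if one term dominates the page text, flag it.
    public class WordStuffingCheck {
        static boolean looksStuffed(String text, double maxTermFraction) {
            String[] terms = text.toLowerCase().split("\\W+");
            if (terms.length == 0) return false;
            Map<String, Integer> counts = new HashMap<String, Integer>();
            int max = 0;
            for (String t : terms) {
                if (t.isEmpty()) continue;
                int c = counts.merge(t, 1, Integer::sum);
                if (c > max) max = c;
            }
            return (double) max / terms.length > maxTermFraction;
        }

        public static void main(String[] args) {
            String spammy = "cheap pills cheap pills cheap pills buy cheap pills now cheap pills";
            String normal = "Nutch builds on Lucene to implement web search with a crawler and an index.";
            System.out.println(looksStuffed(spammy, 0.10));   // true
            System.out.println(looksStuffed(normal, 0.10));   // false
        }
    }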

I think the key to search quality in general (of which spam detection is a sub-problem) is to have a trusted supply of hand-evaluated search results. With this, one can train a ranking algorithm to generate better results. (Spammy results are just one kind of bad result.) Commercial search engines get trusted evaluations by hiring folks. It remains to be seen how Nutch will do this. We obviously cannot just accept all donated evaluations, or else spammers will spam the evaluations. So we need a means of establishing the trustability of volunteer evaluators. I think a peer-review system, perhaps something like Slashdot's karma system, could work here.
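One way to read the karma idea: each volunteer's relevance judgments count in proportion to a trust score earned through peer review, so an account set up to spam the evaluations carries little weight. The sketch below is purely hypothetical — none of these classes exist in Nutch.

    import java.util.*;

    // Hypothetical trust-weighted voting: aggregate volunteer judgments on one search result,
    // weighting each vote by the evaluator's karma.
    public class TrustWeightedJudgments {
        static class Judgment {
            final String evaluator; final boolean relevant;
            Judgment(String evaluator, boolean relevant) { this.evaluator = evaluator; this.relevant = relevant; }
        }

        public static void main(String[] args) {
            Map<String, Double> karma = new HashMap<String, Double>();
            karma.put("alice", 0.9);       // long-standing, peer-reviewed evaluator
            karma.put("bob", 0.7);
            karma.put("spammer42", 0.05);  // new or distrusted account

            List<Judgment> votes = Arrays.asList(
                new Judgment("alice", true),
                new Judgment("bob", true),
                new Judgment("spammer42", false));

            double weightedYes = 0, totalWeight = 0;
            for (Judgment j : votes) {
                double w = karma.getOrDefault(j.evaluator, 0.1);   // unknown evaluators get little weight
                totalWeight += w;
                if (j.relevant) weightedYes += w;
            }
            // Fraction of trust-weighted votes that call the result relevant.
            System.out.printf("relevance score = %.2f%n", weightedYes / totalWeight);
        }
    }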

Where do you see search engines heading in the near and far future, and what do you think are the biggest hurdles to overcome from a developer's perspective?

Sorry, I'm not very imaginative here. My prediction is that web search in the coming decade is going to look more or less like web search of today. It's a safe bet. Web search evolved quickly for the first few years. It started in 1994 with WebCrawler, using standard information retrieval methods. The development of more web-specific methods took a few years, culminating in Google's 1998 launch. Since then, the introduction of new methods has slowed dramatically. The low-hanging fruit has been harvested. Innovation is easy when an area is young, and becomes more difficult as the field matures. Web search grew up in the 1990s, is now a cash cow, and will soon be a commodity.

As far as development challenges, I think operational reliability is a big one. We're working on developing something like GFS, the Google Filesystem. Stuff like this is essential to large-scale web search: you cannot let a failure of any single component cause a major hiccough; you must be able to easily scale by throwing more hardware into the pool, without massive reconfiguration; and you can't require an army of operators – things should largely fix themselves.

Translator's note / Dedian

As the creator of the two major Apache open source projects Lucene and Nutch (along with related subprojects such as Lucy, Lucene4C, and Hadoop), Doug Cutting has long been followed closely by search engine developers. After four years working for Yahoo as a contractor, he formally joined Yahoo as an employee this year.

The interview above dates from roughly two years earlier; the original English text ("Doug Cutting Interview") is easy to find with a quick Google search. I hope it offers a useful starting point for newcomers to search engine development.

 