Translating My Graduation Thesis Abstract

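I haven't translated anything for quite a few days; after the thesis defense, I let myself slack off. Here is my translation of my graduation thesis abstract. I'm not sure how good it is...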
------------------------------------------------------------------------------------------------------------------------

隨着互聯網的迅猛發展,通用搜索引擎逐步顯現其侷限性,對此,定向抓取相關網頁資源的聚焦爬蟲應運而生。聚焦爬蟲是一個自動下載網頁的程序,它根據既定的抓取目標,有選擇的訪問萬維網上的網頁與相關的鏈接,獲取所需要的信息。與通用爬蟲不同,聚焦爬蟲並不追求大的覆蓋,而將目標定爲抓取與某一特定主題內容相關的網頁,爲面向主題的用戶查詢準備數據資源。

Heritrix由於其靈活的模塊式體系結構設計,爲開發者擴展相關部件定製符合特定需求的聚焦爬蟲提供了基礎。

開發垂直搜索引擎的時候,爲了方便全文檢索工具對數據資料建立索引,需要進一步處理網絡爬蟲獲取的數據,特別是網頁數據。而HTMLParser提供了提取文本信息的API,使我們擺脫繁瑣的正則匹配過程。

本文主要介紹如何基於開源爬蟲Heritrix進行擴展定製面向竹藤領域的網絡爬蟲,利用HTMLParser包對爬取的結果進行再次解析處理,並採用LAMP+jQuery 技術開發一個簡單的竹藤數據搜索引擎。


關鍵字    竹藤,網絡爬蟲, Heritrix, HTMLParser, LAMP


ABSTRACT

With the rapid development of the Internet, general-purpose search engines are gradually showing their limitations. In response, focused crawlers, which selectively fetch web resources relevant to a target, have emerged. A focused crawler is a program that downloads web pages automatically: guided by a predefined crawl target, it selectively visits web pages and related links on the World Wide Web to obtain the required information. Unlike a general-purpose crawler, a focused crawler does not aim for broad coverage; instead, it sets out to fetch pages related to a specific topic, preparing data resources for topic-oriented user queries.
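To make the idea of selective visiting concrete, here is a minimal sketch of a best-first focused crawl. It is not Heritrix code and not the thesis implementation: the names `TOPIC_TERMS`, `relevance`, `focused_crawl`, and `TOY_WEB` are hypothetical, the relevance measure is a deliberately naive keyword ratio, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import heapq

# Hypothetical topic vocabulary; a real crawler would use a richer model.
TOPIC_TERMS = {"bamboo", "rattan"}


def relevance(text):
    """Return the fraction of words in `text` that are topic terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(w.strip(".,;:!?") in TOPIC_TERMS for w in words)
    return hits / len(words)


def focused_crawl(seed, fetch, threshold=0.05, max_pages=100):
    """Visit pages best-first and expand links only from on-topic pages.

    `fetch(url)` is an assumed callback returning (text, links) or None.
    """
    frontier = [(-1.0, seed)]  # max-heap emulated with negated scores
    seen = {seed}
    kept = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        fetched += 1
        if page is None:
            continue
        text, links = page
        score = relevance(text)
        if score < threshold:
            continue  # off-topic page: neither keep it nor follow its links
        kept.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return kept


# Toy in-memory "web": url -> (page text, outgoing links).
TOY_WEB = {
    "a": ("Bamboo and rattan products overview", ["b", "c"]),
    "b": ("Football scores and news", ["d"]),
    "c": ("Rattan furniture made from rattan cane", ["d"]),
    "d": ("Bamboo forest management", []),
}

print(focused_crawl("a", TOY_WEB.get))
```

The off-topic page `b` is fetched but discarded, and its links are never followed; that is the difference from a general-purpose crawler, which would keep and expand every page it reaches.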

Thanks to its flexible, modular architecture, Heritrix gives developers a foundation for extending its components and customizing a focused crawler that meets specific needs.

When developing a vertical search engine, the data acquired by the web crawler, especially web pages, must be processed further so that a full-text search tool can index it easily. HTMLParser provides APIs for extracting text, which frees us from the tedious process of regular-expression matching.
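As an illustration of parser-based text extraction, the sketch below pulls the visible text out of a page without any regular expressions. Note the swap: it uses the `html.parser.HTMLParser` class from Python's standard library as a stand-in, not the Java HTMLParser package used in the thesis. The names `TextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><title>Bamboo</title><style>p{color:red}</style></head>"
    "<body><p>Rattan <b>cane</b> chairs.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(extract_text(sample))
```

Text produced this way can be handed directly to a full-text indexer, which is the role the abstract assigns to the parsing step.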

This paper mainly describes how to extend and customize the open-source crawler Heritrix into a web crawler for the bamboo and rattan domain, how to use the HTMLParser package to re-parse the crawled results, and how to use LAMP and jQuery to develop a simple search engine for bamboo and rattan data.


Keywords    bamboo and rattan, web crawler, Heritrix, HTMLParser, LAMP
