A First Look at Jericho HTML Parser

Author: SharpStill

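Jericho is the most popular of the newer HTML parsing frameworks on SourceForge, and it has earned that position. Because few developers in China use it so far, Chinese-language tutorials and reference material are scarce, which makes it hard to learn. So let us look at what this framework offers through the metrics a beginner cares about, such as learning complexity and amount of code. It can also be used to build an open-source crawler engine.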

This example uses a shopping site, Taobao, as the parsing target.

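Taobao's pages fall into http://list.taobao.com and http://www.taobao.com/go/chn/game (similar to an album) and http://item.taobao.com (similar to a video), with many more pages of the same kinds beneath them. We use Jericho HTML Parser as the page-parsing framework to see what it can do.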

The XML for this page-parsing framework is written as follows:

The core class of Jericho HTML Parser is Source. A Source object represents an HTML document, and it can be built from a URL or from a String.

In certain circumstances you may be able to improve performance by calling the fullSequentialParse() method before calling any tag search methods. See the documentation of the fullSequentialParse() method for details.

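This sentence from the documentation says that, in certain circumstances, calling fullSequentialParse() can speed up parsing. The documentation of the method itself states: Calling this method can greatly improve performance if most or all of the tags in the document need to be parsed.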

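If a class parses most or all of the tags in a document, for example when extracting every URL or image link from a page, calling this method speeds up extraction. One point to note: it only takes effect when it is called in the statement immediately after the Source object is created with new. The Tag Search Methods (described in detail in the documentation) can then be called right after it.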
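The call order described above can be sketched as follows. This is a minimal illustration, assuming Jericho 3.x is on the classpath; the class name LinkExtractor and the method extractLinks are hypothetical and not part of the library or of the original project.

```java
import java.util.ArrayList;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

// Hypothetical helper: collects every link target and image source in a page.
final class LinkExtractor {

    private LinkExtractor() {}

    static List<String> extractLinks(String html) {
        Source source = new Source(html);
        // Parse all tags up front, right after construction and before any
        // tag search method is called.
        source.fullSequentialParse();

        List<String> links = new ArrayList<>();
        // Tag search methods follow the full sequential parse.
        for (Element anchor : source.getAllElements(HTMLElementName.A)) {
            String href = anchor.getAttributeValue("href");
            if (href != null) {            // <a> without href, e.g. a named anchor
                links.add(href);
            }
        }
        for (Element image : source.getAllElements(HTMLElementName.IMG)) {
            String src = image.getAttributeValue("src");
            if (src != null) {
                links.add(src);
            }
        }
        return links;
    }
}
```

Calling LinkExtractor.extractLinks(pageHtml) returns the anchor targets first, then the image sources, each group in document order. Whether the up-front parse pays off depends on how many of the page's tags are actually visited.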

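We take the following page as the extraction example:

The page contains these items: price, shipping information, region, favorites count, and item type. Extracting them shows how much programming effort the framework saves.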

package com.test.html;

import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;

import com.test.html.bean.ShoppingDetail;

public class HtmlParseTest {
	
	public static ShoppingDetail extract(String inputHtml){
		Source source = new Source(inputHtml);
		// The bid form carries price, region and shipping info as input fields.
		Element form  = source.getElementById("J_FrmBid");
		List<Element> inputArea = form.getAllElements("input");
		String price ="";
		String area ="";
		String transportInfo="";
		
		for(Element input : inputArea){
			// getAttributeValue returns null when the attribute is absent, so the
			// constant goes on the left to avoid a NullPointerException.
			String name = input.getAttributeValue("name");
			if("buy_now".equals(name))price = input.getAttributeValue("value");
			if("region".equals(name))area =  input.getAttributeValue("value");
			if("who_pay_ship".equals(name))transportInfo =  input.getAttributeValue("value");
			
		}
		// Flatten the "other clearfix" block to plain text, then cut the fields
		// out by the label text that precedes each of them on the page.
		Element others  = source.getAllElementsByClass("other clearfix").get(0);
		String otherInfo = others.getContent().getTextExtractor().toString().trim();
		int startBabyType =otherInfo.indexOf("寶貝類型：");   // label: item type
		int endBabyType=  otherInfo.indexOf("收藏人氣：");     // label: favorites count
		String babyType = otherInfo.substring(startBabyType+5,endBabyType);   // 5 = label length
		int endStore = otherInfo.indexOf("類似收藏");          // label: similar favorites
		String storeCount = otherInfo.substring(endBabyType+5,endStore-1).trim();
		ShoppingDetail detail = new ShoppingDetail();
		detail.setArea(area);
		detail.setBabyType(babyType);
		detail.setPrice(price);
		detail.setStroreCount(Integer.parseInt(storeCount));
		detail.setTransportInfo(transportInfo);
		return detail;
	}
	
	public static void main(String[] args) throws Exception {
		// HttpClientUtils is a helper class from this project (not shown here).
		String content = HttpClientUtils.getContent("http://item.taobao.com/item.htm?id=3144581940", "UTF-8");
		ShoppingDetail detail = HtmlParseTest.extract(content);
		System.out.println(detail);
	}
}

The output is as follows:

com.test.html.bean.ShoppingDetail@1aaa14a[title=索尼 PSP 4.3寸搖桿遊戲機 8G mp5高清電影+TV輸出+收音,price=235.00,transportInfo=賣家承擔運費,area=廣東深圳,stroreCount=17711,babyType=全新 ]

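The core code for extracting this page is just the function above. In terms of programming complexity it is considerably simpler than HtmlParser, and the cumbersome visitor pattern is gone.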
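The function above cuts fields out of the flattened page text with indexOf and hard-coded label lengths. As a hedged sketch of the same idea in a reusable form, the helper below returns the text between two labels and reports a missing label instead of throwing. The class MarkerExtractor, the method between, and the English sample labels are all hypothetical stand-ins, not part of Jericho or of the original project.

```java
import java.util.Optional;

// Hypothetical helper for label-delimited fields in already-flattened page text.
final class MarkerExtractor {

    private MarkerExtractor() {}

    /**
     * Returns the trimmed text between startMarker and the first endMarker
     * that follows it, or Optional.empty() if either marker is not found.
     */
    static Optional<String> between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // Use the marker's own length instead of a hard-coded offset.
        int from = start + startMarker.length();
        int end = text.indexOf(endMarker, from);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(text.substring(from, end).trim());
    }

    public static void main(String[] args) {
        // Sample text with made-up English labels standing in for the page's labels.
        String text = "Type: New Favorites: 17711 Similar";
        System.out.println(between(text, "Type:", "Favorites:").orElse("(missing)"));
        System.out.println(between(text, "Favorites:", "Similar").orElse("(missing)"));
    }
}
```

Returning Optional keeps a page-layout change from surfacing as a StringIndexOutOfBoundsException deep inside the extractor; the caller decides what a missing field means.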

For comparison, here is the amount of code HtmlParser needs to parse the same page:

@Override
	public boolean execute(Context context) throws Exception {
		// Remove this block if Spring IoC can be integrated with Commons Chain
		parserTool = new SingleHtmlParserTool();
		filterUtils = new FilterUtils();
		filterUtils.setTool(new PropertiesTool("/Config/Properties/TaobaoFilter.properties"));
		
		// Remove the block above once Spring's wiring mechanism has been studied
		
		String url = ((String)context.get("url")).toLowerCase();
		
		String seller_Url = "";
		String seller_Nickname = "";
		String seller_Taobao_Id = "";
		
		
		if(url.contains(Url_Pattern)){
			System.out.println("Start extracting data from "+url);
			NodeList forms = parserTool.getContainerInnerTags(url, formFilter);
			Map<String,String> valueMap = new HashMap<String,String>();//key-value map to store
			for(int i = 0;i<forms.size();i++){
				FormTag form = (FormTag)forms.elementAt(i);
				NodeList inputs = form.getFormInputs();
				for(int j = 0;j<inputs.size();j++){
					InputTag input = (InputTag)inputs.elementAt(j);
					String inputName = input.getAttribute("name");
					String inputValue = input.getAttribute("value");
					valueMap.put(inputName, inputValue);
				}
				
				// Begin: Context values needed later by the seller analysis step
				seller_Taobao_Id = valueMap.get("seller_id");
				seller_Nickname = valueMap.get("seller_nickname");
				seller_Url = "http://rate.taobao.com/user-rate-"+seller_Taobao_Id+".htm";
				// End: Context values needed later by the seller analysis step
				
				filterUtils.filterMap(valueMap, filterStoreItemKey,
						merchandiseItemSeparatorKey,
						DictItem.MerchandiseItemSiteToEngine);// Filter out unwanted inputs on the page, e.g. <input name="junk" />, so that only the map to be stored in the database is returned
				valueMap.put("TaobaoSite_Url", url);// Finally add the URL, which cannot be extracted from the page content itself

				// Begin: store to the database
				IBasicSqlDao dao = BeanMapDaoUsingCommonsBeanutils.getInstance();
				dao.storeMapToDb(valueMap, sqlId, MerchandiseItem.class);
				// End: store to the database
				for(Map.Entry<String, String> entry : valueMap.entrySet()){
					System.out.println(entry.getKey()+":"+entry.getValue());
				}
				System.out.println("---Page "+url+" extracted---");
				// Can be removed when refactoring
				valueMap.clear();// clear the map
			}
			Context sellerContext = new UrlContext(seller_Url);
			sellerContext.put("Seller_Nickname", seller_Nickname);
			sellerContext.put("Seller_Taobao_Id", seller_Taobao_Id);
			EmergencyLinkUrl.addUnVisitUrl(sellerContext, this);
			
			
			return true;
		}
		return false;
	}

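It is clear that, in terms of programming style, Jericho has these main advantages:

No inner classes. Because HtmlParser relies on filters or the visitor pattern, it unavoidably introduces inner classes.
Generics are used throughout, which simplifies the code. HtmlParser is not built on generics, probably because it was created earlier, but it makes heavy use of design patterns, which shows how well its author knows them.
Plain page text can be extracted directly, with the HTML tags stripped. This is very common in full-text search.
The Tag Search Methods work in a way similar to XPath and can extract Element objects any number of levels down without restriction.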

The advantages observed so far give us reason to believe that Jericho will become a leader in HTML parsing. Here is a good article on Jericho HTML Parser that I recommend:

http://www.ehelper.com.cn/blog/post/httpclient-jericho.html

I looked into this further and found that the jsoup HTML parser is better than HtmlParser,
and that HtmlUnit is easier to use than HttpClient.
