A Simple Implementation of Web Page Fetching in Java

Original

2018-09-03 14:37

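I have recently been writing a crawler in Java. I started with httpclient but gradually found that some features could not be implemented with it, so I wrote a crawler on top of Java's own net package that does basic page fetching. It handles browser disguise and gzip decoding, two problems that troubled me for quite a while. The code is as follows: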

/**
 * @author houlaizhexq
 * @function: A crawler built on Java's own net package; handles browser disguise, gzip decoding and similar issues
 * @time: Friday, December 6, 2013
 */

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.zip.GZIPInputStream;

public class Test5 {
	public static void main(String args[]){
		URL url;
		try {
			url = new URL("http://www.baidu.com");
			HttpURLConnection httpClientConnection=(HttpURLConnection) url.openConnection();
			
//			Imitate the request headers sent by IE so that sites do not block the fetch
//			Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*
//			Accept-Language: zh-cn
//			Accept-Encoding: gzip, deflate
//			User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
//			Host: 192.168.109.130
//			Connection: close
			httpClientConnection.addRequestProperty("Accept", "image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*");
			httpClientConnection.addRequestProperty("Accept-Language", "zh-cn");
			//Advertise only gzip, because gzip is the only encoding this program decodes
			httpClientConnection.addRequestProperty("Accept-Encoding", "gzip");
			httpClientConnection.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
			httpClientConnection.addRequestProperty("Connection", "close");
			
			//Use the GET method (GET is already the default)
			httpClientConnection.setRequestMethod("GET");
//			//For something like a form submission, use POST instead, as follows
//			//Data has to be written to the request, so enable output
//			httpClientConnection.setDoOutput(true);
//			httpClientConnection.setRequestMethod("POST");
//			//Set the form data
//			String username="user=admin";
//			httpClientConnection.getOutputStream().write(username.getBytes());
//			//Send the data to the server
//			httpClientConnection.getOutputStream().flush();
//			//Close the output stream
//			httpClientConnection.getOutputStream().close();
			
			httpClientConnection.connect();
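			//Print the status code and the response headers
			//The loop starts at 1 because some implementations return the status line at index 0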
			System.out.println(httpClientConnection.getResponseCode());
			for(int j=1;;j++){
				String header=httpClientConnection.getHeaderField(j);
				if(header==null){
					break;
				}
				System.out.println(httpClientConnection.getHeaderFieldKey(j)+":"+header);
			}
			
			//Read the response body
			//Start from the raw stream, so the body is still read when there is no Content-Encoding header
			InputStream inputStream=httpClientConnection.getInputStream();
			
			//If Content-Encoding is gzip, wrap the stream so it is decompressed; otherwise no conversion is needed
			String encode=httpClientConnection.getContentEncoding();
			if(encode!=null && encode.toLowerCase().indexOf("gzip") >= 0){
				inputStream = new GZIPInputStream(inputStream);
			}
			
			//The body is assumed to be UTF-8 text
			//try-with-resources closes the reader and the underlying stream even if reading fails
			try(BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream,"UTF-8"))){
				String line = null;
				while ((line = bufferedReader.readLine()) != null) {
					System.out.println(line);
				}
			}
			
			//Disconnect
			httpClientConnection.disconnect();
			
		} catch (MalformedURLException e) {
			e.printStackTrace();
		} catch (IOException e) {
			e.printStackTrace();
		}				
	}
}
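
The commented-out POST branch above writes the literal string "user=admin" as the form body. When field values contain spaces or characters such as "&", they need to be URL-encoded first. Below is a minimal sketch of how such a body could be built; the class name FormBodySketch, the method encodeForm and the sample field "note" are made up for illustration and do not come from the original code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the original program
class FormBodySketch {
    // Builds an application/x-www-form-urlencoded body from name/value pairs
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"));
                body.append('=');
                body.append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a Java platform, so this is not expected
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the fields in insertion order
        Map<String, String> fields = new LinkedHashMap<String, String>();
        fields.put("user", "admin");
        fields.put("note", "a b&c");
        System.out.println(encodeForm(fields)); // prints user=admin&note=a+b%26c
    }
}
```

The resulting string could then be written to the connection's output stream in place of the hard-coded "user=admin".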

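Note: gzip is there to handle compression, which reduces traffic. Whether the server returns gzip depends mainly on the Accept-Encoding field of the request you send. Browsers decode gzip automatically and declare support for it, so they generally get gzip back. If your own crawler does not declare that it accepts gzip, the server returns the body uncompressed, and the response headers you capture will not contain a content-encoding:gzip field.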
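
To try the decoding rule without touching the network, the sketch below compresses a string in memory and then decodes it according to a Content-Encoding value, the same decision the crawler makes on a real response. The class GzipBodySketch and its methods are hypothetical names introduced only for this illustration, and the body is assumed to be UTF-8 text.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper for illustration only; not part of the original program
class GzipBodySketch {
    // Wraps the raw stream in GZIPInputStream only when the header value mentions gzip
    static InputStream decode(InputStream raw, String contentEncoding) throws IOException {
        if (contentEncoding != null && contentEncoding.toLowerCase().contains("gzip")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    // Stands in for a server that compresses its response body
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Decodes a body according to its Content-Encoding and returns it as text
    static String bodyAsText(byte[] body, String contentEncoding) {
        try (InputStream in = decode(new ByteArrayInputStream(body), contentEncoding)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String page = "<html>hello</html>";
        // Response with Content-Encoding: gzip
        System.out.println(bodyAsText(gzip(page), "gzip"));
        // Response without a Content-Encoding header
        System.out.println(bodyAsText(page.getBytes(StandardCharsets.UTF_8), null));
    }
}
```

Both calls print the same page text, which is the point of the rule: the crawler has to look at Content-Encoding rather than assume one format.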
