Learning Java Crawlers: Building a Simple Java Crawler with the HttpClient and Jsoup Libraries

Building a simple Java crawler with the HttpClient and Jsoup libraries

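Introduction to HttpClient

HttpClient started as a subproject of Apache Jakarta Commons and is now part of the Apache HttpComponents project. It provides an efficient, up-to-date, feature-rich client-side toolkit for the HTTP protocol; the 4.5.x line used in this article implements HTTP/1.0 and HTTP/1.1. Its main features are:

  • (1) Implements all HTTP methods (GET, POST, PUT, HEAD, etc.)
  • (2) Supports automatic redirects
  • (3) Supports the HTTPS protocol
  • (4) Supports proxy servers, etc.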
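To make the first two features concrete, here is a minimal offline sketch. It assumes HttpClient 4.5.x on the classpath and uses the JDK's built-in com.sun.net.httpserver.HttpServer as a throwaway local server; the class name RedirectDemo and the /old and /new paths are hypothetical and exist only for this illustration.

```java
import com.sun.net.httpserver.HttpServer;
import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

class RedirectDemo {
    /** Starts a local server on a free port: /old answers 302 to /new, and /new returns "hello". */
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        String base = "http://127.0.0.1:" + server.getAddress().getPort();
        server.createContext("/old", exchange -> {
            exchange.getResponseHeaders().add("Location", base + "/new");
            exchange.sendResponseHeaders(302, -1); // -1 means no response body
            exchange.close();
        });
        server.createContext("/new", exchange -> {
            byte[] body = "hello".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    /** Sends a GET request and returns the body; the default client follows the redirect itself. */
    static String fetch(String url) throws IOException {
        HttpGet get = new HttpGet(url);
        get.setConfig(RequestConfig.custom().setSocketTimeout(5000).setConnectTimeout(5000).build());
        // try-with-resources closes the response first and then the client
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(get)) {
            HttpEntity entity = response.getEntity();
            return entity == null ? "" : EntityUtils.toString(entity, StandardCharsets.UTF_8);
        }
    }

    /** Requests /old on the local server and returns whatever body the client ends up with. */
    static String demo() {
        try {
            HttpServer server = startServer();
            try {
                return fetch("http://127.0.0.1:" + server.getAddress().getPort() + "/old");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Calling RedirectDemo.demo() requests /old, and the client transparently follows the 302 to /new, so the caller only sees the final body.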

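Introduction to Jsoup

jsoup is a Java HTML parser that can parse HTML directly from a URL or from HTML text. It provides a very convenient API for extracting and manipulating data through DOM traversal, CSS selectors, and jQuery-like methods. Its main features are:
- (1) Parse HTML from a URL, file, or string;
- (2) Find and extract data using DOM traversal or CSS selectors;
- (3) Manipulate HTML elements, attributes, and text.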
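The three features above can be sketched on an in-memory string, so no network access is needed. This assumes jsoup is on the classpath; the class name JsoupBasics and the sample markup are hypothetical and not taken from any real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupBasics {
    // Hypothetical sample markup used only for this illustration
    static final String HTML =
            "<html><body><div id=\"main\">"
            + "<a class=\"link\" href=\"/a.html\">First</a>"
            + "<a class=\"link\" href=\"/b.html\">Second</a>"
            + "</div></body></html>";

    /** Parses the string, finds the first link with a CSS selector, and returns "href|text". */
    static String firstLink() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("#main a.link").first();
        return link.attr("href") + "|" + link.text();
    }

    /** Rewrites the text and the href of the first link, then returns its outer HTML. */
    static String rewriteFirstLink() {
        Document doc = Jsoup.parse(HTML);
        Element link = doc.select("a.link").first();
        link.text("Changed").attr("href", "/c.html");
        return link.outerHtml();
    }
}
```

firstLink() covers parsing and selecting, while rewriteFirstLink() shows that elements are mutable: the changed text and attribute show up when the element is serialized again.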

Usage steps

Add the dependencies to the Maven project

The dependencies in pom.xml are as follows:

<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.2</version>
</dependency>

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
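
The test class in the next section imports org.junit.Test, so JUnit 4 also has to be on the test classpath. A possible declaration, assuming version 4.12 (the original article does not say which version it used):

```xml
<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>
```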

Write the JUnit test code

Code


import org.apache.http.HttpEntity;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.junit.Test;

import java.util.List;

/**
 * HttpClient & Jsoup library test class
 *
 * Created by xuyh at 2017/11/6 15:28.
 */
public class HttpClientJsoupTest {
    @Test
    public void test() {
        // Fetch the page with HttpClient and read the response body as plain text
        HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
        CloseableHttpClient httpClient = null;
        CloseableHttpResponse response = null;

        String responseStr = "";
        try {
            httpClient = HttpClientBuilder.create().build();
            HttpClientContext context = HttpClientContext.create();
            response = httpClient.execute(httpGet, context);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // Keep the body only for a 200 response; otherwise responseStr stays empty
            if (state == 200 && entity != null)
                responseStr = EntityUtils.toString(entity, "utf-8");
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null)
                    response.close();
                if (httpClient != null)
                    httpClient.close();
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }

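        if (responseStr == null || responseStr.isEmpty())
            return;

        // Parse the plain text into a Jsoup Document and work with its elements
        Document document = Jsoup.parse(responseStr);
        // first() returns null when nothing matches, so guard before chaining
        Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
        if (container == null)
            return;
        List<Element> elements = container.getElementsByAttributeValue("class", "phdnews_hdline");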
        elements.forEach(element -> {
            for (Element e : element.getElementsByTag("a")) {
                System.out.println(e.attr("href"));
                System.out.println(e.text());
            }
        });
    }
}
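
The same extraction can also be written with a single CSS selector, which returns an empty result instead of null when nothing matches. The sketch below is an alternative under assumptions, not the author's code: the class name HeadlineExtractor is hypothetical, and only the two CSS class names come from the test above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class HeadlineExtractor {
    /** Returns "href text" for every link inside the headline blocks; an empty list if none match. */
    static List<String> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        // ".phdnews_txt.fr" matches an element carrying both classes, in any order
        for (Element a : document.select(".phdnews_txt.fr .phdnews_hdline a")) {
            links.add(a.attr("href") + " " + a.text());
        }
        return links;
    }
}
```

Unlike getElementsByAttributeValue("class", "phdnews_txt fr"), which compares the whole attribute string, the selector form keeps working if the page adds another class or reorders the existing ones.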

Detailed explanation

  • Create an HttpGet object that will send a GET request to http://sports.sina.com.cn/, and set both the socket timeout and the connection timeout to 30000 ms.
HttpGet httpGet = new HttpGet("http://sports.sina.com.cn/");
httpGet.setConfig(RequestConfig.custom().setSocketTimeout(30000).setConnectTimeout(30000).build());
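  • Create a CloseableHttpClient through HttpClientBuilder and execute the HttpGet request together with a new HttpClientContext, which carries per-request state such as cookies. The call returns a CloseableHttpResponse, and the response text is read from its HttpEntity with EntityUtils.toString.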
CloseableHttpClient httpClient = null;
CloseableHttpResponse response = null;

String responseStr = "";
try {
    httpClient = HttpClientBuilder.create().build();
    HttpClientContext context = HttpClientContext.create();

    response = httpClient.execute(httpGet, context);

    int state = response.getStatusLine().getStatusCode();
    HttpEntity entity = response.getEntity();
    // Keep the body only for a 200 response; otherwise responseStr stays empty
    if (state == 200 && entity != null)
        responseStr = EntityUtils.toString(entity, "utf-8");
} catch (Exception e) {
    e.printStackTrace();
} finally {
    try {
        if (response != null)
            response.close();
        if (httpClient != null)
            httpClient.close();
    } catch (Exception ex) {
        ex.printStackTrace();
    }
}
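  • Parse the response text with the Jsoup library and pick out the elements of interest
Document document = Jsoup.parse(responseStr);

// first() returns null when nothing matches, so guard before chaining
Element container = document.getElementsByAttributeValue("class", "phdnews_txt fr").first();
if (container == null)
    return;
List<Element> elements = container.getElementsByAttributeValue("class", "phdnews_hdline");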

elements.forEach(element -> {
    for (Element e : element.getElementsByTag("a")) {
        System.out.println(e.attr("href"));
        System.out.println(e.text());
    }
});
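  • Jsoup's Document class extends org.jsoup.nodes.Element, so the two share many methods, including:
public Element getElementById(String id);// get an element by id
public Elements getElementsByClass(String className);// get elements by class name
public Elements getElementsByAttributeValue(String key, String value);// get elements by attribute value
public Elements getElementsByTag(String tagName);// get elements by tag name
public String attr(String attributeKey);// get the value of an attribute of this element
public String text();// get the text content of this element
  • The structure of an HTML element is as follows:
<div class="code">  <!--div is the element's tag--> <!--class="code" is the element's attribute and its value-->
    <div>
        <br>
            This is the first paragraph.    <!--the element's content-->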
        <br>
    </div>
</div>
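
To tie the method list to the element anatomy, the following sketch calls those methods on a small variant of the markup above. The id attribute and the <p> tag are additions made for this illustration, and the class name ElementAnatomy is hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementAnatomy {
    // Variant of the sample markup; the id and the <p> tag are assumptions for this demo
    static final String HTML =
            "<div id=\"outer\" class=\"code\"><div><p>This is the first paragraph.</p></div></div>";

    /** Returns "tag|class|text|matches by class|number of div tags" for the outer element. */
    static String describe() {
        Document doc = Jsoup.parse(HTML);
        Element outer = doc.getElementById("outer");
        String tag = outer.tagName();       // the element's tag
        String cls = outer.attr("class");   // the value of the class attribute
        String text = outer.text();         // combined text of the element and its descendants
        int byClass = doc.getElementsByClass("code").size();
        int divs = doc.getElementsByTag("div").size();
        return tag + "|" + cls + "|" + text + "|" + byClass + "|" + divs;
    }
}
```

For this input, describe() returns "div|code|This is the first paragraph.|1|2": one element carries the class code, and the document contains two div tags.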

Output

  • The program prints the following
http://sports.sina.com.cn/sportsevents/3v3/2017-11-05/doc-ifynmzrs7218551.shtml
3X3黃金聯賽冠軍賽山西隊奪冠!獨享48
http://video.sina.com.cn/sports/k/cba/1105final3x3/
視頻
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/181467390769.html
黃金mvp集錦
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/170167390621.html
直搗黃龍1v2
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/183267390917.html
5佳球:庫裏式虛晃
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/150067390331.html
大嫂徐鼕鼕亮相
http://video.sina.com.cn/p/sports/k/v/doc/2017-11-05/145367390313.html
現場衆多美女雲集
http://video.sina.com.cn/p/sports/c/zj/v/doc/2017-11-05/150867390337.html
啦啦隊熱舞表演
http://sports.sina.com.cn/nba/
哈登56分周琦暴扣火箭勝
http://sports.sina.com.cn/basketball/nba/2017-11-06/doc-ifynmzrs7300047.shtml
詹皇26分騎士負
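  • The region of the page that was crawled is shown in the figure below:

[Figure: screenshot of the crawled region of the page]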

Write a utility class

HttpClient and Jsoup can be wrapped into a utility class, as follows:


import org.apache.http.HttpEntity;
import org.apache.http.NameValuePair;
import org.apache.http.client.CookieStore;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.entity.UrlEncodedFormEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.cookie.Cookie;
import org.apache.http.entity.ContentType;
import org.apache.http.entity.StringEntity;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicNameValuePair;
import org.apache.http.ssl.SSLContextBuilder;
import org.apache.http.util.EntityUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import javax.net.ssl.*;
import java.io.IOException;
import java.security.GeneralSecurityException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * <pre>
 * HTTP utilities, including:
 * a general HTTP request tool (uses HttpClient to send http and https requests)
 * </pre>
 * Created by xuyh at 2017/7/17 19:08.
 */
public class HttpUtils {
    /**
     * Request timeout, 20000 ms by default
     */
    private int timeout = 20000;
    /**
     * Cookie map
     */
    private Map<String, String> cookieMap = new HashMap<>();

    /**
     * Charset used to decode response bodies, UTF-8 by default
     */
    private String charset = "UTF-8";

    // Eagerly created singleton; class initialization makes the creation thread-safe
    private static final HttpUtils httpUtils = new HttpUtils();

    private HttpUtils() {
    }

    /**
     * Get the singleton instance
     *
     * @return
     */
    public static HttpUtils getInstance() {
        return httpUtils;
    }

    /**
     * Clear the cookieMap
     */
    public void invalidCookieMap() {
        cookieMap.clear();
    }

    public int getTimeout() {
        return timeout;
    }

    /**
     * Set the request timeout
     *
     * @param timeout
     */
    public void setTimeout(int timeout) {
        this.timeout = timeout;
    }

    public String getCharset() {
        return charset;
    }

    /**
     * Set the charset used for requests
     *
     * @param charset
     */
    public void setCharset(String charset) {
        this.charset = charset;
    }

    /**
     * Parse the HTML of a page into a Document, with non-breaking-space entities removed
     * 
     * @param html
     * @return
     * @throws Exception
     */
    public static Document parseHtmlToDoc(String html) throws Exception {
        return removeHtmlSpace(html);
    }

    private static Document removeHtmlSpace(String str) {
        Document doc = Jsoup.parse(str);
        String result = doc.html().replace("&nbsp;", "");
        return Jsoup.parse(result);
    }

    /**
     * Execute a GET request and return the result as a Document
     *
     * @param url
     * @return
     * @throws Exception
     */
    public Document executeGetAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGet(url));
    }

    /**
     * Execute a GET request
     *
     * @param url
     * @return
     * @throws Exception
     */
    public String executeGet(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        HttpClientContext context = HttpClientContext.create();
        String str = "";
        // try-with-resources closes the response and then the client, even if an exception is thrown
        try (CloseableHttpClient httpClient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // A 404 response yields an empty string; any other body is decoded with the configured charset
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
        return str;
    }

    /**
     * Execute a GET request over HTTPS and return the result as a Document
     *
     * @param url
     * @return
     * @throws Exception
     */
    public Document executeGetWithSSLAsDocument(String url) throws Exception {
        return parseHtmlToDoc(executeGetWithSSL(url));
    }

    /**
     * Execute a GET request over HTTPS
     *
     * @param url
     * @return
     * @throws Exception
     */
    public String executeGetWithSSL(String url) throws Exception {
        HttpGet httpGet = new HttpGet(url);
        httpGet.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpGet.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        HttpClientContext context = HttpClientContext.create();
        String str = "";
        // try-with-resources closes the response and then the client, even if an exception is thrown
        try (CloseableHttpClient httpClient = createSSLInsecureClient();
             CloseableHttpResponse response = httpClient.execute(httpGet, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            int state = response.getStatusLine().getStatusCode();
            HttpEntity entity = response.getEntity();
            // A 404 response yields an empty string; any other body is decoded with the configured charset
            if (state != 404 && entity != null) {
                str = EntityUtils.toString(entity, charset);
            }
        }
        return str;
    }

    /**
     * Execute a POST request and return the result as a Document
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public Document executePostAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePost(url, params));
    }

    /**
     * Execute a POST request with form parameters
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public String executePost(String url, Map<String, String> params) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        httpPost.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes both the response and the client
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            // Some responses carry no body, so guard against a null entity
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /**
     * Execute a POST request over HTTPS and return the result as a Document
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public Document executePostWithSSLAsDocument(String url, Map<String, String> params) throws Exception {
        return parseHtmlToDoc(executePostWithSSL(url, params));
    }

    /**
     * Execute a POST request with form parameters over HTTPS
     *
     * @param url
     * @param params
     * @return
     * @throws Exception
     */
    public String executePostWithSSL(String url, Map<String, String> params) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        List<NameValuePair> paramsRe = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            paramsRe.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new UrlEncodedFormEntity(paramsRe));
        HttpClientContext contextRe = HttpClientContext.create();
        // try-with-resources closes both the response and the client
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    /**
     * Send a POST request with a JSON body
     *
     * @param url request URL
     * @param jsonBody JSON body
     * @return
     * @throws Exception
     */
    public String executePostWithJson(String url, String jsonBody) throws Exception {
        String reStr = "";
        HttpPost httpPost = new HttpPost(url);
        httpPost.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        httpPost.setHeader("Cookie", convertCookieMapToString(cookieMap));
        httpPost.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext context = HttpClientContext.create();
        // try-with-resources closes both the response and the client
        try (CloseableHttpClient httpclient = HttpClientBuilder.create().build();
             CloseableHttpResponse response = httpclient.execute(httpPost, context)) {
            getCookiesFromCookieStore(context.getCookieStore(), cookieMap);
            HttpEntity entity = response.getEntity();
            // Some responses carry no body, so guard against a null entity
            if (entity != null) {
                reStr = EntityUtils.toString(entity, charset);
            }
        }
        return reStr;
    }

    /**
     * Send a POST request with a JSON body over HTTPS
     *
     * @param url request URL
     * @param jsonBody JSON body
     * @return
     * @throws Exception
     */
    public String executePostWithJsonAndSSL(String url, String jsonBody) throws Exception {
        String re = "";
        HttpPost post = new HttpPost(url);
        post.setHeader("Cookie", convertCookieMapToString(cookieMap));
        post.setConfig(RequestConfig.custom().setSocketTimeout(timeout).setConnectTimeout(timeout).build());
        post.setEntity(new StringEntity(jsonBody, ContentType.APPLICATION_JSON));
        HttpClientContext contextRe = HttpClientContext.create();
        // try-with-resources closes both the response and the client
        try (CloseableHttpClient httpClientRe = createSSLInsecureClient();
             CloseableHttpResponse response = httpClientRe.execute(post, contextRe)) {
            HttpEntity entity = response.getEntity();
            if (entity != null) {
                re = EntityUtils.toString(entity, charset);
            }
            getCookiesFromCookieStore(contextRe.getCookieStore(), cookieMap);
        }
        return re;
    }

    private void getCookiesFromCookieStore(CookieStore cookieStore, Map<String, String> cookieMap) {
        List<Cookie> cookies = cookieStore.getCookies();
        for (Cookie cookie : cookies) {
            cookieMap.put(cookie.getName(), cookie.getValue());
        }
    }

    private String convertCookieMapToString(Map<String, String> map) {
        String cookie = "";
        for (String key : map.keySet()) {
            cookie += (key + "=" + map.get(key) + "; ");
        }
        if (map.size() > 0) {
            cookie = cookie.substring(0, cookie.length() - 2);
        }
        return cookie;
    }

    /**
     * Create an HTTPS-capable client that trusts every certificate and skips hostname verification.
     * This disables TLS validation, so it should only be used against hosts you already trust.
     *
     * @return
     * @throws GeneralSecurityException
     */
    private static CloseableHttpClient createSSLInsecureClient() throws GeneralSecurityException {
        // Trust strategy that accepts any certificate chain
        SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, (chain, authType) -> true).build();
        // Hostname verifier that accepts any host name
        SSLConnectionSocketFactory sslConnectionSocketFactory = new SSLConnectionSocketFactory(sslContext,
                (hostname, session) -> true);
        return HttpClients.custom().setSSLSocketFactory(sslConnectionSocketFactory).build();
    }
}

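The utility class above can be used not only to fetch web page content but also to send general HTTP requests, including form and JSON POSTs over http or https.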
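
The cookie handling inside the utility class comes down to joining name=value pairs with "; " before they are placed in the Cookie header. A minimal standalone sketch of that step using only the JDK (the class name CookieHeader is hypothetical and not part of the original source):

```java
import java.util.Map;
import java.util.stream.Collectors;

class CookieHeader {
    /** Joins the entries as "name=value" separated by "; "; returns "" for an empty map. */
    static String toHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }
}
```

Collectors.joining inserts the separator only between entries, so no trailing "; " has to be trimmed afterwards. The order of the pairs follows the map's iteration order, so a LinkedHashMap keeps them in insertion order.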

Source code

https://github.com/johnsonmoon/HttpUtils.git