Crawler Notes (3): Simulating Login to Obtain a Cookie and Access the Private Messages Page

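Following the previous post, Crawler Notes (2): Crawling and Saving the Images on a Single Page, today we use HttpClient to simulate a form login to OSChina (開源中國), obtain the cookie, and then use that cookie to access the personal private-message page.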

1. Preparation

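To simulate a form login, we first need the login URL and the names of the login form fields. In Figure 1 we deliberately enter a wrong username and password; the Network panel in Figure 2 then shows that the login URL is https://www.oschina.net/action/user/hash_login?from= and that the fields are email for the account and pwd for the password.
The password is also transformed before it is sent. We will not worry about how for now: as long as the correct password is entered in the form, the string shown in the request can be used as is.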
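
To make the shape of that request body concrete, here is a minimal sketch of how two form fields are serialized as application/x-www-form-urlencoded. Only the field names email and pwd come from the Network panel described above; the class name FormBodySketch, the method encodeForm, and the sample values are assumptions made up for illustration.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original CrawlerUtils.
class FormBodySketch {

    // Joins the fields as key=value pairs separated by '&',
    // URL-encoding every key and value (requires Java 10+).
    static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the insertion order of the fields.
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("email", "user@example.com");            // placeholder account
        fields.put("pwd", "hashed-value-from-devtools");    // placeholder for the copied string
        // prints: email=user%40example.com&pwd=hashed-value-from-devtools
        System.out.println(encodeForm(fields));
    }
}
```

With commons-httpclient the same encoding is done by the library when the NameValuePair array is set as the request body, so this sketch is only meant to show what ends up on the wire.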


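Figure 1: the login form with a deliberately wrong username and password (image not available)

Figure 2: the browser's Network panel showing the login request (image not available)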

2. Updating the crawler utility class CrawlerUtils

We add a post method that carries a request body, and extend the get method so that a cookie can be passed in as a request header.

package com.dyw.crawler.util;

import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

/**
 * Crawler utility class
 * Created by dyw on 2017/9/1.
 */
public class CrawlerUtils {
    /**
     * Sets the common request headers on an HTTP request.
     *
     * @param httpMethod the HTTP request method
     */
    private static void setHead(HttpMethod httpMethod) {
        setHead(httpMethod, null);
    }

    /**
     * Sets the request headers on an HTTP request, including custom ones.
     *
     * @param httpMethod the HTTP request method
     * @param map        custom request headers, may be null
     */
    private static void setHead(HttpMethod httpMethod, Map<String, String> map) {
        // Common headers (the values a site expects differ from site to site)
        httpMethod.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36");
        httpMethod.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
        httpMethod.setRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        // Apply custom headers last so that they can override the common ones
        if (null != map && !map.isEmpty()) {
            map.forEach(httpMethod::setRequestHeader);
        }
    }


    /**
     * Sets the login form fields as the body of a POST request.
     *
     * @param postMethod the POST method
     * @param loginInfo  the form fields to send as the request body
     */
    private static void setBody(PostMethod postMethod, NameValuePair[] loginInfo) {
        postMethod.setRequestBody(loginInfo);
    }

    /**
     * Fetches a page and returns its HTML as a string (GET).
     *
     * @param url the URL to fetch
     * @return the whole page as a String, or null if the status is not 200
     */
    public static String get(String url) throws Exception {
        return get(url, null);
    }

    /**
     * Fetches a page and returns its HTML as a string (GET), sending custom headers.
     *
     * @param url the URL to fetch
     * @param map custom request headers
     * @return the whole page as a String, or null if the status is not 200
     */
    public static String get(String url, Map<String, String> map) throws Exception {
        String html = null;
        HttpClient httpClient = new HttpClient();
        HttpMethod httpMethod = new GetMethod(url);
        setHead(httpMethod, map);
        try {
            int status = httpClient.executeMethod(httpMethod);
            if (status == HttpStatus.SC_OK) {
                html = httpMethod.getResponseBodyAsString();
            }
        } finally {
            // Release the connection once the body has been read
            httpMethod.releaseConnection();
        }
        return html;
    }

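    /**
     * Logs in and returns the resulting cookies (POST).
     *
     * @param url       the login URL
     * @param loginInfo the login form fields
     * @return the cookies, joined into a single string for the Cookie header
     */
    public static String post(String url, NameValuePair[] loginInfo) throws Exception {
        HttpClient httpClient = new HttpClient();
        // Simulate the login; use whichever request method the server expects (POST here)
        PostMethod postMethod = new PostMethod(url);
        setHead(postMethod);
        setBody(postMethod, loginInfo);
        httpClient.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
        try {
            httpClient.executeMethod(postMethod);
        } finally {
            // Release the connection; the cookies are already stored in the client state
            postMethod.releaseConnection();
        }
        // Collect the cookies set by the login response
        Cookie[] cookies = httpClient.getState().getCookies();
        StringBuilder cookie = new StringBuilder();
        for (Cookie c : cookies) {
            cookie.append(c.toString()).append(";");
        }
        return cookie.toString();
    }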

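    /**
     * Opens a URL and returns the response body as a stream (GET).
     *
     * @param urlStr the URL to fetch
     * @return the response body as an InputStream, or null if the status is not 200
     */
    public static InputStream downLoadFromUrl(String urlStr) throws IOException {
        // Use HttpClient instead of URLConnection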
//        HttpHost proxy=new HttpHost("116.226.217.54", 9999);
        HttpClient httpClient = new HttpClient();
        HttpMethod httpMethod = new GetMethod(urlStr);
//        HostConfiguration hostConfiguration = new HostConfiguration();
//        hostConfiguration.setHost("116.226.217.54", 9999);
//        httpClient.setHostConfiguration(hostConfiguration);
        setHead(httpMethod);
        int status = httpClient.executeMethod(httpMethod);
        InputStream responseBodyAsStream = null;
        if (status == HttpStatus.SC_OK) {
            responseBodyAsStream = httpMethod.getResponseBodyAsStream();
        }
        return responseBodyAsStream;
    }
}
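
The post method above flattens the cookies held by the client into one string that is later sent as the Cookie request header. As a library-free illustration of that step, here is a minimal sketch that joins name/value pairs with "; ", the separator used in the Cookie header format. The class name CookieHeaderSketch, the method toCookieHeader, and the cookie names are hypothetical and do not come from the site or from commons-httpclient.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original CrawlerUtils.
class CookieHeaderSketch {

    // Builds a Cookie header value such as "a=1; b=2" from name/value pairs.
    static String toCookieHeader(Map<String, String> cookies) {
        return cookies.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("; "));
    }

    public static void main(String[] args) {
        // Made-up cookie names and values, in insertion order.
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("session_id", "abc123");
        cookies.put("user", "42");
        // prints: session_id=abc123; user=42
        System.out.println(toCookieHeader(cookies));
    }
}
```

Joining without a trailing separator is a small difference from the loop in post, which appends ";" after every cookie, including the last one.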

3. The main method

package com.dyw.crawler.project;

import com.dyw.crawler.util.CrawlerUtils;
import org.apache.commons.httpclient.NameValuePair;

import java.util.HashMap;
import java.util.Map;

/**
 * Simulated login
 * Created by dyw on 2017/9/5.
 */
public class Project2 {

    public static void main(String[] args) {
        // Login URL of the OSChina site
        String loginUrl = "https://www.oschina.net/action/user/hash_login?from=";
        // Personal private-message page; only accessible after logging in
        String dataUrl = "https://my.oschina.net/u/3673710/admin/inbox";
        // Login form fields: account and password (replace the placeholders with your own values)
        NameValuePair[] loginInfo = {new NameValuePair("email", "account"),
                new NameValuePair("pwd", "password")};
        try {
            String cookie = CrawlerUtils.post(loginUrl, loginInfo);
            Map<String, String> map = new HashMap<>();
            map.put("Cookie", cookie);
            String html = CrawlerUtils.get(dataUrl, map);
            System.out.println(html);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

With this, we can fetch the content of the private-message page.
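
For readers without commons-httpclient on the classpath, the same login-then-fetch flow can be sketched with the JDK's own java.net.http.HttpClient (Java 11+), where a CookieManager carries the cookies from the login response over to the next request. This is a swapped-in technique, not the approach used in this post. The two URLs and the field names email and pwd are taken from the code above; the class name LoginFlowSketch and the method loginRequest are hypothetical, and whether the site accepts this exact request today is an assumption the sketch does not verify.

```java
import java.net.CookieManager;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical alternative to CrawlerUtils.post + CrawlerUtils.get.
class LoginFlowSketch {
    static final String LOGIN_URL = "https://www.oschina.net/action/user/hash_login?from=";
    static final String DATA_URL = "https://my.oschina.net/u/3673710/admin/inbox";

    // Builds the form-encoded login POST with the fields email and pwd.
    static HttpRequest loginRequest(String email, String pwd) {
        String body = "email=" + URLEncoder.encode(email, StandardCharsets.UTF_8)
                + "&pwd=" + URLEncoder.encode(pwd, StandardCharsets.UTF_8);
        return HttpRequest.newBuilder(URI.create(LOGIN_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: java LoginFlowSketch.java <email> <pwd>");
            return;
        }
        // The CookieManager stores cookies from responses and attaches
        // matching ones to later requests sent through the same client.
        HttpClient client = HttpClient.newBuilder()
                .cookieHandler(new CookieManager())
                .build();

        HttpResponse<String> login = client.send(
                loginRequest(args[0], args[1]), HttpResponse.BodyHandlers.ofString());
        System.out.println("login status: " + login.statusCode());

        // Same client, so the login cookies (if any were set) go along.
        HttpRequest inbox = HttpRequest.newBuilder(URI.create(DATA_URL)).GET().build();
        HttpResponse<String> page = client.send(inbox, HttpResponse.BodyHandlers.ofString());
        System.out.println(page.statusCode() == 200
                ? page.body()
                : "inbox status: " + page.statusCode());
    }
}
```

Because the client keeps the cookies itself, there is no need to build the Cookie header by hand as Project2 does.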

The full code is on GitHub for anyone who wants to download it: https://github.com/dingyinwu81/crawler

If you have any suggestions for improving the code, please leave me a comment! ☺☺☺
