Following on from the previous post, Crawler Notes (2): Crawling a Page's Images and Saving Them, today we use HttpClient to simulate a form login to OSChina, capture the session cookie, and then use that cookie to access the personal private-message page.
1. Preparation
To simulate a form login, we first need the login URL and the names of the form fields. In Figure 1 we deliberately enter a wrong username and password, then inspect the Network tab shown in Figure 2. The login URL turns out to be https://www.oschina.net/action/user/hash_login?from=, with the account sent in the email field and the password in the pwd field.
The password is hashed before it is sent; we can ignore how for now. As long as the password we typed is correct, we can simply copy the hashed string shown in the screenshot.
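Under the hood, an application/x-www-form-urlencoded POST body is just the field names and values, percent-encoded and joined with &. Below is a minimal standalone sketch of what HttpClient builds from the email/pwd pairs; the credential values are made up for illustration:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FormBodySketch {
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    static String encodeForm(String[][] fields) throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (String[] field : fields) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(URLEncoder.encode(field[0], "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(field[1], "UTF-8"));
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder account and hashed-password values
        String body = encodeForm(new String[][]{
                {"email", "user@example.com"},
                {"pwd", "5f4dcc3b"}
        });
        System.out.println(body);
        // email=user%40example.com&pwd=5f4dcc3b
    }
}
```

This is the same encoding that `PostMethod.setRequestBody(NameValuePair[])` performs for us later.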
Figure 1
Figure 2
2. Updating the crawler utility class CrawlerUtils
We add a post method that sends a request body, and extend the get method so that custom headers such as a cookie can be attached.
package com.dyw.crawler.util;

import org.apache.commons.httpclient.Cookie;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.httpclient.cookie.CookiePolicy;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

/**
 * Crawler utility class.
 * Created by dyw on 2017/9/1.
 */
public class CrawlerUtils {

    /**
     * Set the default request headers on an HTTP method.
     *
     * @param httpMethod the HTTP method
     */
    private static void setHead(HttpMethod httpMethod) {
        setHead(httpMethod, null);
    }

    /**
     * Set custom request headers on an HTTP method.
     *
     * @param httpMethod the HTTP method
     * @param map        custom headers
     */
    private static void setHead(HttpMethod httpMethod, Map<String, String> map) {
        // Apply custom headers if any were passed in
        if (null != map && map.size() > 0) {
            map.keySet().forEach(key -> httpMethod.setRequestHeader(key, map.get(key)));
        }
        // Common headers (different sites may expect different values)
        httpMethod.setRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36");
        httpMethod.setRequestHeader("Content-Type", "application/x-www-form-urlencoded");
        httpMethod.setRequestHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    }

    /**
     * Set the login form fields as the POST request body.
     *
     * @param postMethod the POST method
     * @param loginInfo  the form fields
     */
    private static void setBody(PostMethod postMethod, NameValuePair[] loginInfo) {
        postMethod.setRequestBody(loginInfo);
    }

    /**
     * Fetch a page and return its HTML as a String (GET).
     *
     * @param url the URL
     * @return the whole page as a String
     */
    public static String get(String url) throws Exception {
        return get(url, null);
    }

    /**
     * Fetch a page with custom headers and return its HTML as a String (GET).
     *
     * @param url the URL
     * @param map custom headers
     * @return the whole page as a String
     */
    public static String get(String url, Map<String, String> map) throws Exception {
        String html = null;
        HttpClient httpClient = new HttpClient();
        HttpMethod httpMethod = new GetMethod(url);
        setHead(httpMethod, map);
        int status = httpClient.executeMethod(httpMethod);
        if (status == HttpStatus.SC_OK) {
            html = httpMethod.getResponseBodyAsString();
        }
        return html;
    }

    /**
     * Log in and return the session cookie (POST).
     *
     * @param url the login URL
     * @return the cookie string
     */
    public static String post(String url, NameValuePair[] loginInfo) throws Exception {
        HttpClient httpClient = new HttpClient();
        // Simulate the login; use POST because that is what the server expects
        PostMethod postMethod = new PostMethod(url);
        setHead(postMethod);
        setBody(postMethod, loginInfo);
        httpClient.getParams().setCookiePolicy(CookiePolicy.BROWSER_COMPATIBILITY);
        httpClient.executeMethod(postMethod);
        // Collect the cookies set during login
        Cookie[] cookies = httpClient.getState().getCookies();
        StringBuilder cookie = new StringBuilder();
        for (Cookie c : cookies) {
            cookie.append(c.toString()).append(";");
        }
        return cookie.toString();
    }

    /**
     * Open a download stream for a URL (GET).
     *
     * @param urlStr the URL
     * @return the InputStream, or null if the request failed
     */
    public static InputStream downLoadFromUrl(String urlStr) throws IOException {
        // Use HttpClient instead of URLConnection
        // HttpHost proxy = new HttpHost("116.226.217.54", 9999);
        HttpClient httpClient = new HttpClient();
        HttpMethod httpMethod = new GetMethod(urlStr);
        // HostConfiguration hostConfiguration = new HostConfiguration();
        // hostConfiguration.setHost("116.226.217.54", 9999);
        // httpClient.setHostConfiguration(hostConfiguration);
        setHead(httpMethod);
        int status = httpClient.executeMethod(httpMethod);
        InputStream responseBodyAsStream = null;
        if (status == HttpStatus.SC_OK) {
            responseBodyAsStream = httpMethod.getResponseBodyAsStream();
        }
        return responseBodyAsStream;
    }
}
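The cookie string that post() returns is just each cookie's name=value form joined with semicolons, which is the format the Cookie request header expects on the follow-up GET. A standalone sketch of that joining step (the cookie names below are placeholders, not OSChina's real ones):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CookieHeaderSketch {
    // Joins name=value pairs into a single Cookie request-header value,
    // mirroring what post() does with HttpClient's Cookie[] array.
    static String toCookieHeader(Map<String, String> cookies) {
        StringBuilder header = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            header.append(e.getKey()).append('=').append(e.getValue()).append(';');
        }
        return header.toString();
    }

    public static void main(String[] args) {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("oscid", "abc123");        // placeholder session id
        cookies.put("user_locale", "zh-CN");   // placeholder
        System.out.println(toCookieHeader(cookies));
        // oscid=abc123;user_locale=zh-CN;
    }
}
```

As long as the session cookie is among the pairs, the server treats the follow-up request as coming from the logged-in user.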
3. The main method
package com.dyw.crawler.project;

import com.dyw.crawler.util.CrawlerUtils;
import org.apache.commons.httpclient.NameValuePair;

import java.util.HashMap;
import java.util.Map;

/**
 * Simulated login.
 * Created by dyw on 2017/9/5.
 */
public class Project2 {
    public static void main(String[] args) {
        // Login URL for oschina.net
        String loginUrl = "https://www.oschina.net/action/user/hash_login?from=";
        // Private-message page, only reachable after logging in
        String dataUrl = "https://my.oschina.net/u/3673710/admin/inbox";
        // Credentials required by the login form
        NameValuePair[] loginInfo = {new NameValuePair("email", "your-account"),
                new NameValuePair("pwd", "your-password")};
        try {
            // Log in and capture the session cookie
            String cookie = CrawlerUtils.post(loginUrl, loginInfo);
            // Attach the cookie to the follow-up request
            Map<String, String> map = new HashMap<>();
            map.put("Cookie", cookie);
            String html = CrawlerUtils.get(dataUrl, map);
            System.out.println(html);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
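One way to sanity-check the printed HTML is to scan it for a marker that only appears when the login failed, such as the login form being echoed back. This is just a sketch; "login-form" is an assumed marker, and the real page markup would need to be inspected:

```java
public class LoginCheckSketch {
    // A crude login check: a logged-in inbox page should not contain the
    // login form, while a rejected request usually serves it back.
    // "login-form" is an assumed marker, not something OSChina guarantees.
    static boolean looksLoggedIn(String html) {
        return html != null && !html.contains("login-form");
    }

    public static void main(String[] args) {
        System.out.println(looksLoggedIn("<div class=\"inbox\">...</div>"));     // true
        System.out.println(looksLoggedIn("<form id=\"login-form\">...</form>")); // false
    }
}
```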
With this we can retrieve the contents of the private-message page.
The complete code is on GitHub; anyone who wants it can download it from https://github.com/dingyinwu81/crawler
If you have any suggestions for the code, please leave me a comment! ☺☺☺