Front-end work sometimes runs into projects that need ad-hoc data collection from the web. What approach is simple to understand and usable long-term? Driving a headless browser from your test harness is the most convenient option: take the test code you already use day to day, make a few small changes, pair it with a crawler proxy, and you can start collecting data right away.
HtmlUnit is a headless browser for Java. Through its API it simulates a real user agent over HTTP: it can request pages, submit forms, follow links, and so on. It supports complex JavaScript and AJAX libraries and can emulate several browsers, including Chrome, Firefox, and IE.
Below is a simple demo that visits an IP-lookup site through a crawler proxy. Swap the target URL for the page you actually want to collect and you will get that page's data back; add a parsing module on top and it is basically usable. The example:
package htmlunit;

import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.client.CredentialsProvider;
import org.apache.http.impl.client.BasicCredentialsProvider;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlunitDemo {
    // Proxy server (product site: www.16yun.cn)
    final static String proxyHost = "t.16yun.cn";
    final static Integer proxyPort = 31111;

    // Proxy credentials
    final static String proxyUser = "USERNAME";
    final static String proxyPass = "PASSWORD";

    public static void main(String[] args) {
        // Register the proxy username/password
        CredentialsProvider credsProvider = new BasicCredentialsProvider();
        credsProvider.setCredentials(
                new AuthScope(proxyHost, proxyPort),
                new UsernamePasswordCredentials(proxyUser, proxyPass));

        // Emulate Chrome and route all traffic through the proxy
        WebClient webClient = new WebClient(BrowserVersion.CHROME, proxyHost, proxyPort);
        webClient.setCredentialsProvider(credsProvider);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setActiveXNative(false);
        webClient.getOptions().setCssEnabled(false);

        try {
            HtmlPage page = webClient.getPage("http://httpbin.org/ip");
            // Wait up to 30s for background AJAX to finish before reading the page
            webClient.waitForBackgroundJavaScript(30000);
            System.out.println(page.asXml());
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the client only after the page has been read
            webClient.close();
        }
    }
}
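To build the demo you need HtmlUnit on the classpath (it pulls in the Apache HttpClient classes used for the proxy credentials transitively). A minimal Maven dependency sketch follows; the version number here is an assumption, so check Maven Central for the release you want:

```xml
<dependency>
    <groupId>net.sourceforge.htmlunit</groupId>
    <artifactId>htmlunit</artifactId>
    <version>2.70.0</version>
</dependency>
```

With the dependency in place, running the class should print httpbin's JSON response (wrapped in the XML of the rendered page), showing the proxy's exit IP rather than your own.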