Java Crawler, Step One (HtmlUnit)

A first look at HtmlUnit

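HtmlUnit is an open-source Java tool for analyzing web pages: once a page has been loaded, its content can be inspected through the HtmlUnit API. The project simulates a running browser and is often described as an open-source Java implementation of a browser. Because it is a browser without a user interface (headless), it runs quickly. It is not itself an extension of JUnit, but it is commonly used inside test frameworks such as JUnit or TestNG.

It runs the page's JavaScript with an engine based on Rhino.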

 

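Conventionally, the project is used for testing web pages and for automated web testing, including pages that rely on JavaScript.

It is also very common in small crawler projects, because it can parse the DOM tags of a page and run the page's JavaScript to obtain values that only exist after the scripts have executed.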
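As a small illustration of DOM analysis, the sketch below loads a page and lists the `href` of every `<a>` tag. The class name `LinkExtractDemo`, the helper `extractHrefs`, and the sample HTML are made up for this example. The page is written to a temporary file and loaded through a `file:` URL so the sketch runs without network access. HtmlUnit 2.33 is assumed, the version declared in this project's pom.xml.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractDemo {

    // Hypothetical sample page, used so the sketch needs no network access.
    static final String HTML =
            "<html><head><title>Sample</title></head><body>"
            + "<a href='/a.html'>A</a><a href='/b.html'>B</a>"
            + "</body></html>";

    // Loads the given HTML with HtmlUnit and returns the raw href of each anchor.
    static List<String> extractHrefs(String html) throws IOException {
        Path file = Files.createTempFile("htmlunit-links", ".html");
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        try (WebClient webClient = new WebClient()) {
            HtmlPage page = webClient.getPage(file.toUri().toURL());
            List<String> hrefs = new ArrayList<>();
            for (HtmlAnchor anchor : page.getAnchors()) {
                hrefs.add(anchor.getHrefAttribute());
            }
            return hrefs;
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractHrefs(HTML));
    }
}
```

For a real site, the temporary file would be replaced by the page URL passed to `getPage`.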

 

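For the crawlers we are studying, HttpClient plus jsoup alone is not enough, because some page data is loaded by JavaScript and HttpClient only retrieves the raw response. Examples include Baidu Cloud user pages, Taobao pages, and even blog content on OSChina (開源中國), all of which is loaded or processed by JavaScript, so HttpClient cannot extract it. HtmlUnit is a good solution here: it embeds a JavaScript-capable browser, runs the scripts, and produces the resulting page, which is exactly what we want. Anyone writing crawlers should therefore get comfortable with HtmlUnit. (Excerpted from http://blog.java1234.com)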
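The difference can be seen with a tiny page whose content is produced by a script. In the sketch below, the raw HTML contains an empty `<div>`, and only after HtmlUnit runs the inline script does the element have text. The class name `JsRenderDemo`, the helper `renderedText`, and the sample page are hypothetical. The page is loaded from a temporary file so no network is needed, and HtmlUnit 2.33 is assumed.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsRenderDemo {

    // Hypothetical page: the <div> is empty in the source and filled in by the script.
    static final String HTML =
            "<html><head><title>demo</title></head><body>"
            + "<div id='price'></div>"
            + "<script>"
            + "document.getElementById('price').textContent = 'price: ' + (6 * 7);"
            + "</script>"
            + "</body></html>";

    // Loads the HTML with HtmlUnit (JavaScript is enabled by default)
    // and returns the text of the element with id "price" after the script has run.
    static String renderedText(String html) throws IOException {
        Path file = Files.createTempFile("htmlunit-js", ".html");
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setCssEnabled(false); // CSS is not needed for this sketch
            HtmlPage page = webClient.getPage(file.toUri().toURL());
            return page.getElementById("price").getTextContent();
        } finally {
            Files.deleteIfExists(file);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(renderedText(HTML));
    }
}
```

A client that only downloads the response body would see the empty `<div>`, since the value is computed in the browser.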

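Project layout (this crawler is built with Maven. The pom.xml below was copied from an earlier project, so many of its dependencies are not important here; the entries that matter for the crawler are jsoup and htmlunit):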

pom.xml

<?xml version="1.0" encoding="UTF-8"?>

 

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"

xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

<modelVersion>4.0.0</modelVersion>

 

<groupId>UserEchart</groupId>

<artifactId>fly</artifactId>

<version>1.0-SNAPSHOT</version>

<packaging>war</packaging>

 

<name>fly Maven Webapp</name>

<!-- FIXME change it to the project's website -->

<url>http://www.example.com</url>

 

<properties>

 

<project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>

<maven.compiler.encoding>UTF-8</maven.compiler.encoding>

 

<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

<spring.version>5.0.2.RELEASE</spring.version>

<mybatis.version>3.4.5</mybatis.version>

</properties>

 

<dependencies>

 

<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->

<!-- alipay -->

<dependency>

<groupId>commons-codec</groupId>

<artifactId>commons-codec</artifactId>

<version>1.10</version>

</dependency>

<dependency>

<groupId>commons-configuration</groupId>

<artifactId>commons-configuration</artifactId>

<version>1.10</version>

</dependency>

<dependency>

<groupId>commons-logging</groupId>

<artifactId>commons-logging</artifactId>

<version>1.1.1</version>

</dependency>

<dependency>

<groupId>com.google.zxing</groupId>

<artifactId>core</artifactId>

<version>2.1</version>

</dependency>

<dependency>

<groupId>com.google.code.gson</groupId>

<artifactId>gson</artifactId>

<version>2.7</version>

</dependency>

<dependency>

<groupId>org.hamcrest</groupId>

<artifactId>hamcrest-core</artifactId>

<version>1.3</version>

</dependency>

 

<!-- Spring -->

<dependency>

<groupId>org.springframework</groupId>

<artifactId>spring-context</artifactId>

<version>${spring.version}</version>

</dependency>

<dependency>

<groupId>org.springframework</groupId>

<artifactId>spring-jdbc</artifactId>

<version>${spring.version}</version>

</dependency>

<dependency>

<groupId>org.springframework</groupId>

<artifactId>spring-webmvc</artifactId>

<version>${spring.version}</version>

</dependency>

<dependency>

<groupId>commons-lang</groupId>

<artifactId>commons-lang</artifactId>

<version>2.6</version>

</dependency>

 

<!--<dependency>-->

<!--<groupId>net.sourceforge.jexcelapi</groupId>-->

<!--<artifactId>jxl</artifactId>-->

<!--<version>2.6.12</version>-->

<!--</dependency>-->

 

<dependency>

<groupId>org.springframework</groupId>

<artifactId>spring-test</artifactId>

<version>${spring.version}</version>

<scope>test</scope>

</dependency>

<!-- mybatis -->

<dependency>

<groupId>org.mybatis</groupId>

<artifactId>mybatis</artifactId>

<version>${mybatis.version}</version>

</dependency>

<!-- mybatis-spring -->

<dependency>

<groupId>org.mybatis</groupId>

<artifactId>mybatis-spring</artifactId>

<version>1.3.0</version>

</dependency>

 

<!-- MySQL driver -->

<dependency>

<groupId>mysql</groupId>

<artifactId>mysql-connector-java</artifactId>

<version>5.1.38</version>

<scope>runtime</scope>

</dependency>

<!-- Druid connection pool -->

<dependency>

<groupId>com.alibaba</groupId>

<artifactId>druid</artifactId>

<version>1.0.26</version>

</dependency>

 

<!-- AspectJ weaver -->

<dependency>

<groupId>org.aspectj</groupId>

<artifactId>aspectjweaver</artifactId>

<version>1.8.7</version>

</dependency>

 

<!-- JSON processing -->

<dependency>

<groupId>com.fasterxml.jackson.core</groupId>

<artifactId>jackson-databind</artifactId>

<version>2.9.5</version>

</dependency>

<dependency>

<groupId>com.alibaba</groupId>

<artifactId>fastjson</artifactId>

<version>1.2.41</version>

</dependency>

 

<!-- JUnit 4 testing -->

<dependency>

<groupId>junit</groupId>

<artifactId>junit</artifactId>

<version>4.12</version>

<scope>test</scope>

</dependency>

 

<!-- Servlet-API -->

<dependency>

<groupId>javax.servlet</groupId>

<artifactId>javax.servlet-api</artifactId>

<version>3.0.1</version>

<scope>provided</scope>

</dependency>

 

 

<!-- JSTL tag library -->

<dependency>

<groupId>jstl</groupId>

<artifactId>jstl</artifactId>

<version>1.2</version>

</dependency>

 

 

<!-- Logging -->

<dependency>

<groupId>org.slf4j</groupId>

<artifactId>slf4j-api</artifactId>

<version>1.7.21</version>

</dependency>

<dependency>

<groupId>org.slf4j</groupId>

<artifactId>slf4j-log4j12</artifactId>

<version>1.7.21</version>

</dependency>

 

 

 

<!-- https://mvnrepository.com/artifact/com.alipay.sdk/alipay-sdk-java -->


<dependency>

<groupId>com.alipay.sdk</groupId>

<artifactId>alipay-sdk-java</artifactId>

<version>3.0.0</version>

</dependency>

 

 

<!-- Lombok -->

<dependency>

<groupId>org.projectlombok</groupId>

<artifactId>lombok</artifactId>

<version>1.16.6</version>

<scope>compile</scope>

</dependency>

<dependency>

<groupId>org.testng</groupId>

<artifactId>testng</artifactId>

<version>RELEASE</version>

<scope>compile</scope>

</dependency>

<dependency>

<groupId>junit</groupId>

<artifactId>junit</artifactId>

<version>RELEASE</version>

<scope>compile</scope>

</dependency>

 

 

<!-- https://mvnrepository.com/artifact/activation/activation -->

<dependency>

<groupId>activation</groupId>

<artifactId>activation</artifactId>

<version>1.0.2</version>

</dependency>

 

<!-- https://mvnrepository.com/artifact/org.apache.commons/commons-email -->

<dependency>

<groupId>org.apache.commons</groupId>

<artifactId>commons-email</artifactId>

<version>1.4</version>

</dependency>

 

<!-- https://mvnrepository.com/artifact/javax.mail/javax.mail-api -->

<dependency>

<groupId>javax.mail</groupId>

<artifactId>javax.mail-api</artifactId>

<version>1.4.4</version>

</dependency>

 

 

<dependency>

<groupId>org.springframework</groupId>

<artifactId>spring-test</artifactId>

<version>RELEASE</version>

<scope>compile</scope>

</dependency>

 

<!-- lucene -->

<dependency>

<groupId>org.apache.lucene</groupId>

<artifactId>lucene-core</artifactId>

<version>4.10.4</version>

</dependency>

<!-- Lucene highlighter -->

<dependency>

<groupId>org.apache.lucene</groupId>

<artifactId>lucene-highlighter</artifactId>

<version>4.10.4</version>

</dependency>

<!-- IK -->

<dependency>

<groupId>com.janeluo</groupId>

<artifactId>ikanalyzer</artifactId>

<version>2012_u6</version>

</dependency>

<dependency>

<groupId>org.apache.lucene</groupId>

<artifactId>lucene-queryparser</artifactId>

<version>4.10.4</version>

</dependency>

 

<!-- MyBatis pagination plugin -->

<!-- https://mvnrepository.com/artifact/com.github.pagehelper/pagehelper -->

<dependency>

<groupId>com.github.pagehelper</groupId>

<artifactId>pagehelper</artifactId>

<version>5.1.8</version>

</dependency>

 

<!-- jsoup and HtmlUnit crawler libraries -->

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->

<dependency>

<groupId>org.jsoup</groupId>

<artifactId>jsoup</artifactId>

<version>1.11.3</version>

</dependency>

<!-- https://mvnrepository.com/artifact/net.sourceforge.htmlunit/htmlunit -->

<dependency>

<groupId>net.sourceforge.htmlunit</groupId>

<artifactId>htmlunit</artifactId>

<version>2.33</version>

</dependency>

 

<!-- Utility for reading the User-Agent request header -->

<!-- https://mvnrepository.com/artifact/eu.bitwalker/UserAgentUtils -->

<dependency>

<groupId>eu.bitwalker</groupId>

<artifactId>UserAgentUtils</artifactId>

<version>1.21</version>

</dependency>

 

<!-- Ehcache core jar -->

<dependency>

<groupId>net.sf.ehcache</groupId>

<artifactId>ehcache-core</artifactId>

<version>2.6.11</version>

</dependency>

<!-- MyBatis-Ehcache integration jar -->

<dependency>

<groupId>org.mybatis.caches</groupId>

<artifactId>mybatis-ehcache</artifactId>

<version>1.1.0</version>

</dependency>

<dependency>

<groupId>cglib</groupId>

<artifactId>cglib</artifactId>

<version>2.2.2</version>

</dependency>

 

<!-- ECharts utility library -->

<dependency>

<groupId>com.github.abel533</groupId>

<artifactId>ECharts</artifactId>

<version>3.0.0.2</version>

</dependency>

 

<dependency>

<groupId>commons-dbutils</groupId>

<artifactId>commons-dbutils</artifactId>

<version>1.7</version>

</dependency>

<!-- Geolocation utility library -->

<dependency>

<groupId>com.maxmind.geoip2</groupId>

<artifactId>geoip2</artifactId>

<version>2.8.1</version>

</dependency>

 

<!-- Tag conversion utility library -->

<dependency>

<groupId>com.0opslab</groupId>

<artifactId>opslabJutil</artifactId>

<version>1.0.8</version>

</dependency>

 

</dependencies>

<build>

 

<plugins>

 

<!-- Java compiler plugin -->

<plugin>

<groupId>org.apache.maven.plugins</groupId>

<artifactId>maven-compiler-plugin</artifactId>

<configuration>

<source>1.8</source>

<target>1.8</target>

<encoding>UTF-8</encoding>

<compilerArguments>

<extdirs>${project.basedir}/src/main/webapp/WEB-INF/lib</extdirs>

</compilerArguments>

</configuration>

</plugin>

<plugin>

<groupId>org.mybatis.generator</groupId>

<artifactId>mybatis-generator-maven-plugin</artifactId>

<version>1.3.2</version>

<configuration>

<configurationFile>src/main/resources/generatorConfig.xml

</configurationFile>

<verbose>true</verbose>

<overwrite>false</overwrite>

</configuration>

<dependencies>

<dependency>

<groupId>mysql</groupId>

<artifactId>mysql-connector-java</artifactId>

<version>5.1.38</version>

<scope>runtime</scope>

</dependency>

</dependencies>

</plugin>

 

<!-- Add a Tomcat plugin -->

<plugin>

<groupId>org.apache.tomcat.maven</groupId>

<artifactId>tomcat7-maven-plugin</artifactId>

<configuration>

<!-- Port Tomcat starts on -->

<port>80</port>

<!-- Application context path -->

<path>/</path>

<!-- Avoid garbled Chinese characters in URIs -->

<uriEncoding>UTF-8</uriEncoding>

</configuration>

<!-- Run this container during the package phase -->

<executions>

<execution>

<phase>package</phase>

<goals>

<goal>run</goal>

</goals>

</execution>

</executions>

</plugin>

</plugins>

</build>

</project>

 

 

 

SpiderTest (the key part)

package test;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

import java.io.IOException;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration("classpath:application.xml")
public class SpiderTest {

    @Test
    public void spider() {
        WebClient webClient = new WebClient(); // create the headless browser client
        try {
            HtmlPage htmlPage = webClient.getPage("http://baidu.com"); // fetch the page
            System.out.println("Page HTML: " + htmlPage.asXml()); // page markup, serialized as XML
            System.out.println("Page text: " + htmlPage.asText()); // text content of the page
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            webClient.close(); // close the client and release its resources
        }
    }
}

 

 

Running SpiderTest prints the page's HTML, followed by its text.
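On real sites, scripts often fail or finish late, so a crawler usually relaxes a few client options and waits for background JavaScript before reading the page. The sketch below shows one possible setup and then hands the rendered markup to jsoup for CSS-selector queries. The class name `RenderedPageFetcher`, the helpers `newClient` and `fetch`, and the timeout values are assumptions for illustration, not part of the original project. HtmlUnit 2.33 and jsoup 1.11.3 are assumed, as declared in the pom.xml.

```java
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;

public class RenderedPageFetcher {

    // Builds a client with crawler-friendly settings (values are illustrative).
    static WebClient newClient() {
        WebClient webClient = new WebClient(BrowserVersion.CHROME);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.getOptions().setCssEnabled(false);                          // skip CSS processing
        webClient.getOptions().setThrowExceptionOnScriptError(false);         // tolerate broken scripts
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);   // tolerate 4xx/5xx responses
        webClient.getOptions().setTimeout(10_000);                            // timeout in milliseconds
        return webClient;
    }

    // Loads the URL (expected to return HTML), waits up to jsWaitMillis for
    // background scripts, and returns the rendered page as a jsoup Document.
    static Document fetch(String url, long jsWaitMillis) throws IOException {
        try (WebClient webClient = newClient()) {
            HtmlPage page = webClient.getPage(url);
            webClient.waitForBackgroundJavaScript(jsWaitMillis);
            return Jsoup.parse(page.asXml(), url);
        }
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("http://baidu.com", 3_000);
        System.out.println(doc.title());
    }
}
```

Keeping HtmlUnit for rendering and jsoup for querying lets existing jsoup selector code be reused on pages whose content is generated by JavaScript.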

 
