WebMagic + Selenium Crawler: A Hands-On Demo

Original

凉拌海蜇丝

2020-02-28 08:01

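My earlier post on that crawler project only covered the general configuration and a few usage notes. Many readers left comments asking how to actually use it, so here is a quick walkthrough.

First, set up Maven and the JDK on your machine, and install the WebDriver binary that matches your browser version. Then pull the spider project from my GitHub. For details, see: https://blog.csdn.net/whiteBearClimb/article/details/103711670

Here is the complete pom.xml with its dependencies: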

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 https://maven.apache.org/xsd/maven-4.0.0.xsd">
	<modelVersion>4.0.0</modelVersion>
	<parent>
		<groupId>org.springframework.boot</groupId>
		<artifactId>spring-boot-starter-parent</artifactId>
		<version>2.2.4.RELEASE</version>
		<relativePath/> <!-- lookup parent from repository -->
	</parent>
	<groupId>com.example</groupId>
	<artifactId>demo</artifactId>
	<version>0.0.1-SNAPSHOT</version>
	<name>demo</name>
	<description>Demo project for Spring Boot</description>

	<properties>
		<java.version>1.8</java.version>
	</properties>

	<dependencies>
		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-web</artifactId>
		</dependency>

		<dependency>
			<groupId>org.springframework.boot</groupId>
			<artifactId>spring-boot-starter-test</artifactId>
			<scope>test</scope>
			<exclusions>
				<exclusion>
					<groupId>org.junit.vintage</groupId>
					<artifactId>junit-vintage-engine</artifactId>
				</exclusion>
			</exclusions>
		</dependency>
		<dependency>
			<groupId>us.codecraft</groupId>
			<artifactId>webmagic-core</artifactId>
			<version>0.7.3</version>
		</dependency>
		<dependency>
			<groupId>us.codecraft</groupId>
			<artifactId>webmagic-extension</artifactId>
			<version>0.7.3</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.mybatis/mybatis -->
		<dependency>
			<groupId>org.mybatis</groupId>
			<artifactId>mybatis</artifactId>
			<version>3.5.2</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/mysql/mysql-connector-java -->
		<dependency>
			<groupId>mysql</groupId>
			<artifactId>mysql-connector-java</artifactId>
			<version>5.1.47</version>
		</dependency>
		<dependency>
			<groupId>org.mybatis</groupId>
			<artifactId>mybatis-spring</artifactId>
			<version>1.2.1</version>
		</dependency>

		<dependency>
			<groupId>org.seleniumhq.selenium</groupId>
			<artifactId>selenium-java</artifactId>
			<version>3.8.1</version>
			<scope>test</scope>
		</dependency>
		<dependency>
			<groupId>org.seleniumhq.selenium</groupId>
			<artifactId>selenium-firefox-driver</artifactId>
		</dependency>
		<dependency>
			<groupId>org.seleniumhq.selenium</groupId>
			<artifactId>selenium-chrome-driver</artifactId>
		</dependency>

	</dependencies>

	<build>
		<plugins>
			<plugin>
				<groupId>org.springframework.boot</groupId>
				<artifactId>spring-boot-maven-plugin</artifactId>
			</plugin>
		</plugins>
	</build>

</project>

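Next, get the code in order. Once the IDE shows no red error markers, it is ready to use.

Let's look at the novel page we are going to scrape (from Hongxiu Tianxiang, hongxiu.com), purely as a harmless demonstration: https://www.hongxiu.com/chapter/15253682605357404/41012879017098795

Locate the main body text you want to scrape, then modify the code in the SpiderProcessor file:

driver.get("XXXXX"); takes the address of the page to scrape. I hard-coded it to keep the demo simple; you can pass it in from the controller instead.

driver.findElement(By.className("read-content")); takes the class name of the element that holds the body text, as located on the page above.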
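To make those two calls concrete, here is a minimal standalone sketch. It is not the actual SpiderProcessor from the project: the class name ChapterFetchSketch, the tidy helper, and the /path/to/chromedriver location are all made up for illustration, and it assumes Chrome and a matching chromedriver are installed.

```java
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;

// Hypothetical stand-in for the project's SpiderProcessor; names are illustrative only.
public class ChapterFetchSketch {

    /** Trims every line of the scraped text and drops the empty ones. */
    static String tidy(String raw) {
        StringBuilder out = new StringBuilder();
        for (String line : raw.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(trimmed);
            }
        }
        return out.toString();
    }

    /** Opens the page, reads the text of the first element with the given class, closes the browser. */
    static String fetchText(String url, String className) {
        WebDriver driver = new ChromeDriver();
        try {
            driver.get(url);
            // Throws NoSuchElementException if no element with this class is present.
            WebElement body = driver.findElement(By.className(className));
            return tidy(body.getText());
        } finally {
            driver.quit(); // always release the browser process
        }
    }

    public static void main(String[] args) {
        // Assumption: adjust this path to wherever your chromedriver binary lives.
        System.setProperty("webdriver.chrome.driver", "/path/to/chromedriver");
        String url = args.length > 0
                ? args[0]
                : "https://www.hongxiu.com/chapter/15253682605357404/41012879017098795";
        System.out.println(fetchText(url, "read-content"));
    }
}
```

Taking the URL as a parameter instead of hard-coding it is what lets a controller hand it over later, and the finally block keeps stray browser processes from piling up when a lookup fails.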

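Run it and check the result:

The small warning in the output comes from a version mismatch between the driver and the browser. The content still comes back, though.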
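If you want to catch that kind of mismatch before launching the browser, one rough approach is to compare the major version numbers. The sketch below assumes the Chrome convention where a chromedriver release shares its major version with the browser it targets; the class name DriverVersionCheck and the version strings are illustrative, and how you obtain the two strings is left out.

```java
// Hypothetical helper; not part of the spider project.
public class DriverVersionCheck {

    /** Returns the leading number of a bare dotted version such as "80.0.3987.122". */
    static int majorVersion(String version) {
        String head = version.trim().split("\\.")[0];
        return Integer.parseInt(head);
    }

    /** True when browser and driver report the same major version. */
    static boolean sameMajor(String browserVersion, String driverVersion) {
        return majorVersion(browserVersion) == majorVersion(driverVersion);
    }

    public static void main(String[] args) {
        // Illustrative version strings only.
        System.out.println(sameMajor("80.0.3987.122", "80.0.3987.106")); // prints true
        System.out.println(sameMajor("80.0.3987.122", "79.0.3945.36"));  // prints false
    }
}
```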

Because I did not enable headless mode, the driver opened a visible browser window.
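For reference, this is roughly how headless mode can be switched on for Chrome. It is a sketch under assumptions: the class name HeadlessSketch and the window size are made up, and chromedriver is expected to be on the PATH or configured through the webdriver.chrome.driver system property.

```java
import java.util.Arrays;
import java.util.List;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

// Hypothetical example class; not taken from the spider project.
public class HeadlessSketch {

    /** Command-line switches passed to Chrome; the window size is an arbitrary choice. */
    static List<String> headlessArguments() {
        return Arrays.asList("--headless", "--window-size=1280,800");
    }

    /** Builds Chrome options that run the browser without a visible window. */
    static ChromeOptions headlessOptions() {
        ChromeOptions options = new ChromeOptions();
        options.addArguments(headlessArguments());
        return options;
    }

    public static void main(String[] args) {
        WebDriver driver = new ChromeDriver(headlessOptions());
        try {
            driver.get("https://www.hongxiu.com/chapter/15253682605357404/41012879017098795");
            System.out.println(driver.getTitle());
        } finally {
            driver.quit(); // close the headless browser process
        }
    }
}
```

Headless mode suits running the crawler on a server, while the visible window is handy when you are still working out which element to target.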

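As you can see, the content was scraped successfully. For more complex operations, look up the material available online and experiment; it is fairly straightforward.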


