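Building a Web Crawler with Java + Jsoup

For learning and discussion purposes only.

Requirements analysis:

Target resource: scrape the Java job postings from a recruitment website and save them to a database.

Code example:

1. Preparation

① Add the dependencies (pom.xml)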

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>

        <!--jsoup-->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.2</version>
        </dependency>

        <!--HTTP client library-->
        <dependency>
            <groupId>org.apache.httpcomponents</groupId>
            <artifactId>httpclient</artifactId>
            <version>4.5.6</version>
        </dependency>

        <!--I/O utility library-->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.6</version>
        </dependency>

        <!--MySQL driver-->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.46</version>
        </dependency>

        <!--Druid connection pool-->
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid</artifactId>
            <version>1.1.10</version>
        </dependency>

        <!--JdbcTemplate (Spring JDBC and the modules it needs)-->
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-core</artifactId>
            <version>5.0.8.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-beans</artifactId>
            <version>5.0.8.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-jdbc</artifactId>
            <version>5.0.8.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>org.springframework</groupId>
            <artifactId>spring-tx</artifactId>
            <version>5.0.8.RELEASE</version>
        </dependency>
        <dependency>
            <groupId>commons-logging</groupId>
            <artifactId>commons-logging</artifactId>
            <version>1.2</version>
        </dependency>

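② Prepare the database connection pool and its configuration file

The logic for obtaining a database connection from the pool is wrapped in a utility class (JDBCUtils.java).

Database connection pool (JDBCUtils)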

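import java.io.InputStream;
import java.sql.Connection;
import java.sql.SQLException;
import java.util.Properties;
import javax.sql.DataSource;
import com.alibaba.druid.pool.DruidDataSourceFactory;

public class JDBCUtils {
    // Data source backed by the Druid connection pool
    private static DataSource createDataSource;
    static {
        try {
            Properties pros = new Properties();
            // Load druid.properties from the root of the classpath
            InputStream is = JDBCUtils.class.getResourceAsStream("/druid.properties");
            //InputStream is = ClassLoader.getSystemClassLoader().getResourceAsStream("druid.properties");
            pros.load(is);
            createDataSource = DruidDataSourceFactory.createDataSource(pros);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    // Returns a connection from the pool
    public static Connection getConnection1() throws SQLException {
        return createDataSource.getConnection();
    }
    // Returns the data source itself
    public static DataSource getDataSource() {
        return createDataSource;
    }
}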

Druid connection pool configuration file (druid.properties)

url=jdbc:mysql:///recruitmentspider
username=root
password=root
driverClassName=com.mysql.jdbc.Driver
initialSize=10
maxActive=10
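
JDBCUtils looks for druid.properties at the root of the classpath (in a Maven project that would normally mean src/main/resources). If the file is not there, getResourceAsStream returns null and the failure only shows up as a stack trace from the static initializer. The following is a minimal sketch, using only the JDK, of loading the file with an explicit check. The class name ClasspathProperties and its load method are made up for this illustration and are not part of the original code.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper, not part of the original project.
class ClasspathProperties {

    /** Loads a properties file from the classpath and fails with a clear message if it is absent. */
    static Properties load(String resourcePath) {
        try (InputStream is = ClasspathProperties.class.getResourceAsStream(resourcePath)) {
            if (is == null) {
                throw new IllegalStateException("Classpath resource not found: " + resourcePath);
            }
            Properties props = new Properties();
            props.load(is);
            return props;
        } catch (IOException e) {
            throw new IllegalStateException("Could not read " + resourcePath, e);
        }
    }

    public static void main(String[] args) {
        try {
            Properties props = load("/druid.properties");
            System.out.println("url = " + props.getProperty("url"));
        } catch (IllegalStateException e) {
            // Reached when the file is not on the classpath
            System.out.println(e.getMessage());
        }
    }
}
```

The returned Properties object could then be handed to DruidDataSourceFactory.createDataSource in the same way JDBCUtils does.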

2. Scraping the data

Use Jsoup to parse the HTML, collect the data, and store it in the database (SpiderLagouTest)

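import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.http.HttpEntity;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test;
import org.springframework.jdbc.core.JdbcTemplate;

public class SpiderLagouTest {
    int substring = 1; // current page number
    @Test
    public void test() throws IOException {
        String url = "https://www.lagou.com/zhaopin/Java/" + substring + "/";
        // Scrape the job postings
        fetchRecruitmentData(url);
    }

    private void fetchRecruitmentData(String url) throws IOException {
        try {
            // Wait 10 seconds before each request (when crawling continuously, no data comes back after five or six pages)
            Thread.sleep(10000);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
        // 1. Fetch the URL and parse it into a Document
        Document document = Jsoup.connect(url).get();
        // 2. Select the job posting Elements and process each one in turn
        Elements elements = document.select(".item_con_list .con_list_item");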
        for (Element element : elements) {
            // Company name
            String companyName = element.select(".company_name a").text();
            System.out.println("Company name: " + companyName);
            // Work location
            String workAddress = element.select(".add em").text();
            System.out.println("Work location: " + workAddress);
            // Job title
            String tip = element.select(".p_top h3").text();
            System.out.println("Job title: " + tip);

            // Salary, work experience, and education requirement
            String money_bot = element.select(".p_bot").text(); // money_bot looks like: 15k-25k 經驗3-5年 / 本科
            System.out.println(money_bot);
            // substring(beginIndex, endIndex) returns the characters from beginIndex up to endIndex - 1,
            // so the result has length endIndex - beginIndex.
            String money = money_bot.substring(0, money_bot.indexOf(" "));
            System.out.println("Salary range: " + money);
            // substring(beginIndex) returns everything from beginIndex to the end of the string;
            // indexOf(str) returns the index of the first occurrence of str.
            // trim() drops the space that sits just before the "/" separator.
            String workExperience = money_bot.substring(money_bot.indexOf(" ") + 1, money_bot.indexOf("/")).trim();
            System.out.println("Work experience: " + workExperience);
            String education = money_bot.substring(money_bot.indexOf("/") + 2);
            System.out.println("Education: " + education);

            // Industry, funding stage, and company size
            String synopsis = element.select(".industry").text(); // synopsis looks like: 移動互聯網,硬件 / D輪及以上 / 2000人以上
            // Industry (trim() drops the space before the "/" separator)
            String industryfield = synopsis.substring(0, synopsis.indexOf("/")).trim();
            System.out.println("Industry: " + industryfield);
            // Funding stage
            String financingStage = synopsis.substring(synopsis.indexOf("/") + 2, synopsis.lastIndexOf("/")).trim();
            System.out.println("Funding stage: " + financingStage);
            // Company size
            String companySize = synopsis.substring(synopsis.lastIndexOf("/") + 2);
            System.out.println("Company size: " + companySize);

            // Skill or benefit tags
            String skill = element.select(".list_item_bot .li_b_l").text();
            System.out.println("Skill or benefit tags: " + skill);

            // Benefits
            String welfare = element.select(".li_b_r").text();
            System.out.println("Benefits: " + welfare);

            // Company logo
            String src = element.select(".com_logo img").attr("src");
            // src is protocol-relative, e.g. //www.lgstatic.com/thumbnail_120x120/i/image/M00/A5/6B/Cgp3O1ir8wOAJzPbAAIHeppEuoE288.png
            String path = fetchImage("http:" + src);
            System.out.println("Image saved to: " + path);

            // Save to the database
            JdbcTemplate jdbcTemplate = new JdbcTemplate(JDBCUtils.getDataSource());
            String sql="INSERT INTO lagou_java2 (id,companyName,workAddress,tip,money,workExperience,education,industryfield,financingStage,companySize,skill,welfare,path) VALUES (null,?,?,?,?,?,?,?,?,?,?,?,?);";
            jdbcTemplate.update(sql,companyName,workAddress,tip,money,workExperience,education,industryfield,financingStage,companySize,skill,welfare,path);
            System.out.println("---------------------");
        }

        // 3. Build the URL of the next page
        // The browser developer tools show that the next page's link looks like: https://www.lagou.com/zhaopin/Java/2/
        // The page number (1, 2, or 3 digits) is read back from the end of the current URL and incremented.
        if (substring < 10) {
            substring = Integer.parseInt(url.substring(url.lastIndexOf("/") - 1, url.lastIndexOf("/"))) + 1;
            System.out.println(substring + "<10");
        } else if (substring >= 10 && substring < 100) {
            substring = Integer.parseInt(url.substring(url.lastIndexOf("/") - 2, url.lastIndexOf("/"))) + 1;
            System.out.println(substring + ">=10&&" + substring + "<100");
        } else if (substring >= 100) {
            substring = Integer.parseInt(url.substring(url.lastIndexOf("/") - 3, url.lastIndexOf("/"))) + 1;
        }
        System.out.println("Starting to crawl page " + substring);
        String href = "https://www.lagou.com/zhaopin/Java/" + substring + "/";
        System.out.println(href);
        System.out.println("============================================================================");
        // Recurse into the next page; there is no stop condition, so the crawl runs until it is stopped manually
        fetchRecruitmentData(href);
    }


    private static String fetchImage(String src) throws IOException {
        // Local path and file name for the downloaded image, e.g. for
        // src = http://www.lgstatic.com/thumbnail_120x120/i/image/M00/A5/6B/Cgp3O1ir8wOAJzPbAAIHeppEuoE288.png
        String localPath = "I:\\testSpider\\" + src.substring(src.lastIndexOf("/") + 1);
        // 1. Create the GET request for the image URL
        HttpGet get = new HttpGet(src);
        // 2. Create the HTTP client (it plays the role of a browser) and send the request;
        //    try-with-resources closes both the response and the client when the block ends
        try (CloseableHttpClient client = HttpClients.createDefault();
             CloseableHttpResponse response = client.execute(get)) {
            // 3. Only save the body when the response is a normal one (HTTP 200)
            if (response.getStatusLine().getStatusCode() == 200) {
                // 4. Get the response body
                HttpEntity entity = response.getEntity();
                // 5. The body holds the binary image data: read it from the input stream
                //    and copy it into a file output stream; both streams are closed automatically
                try (InputStream inputStream = entity.getContent();
                     OutputStream outputStream = new FileOutputStream(localPath)) {
                    org.apache.commons.io.IOUtils.copy(inputStream, outputStream);
                } catch (FileNotFoundException e) {
                    System.out.println("src= " + src + " could not save the image");
                }
            }
        }
        return localPath;
    }
}
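
The field extraction above relies on substring/indexOf arithmetic, which throws StringIndexOutOfBoundsException as soon as a posting lacks a space or a "/" (indexOf then returns -1). As an alternative, here is a minimal sketch that splits on the separators instead and returns empty strings for missing parts. JobTextParser and its methods are hypothetical names introduced for this illustration, and the sample strings are ASCII stand-ins with the same "salary experience / education" and "industry / stage / size" shape as the real page text.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the original project.
class JobTextParser {

    /** Splits "a / b / c" into exactly `parts` trimmed fields, padding with "" when fields are missing. */
    static String[] splitFields(String text, int parts) {
        String[] result = new String[parts];
        Arrays.fill(result, "");
        // The limit keeps any extra "/" inside the last field
        String[] raw = text.trim().split("\\s*/\\s*", parts);
        for (int i = 0; i < raw.length; i++) {
            result[i] = raw[i].trim();
        }
        return result;
    }

    /** Parses "salary experience / education" into {salary, experience, education}. */
    static String[] parseSummary(String text) {
        String[] bySlash = splitFields(text, 2);      // {"salary experience", "education"}
        String[] head = bySlash[0].split("\\s+", 2);  // {"salary", "experience"}
        String experience = head.length > 1 ? head[1] : "";
        return new String[] {head[0], experience, bySlash[1]};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseSummary("15k-25k 3-5 years / Bachelor")));
        System.out.println(Arrays.toString(splitFields("Mobile,Hardware / Series D / 2000+", 3)));
    }
}
```

In the crawler, parseSummary would take the text of ".p_bot" and splitFields(text, 3) the text of ".industry".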

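3. Results and data processing

Console output:

Refresh and check the data in the database:

Export the data from the database to Excel:

Problems encountered:
When crawling continuously, no data came back after about six pages. At first I suspected the page index in the URL, but the logs showed the index was correct and the next-page URL opened fine in a browser; only the program got nothing back. Then I wondered whether the requests were too frequent and the crawler had been put in the "penalty box", so I added a 5-second sleep before fetching each page. That did not help: the data still stopped after six pages. I decided this was probably not the cause and spent a long time trying other things. In the end I came back to the idea that the requests were too frequent and raised the sleep to 10 seconds, after which 30 pages of job postings were crawled successfully.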
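
Rather than hard-coding a single sleep value, the wait could grow whenever a page comes back empty. The sketch below is only an illustration of that idea: it assumes the empty pages really are caused by request frequency (which is a guess, as described above), and the class CrawlDelay, the doubling rule, and the 60-second cap are all assumptions rather than anything the site documents.

```java
// Hypothetical helper, not part of the original project.
class CrawlDelay {

    /**
     * Returns the wait before the next request: baseMillis, doubled once for each
     * consecutive empty response, and never more than maxMillis.
     */
    static long delayMillis(long baseMillis, int emptyResponses, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < emptyResponses && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    public static void main(String[] args) {
        // Start from the 10 seconds that worked above, cap at 60 seconds (assumed value)
        for (int empty = 0; empty <= 3; empty++) {
            System.out.println(empty + " empty responses -> " + delayMillis(10_000, empty, 60_000) + " ms");
        }
    }
}
```

The crawler would pass the result to Thread.sleep and reset the empty-response count once a page returns postings again.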

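Known issue:
For example, the Java job listings have 30 pages; after page 30 has been crawled, the program keeps requesting further pages and has to be stopped manually.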
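
One possible way around this is to replace the recursion with a loop that keeps its own page counter and stops at the first page without postings, or at a page limit. The sketch below is not the code used above: PagedCrawler, crawl, and the fetchPage callback are hypothetical, and the real Jsoup request is replaced by a stand-in so that the example runs with the JDK alone.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical helper, not part of the original project.
class PagedCrawler {
    static final String BASE_URL = "https://www.lagou.com/zhaopin/Java/";

    /** Builds the listing URL for a page number, e.g. page 2 -> .../Java/2/ */
    static String pageUrl(int page) {
        return BASE_URL + page + "/";
    }

    /**
     * Visits pages 1..maxPages in order and stops early at the first page
     * that yields no items. Returns the number of pages that had items.
     */
    static int crawl(IntFunction<List<String>> fetchPage, int maxPages) {
        int crawled = 0;
        for (int page = 1; page <= maxPages; page++) {
            List<String> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // nothing on this page: treat it as the end of the listing
            }
            crawled++;
        }
        return crawled;
    }

    public static void main(String[] args) {
        // Stand-in fetcher: pretends the site has 30 pages with one posting each
        IntFunction<List<String>> fake = page -> page <= 30
                ? Collections.singletonList("posting from " + pageUrl(page))
                : Collections.<String>emptyList();
        System.out.println(crawl(fake, 100)); // prints 30
    }
}
```

In the real crawler, fetchPage would wrap the Jsoup request and the selector for the job items. Because a rate-limited request also comes back without data, an empty page is ambiguous, so this stop rule is best combined with a long enough delay between requests.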
