JavaSE Mini Project 1: Scraping All the Memes from a Doutu (Meme-Battle) Site with Java

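Chatting with friends always takes plenty of memes, and some people go out of their way to collect them to see who can out-meme whom. Today I used Java to scrape every meme from a doutu site to stock up my own collection. The code logic may not be perfect, haha, and it still took me a few hours to finish.
Once everything was downloaded, the images came to 225 MB in total.
Approach: parse the page source to get each image's URL, then download the image from that URL to local disk. This means you need to be comfortable inspecting pages in the browser.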

Jar used: jsoup-1.8.1.jar
Site home page: https://doutushe.com/portal/index/index/p/1
Browser: Chrome

1. Fetching the page source

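    // Imports needed by this method (place them at the top of the file)
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.net.URLConnection;

    /**
     * Fetches the source code of a web page.
     * @author Augustu
     * @param url      page address
     * @param encoding page encoding
     * @return the page source; empty or partial if reading fails
     */
    public static String getUrlResource(String url, String encoding) {
        // Holds the page source; StringBuilder avoids re-copying the text on every line
        StringBuilder htmlResource = new StringBuilder();
        try {
            // 1. Build the URL object
            URL theUrl = new URL(url);
            // 2. Open a connection to the site
            URLConnection urlConnection = theUrl.openConnection();
            // 3. Open a buffered reader on the page source;
            //    try-with-resources closes it even when reading fails
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(urlConnection.getInputStream(), encoding))) {
                // 4. Read the source line by line
                String temp;
                while ((temp = reader.readLine()) != null) {
                    // readLine() drops the line terminator, so put it back
                    htmlResource.append(temp).append('\n');
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return htmlResource.toString();
    }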
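
getUrlResource relies on the caller knowing the page encoding, and every call in this post passes "utf-8". For a page in some other encoding, one option is to read the charset from the Content-Type response header, which URLConnection.getContentType() returns as a string. The following is only a sketch of that idea and not part of my original code: the class name CharsetSketch and the helper charsetOf are hypothetical.

```java
import java.util.Locale;

final class CharsetSketch {
    // Hypothetical helper: pulls the charset out of a Content-Type value such as
    // "text/html; charset=utf-8". Returns the fallback when none is declared.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

With a helper like this, the encoding handed to InputStreamReader could come from charsetOf(urlConnection.getContentType(), "utf-8") instead of a fixed argument.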

2. Getting the URLs of all image groups on a page

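    // Imports needed by this method (place them at the top of the file)
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /**
     * Gets the URLs of all image groups on a page.
     * @author Augustu
     * @param context source code of a listing page
     * @return the URLs of all image groups on the page, separated by spaces
     */
    public static String findPictureUrl(String context) {
        String temp = "";       // the previous href, used to skip repeats
        String pictureUrl = ""; // all URLs collected so far
        // 1. Jsoup parses the page source into an HTML document, so its elements
        //    can be queried with Jsoup methods, much like in JavaScript
        Document document = Jsoup.parse(context);
        // 2. In the page source, each image group links to another URL
        //    through an <a> tag whose class is "link-2"
        Elements groupUrl = document.getElementsByClass("link-2");
        // 3. Walk through the <a> tags and read each href
        for (Element ele : groupUrl) {
            // Every URL showed up twice in a row and I never found out why,
            // so consecutive repeats are skipped for now.
            // Strings must be compared with equals(); == only compares references.
            if (ele.attr("href").equals(temp)) {
                continue;
            }
            temp = ele.attr("href");
            // 4. Append each URL to one String, separated by spaces for splitting later.
            // I first tried "|" as the separator, but split("|") broke the text into
            // single characters: split() takes a regex, where | means alternation,
            // so it would have to be escaped as "\\|".
            pictureUrl += "https://doutushe.com" + ele.attr("href") + " ";
        }
        return pictureUrl;
    }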
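
The "|" problem mentioned in the comments comes from String.split treating its argument as a regular expression. A bare | is the alternation operator, so the pattern matches the empty string between characters. A minimal sketch of one way around it, using Pattern.quote from the standard library; the class name SplitSketch and the helper splitLiteral are hypothetical names for illustration.

```java
import java.util.regex.Pattern;

final class SplitSketch {
    // Splits on a literal delimiter. Pattern.quote() wraps the delimiter so that
    // regex metacharacters such as "|" are matched as plain text.
    static String[] splitLiteral(String text, String delimiter) {
        return text.split(Pattern.quote(delimiter));
    }
}
```

Calling splitLiteral(pictureUrl, "|") would then return one array entry per URL, the same way split(" ") does for the space-separated string used in this post.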

3. Downloading a single image

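    // Imports needed by this method (place them at the top of the file)
    import java.io.BufferedOutputStream;
    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.jsoup.Connection;
    import org.jsoup.Jsoup;

    /**
     * Downloads a single image.
     * @param picturl  image address
     * @param filePath download directory
     * @param fileName name to save the file under
     */
    public static void downPicture(String picturl, String filePath, String fileName) {
        File dir = new File(filePath); // directory the file is saved in
        try {
            // 1. Connect with Jsoup and get the response. ignoreContentType(true) is
            //    required because the target is an image; without it Jsoup throws an
            //    error (it only accepts text types by default). jsoup also limits the
            //    body size by default; maxBodySize(0) removes the limit so large GIFs
            //    are not cut off.
            Connection.Response response = Jsoup.connect(picturl)
                    .ignoreContentType(true)
                    .maxBodySize(0)
                    .execute();
            // 2. Read the response body as bytes
            byte[] img = response.bodyAsBytes();
            // 3. Create the directory if it is missing; mkdirs() also creates parents
            if (!dir.exists()) {
                dir.mkdirs();
            }
            // File(parent, child) inserts the platform's path separator
            File file = new File(dir, fileName);
            // 4. Write to the local file; try-with-resources flushes and closes the streams
            try (BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream(file))) {
                bos.write(img);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }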
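
The write step can also be tried on its own, without any network access. Below is a rough sketch using java.nio.file from the standard library, which creates missing parent directories and closes the file itself. It is an alternative to the stream code in this post, not what I actually ran; the class name SaveSketch and the helper save are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class SaveSketch {
    // Hypothetical helper: writes data to dir/fileName and returns the file's path.
    // The directory and any missing parents are created first.
    static Path save(byte[] data, String dir, String fileName) {
        try {
            Path folder = Paths.get(dir);
            Files.createDirectories(folder);
            // Files.write creates the file or overwrites an existing one
            return Files.write(folder.resolve(fileName), data);
        } catch (IOException e) {
            // Rethrown unchecked so callers can decide whether to skip or stop
            throw new UncheckedIOException(e);
        }
    }
}
```

Inside downPicture, the byte array from response.bodyAsBytes() could be passed straight to such a helper.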

4. Downloading all images

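    // Imports needed by this method (place them at the top of the file)
    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    /**
     * Downloads all images.
     * @author Augustu
     * @param pictureUrl   URLs of the image groups, separated by spaces
     * @param downLoadPath root directory for the downloads
     */
    public static void downallPicture(String pictureUrl, String downLoadPath) {
        String[] pictureUrlArry = pictureUrl.split(" "); // URLs of the image groups
        for (String groupUrl : pictureUrlArry) {
            // split() yields one empty entry when no URL was found; skip it
            if (groupUrl.isEmpty()) {
                continue;
            }
            // Fetch and parse the page of each image group
            String pictureHtml = getUrlResource(groupUrl, "utf-8");
            Document document = Jsoup.parse(pictureHtml);
            // The group's category name is the first child of the first <blockquote>
            Element quote = document.getElementsByTag("blockquote").first();
            if (quote == null || quote.children().isEmpty()) {
                continue; // the page failed to load or has a different layout
            }
            String dir = quote.child(0).text();
            // Elements with class "lazy" hold the image URLs
            Elements elements = document.getElementsByClass("lazy");
            for (Element ele : elements) {
                // URL of each image
                String picturl = ele.attr("data-original");
                // The page source shows one extra address among the images,
                // /themes/doutushe/Public/assets/images/doutushe-erweima.jpg; skip it
                if (picturl.equals("/themes/doutushe/Public/assets/images/doutushe-erweima.jpg")) {
                    continue;
                }
                // Name of each image; don't forget the extension
                String pictureName = ele.attr("title") + ".gif";
                // Download the image into a folder named after the group
                downPicture(picturl, downLoadPath + File.separator + dir, pictureName);
            }
        }
    }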
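
The loop above names every file after the element's title attribute plus ".gif". I did not check whether a title can contain characters that Windows rejects in file names, or whether every image really is a GIF, so treat the following as a sketch under those assumptions. The class FileNameSketch and the helpers sanitize and extensionOf are hypothetical names, not part of the original code.

```java
final class FileNameSketch {
    // Replaces the characters Windows forbids in file names (\ / : * ? " < > |)
    // with '_'. Falls back to "unnamed" when nothing usable is left.
    static String sanitize(String title) {
        String cleaned = title.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
        return cleaned.isEmpty() ? "unnamed" : cleaned;
    }

    // Returns the extension (including the dot) of the last path segment of the URL,
    // ignoring any query string, or the fallback when the segment has no extension.
    static String extensionOf(String url, String fallback) {
        String path = url;
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot > slash && dot < path.length() - 1) {
            return path.substring(dot);
        }
        return fallback;
    }
}
```

The file name could then be built as sanitize(ele.attr("title")) + extensionOf(picturl, ".gif"), which keeps ".gif" as the default when the URL shows no extension.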

5. The main method

    public static void main(String[] args) {
        // From the page source, the site has 28 listing pages in total
        for (int i = 1; i <= 28; i++) {
            // Fetch each listing page
            String context = getUrlResource("https://doutushe.com/portal/index/index/p/" + i, "utf-8");
            // Get the URLs of all image groups on that page
            String pictureUrl = findPictureUrl(context);
            // Download every group under the local target directory
            downallPicture(pictureUrl, "E:\\image\\表情包");
        }
    }

