Fetching the Bing Homepage Daily Image with Java Regular Expressions

Original

2020-06-28 10:52

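Having just learned regular expressions in Java, I wanted to build something useful with them. I had been meaning to change my desktop wallpaper, and the Bing homepage image, which changes every day, looks good, so I decided to grab it.
The first step is to analyze the structure of the homepage. The Bing homepage image cannot be picked directly with the small element-selection arrow, but under the Img filter of the Network tab we can find where the image lives: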

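Copy the link and search for it in the Elements panel, and you can see that it sits inside a JS script. At that point it is fairly clear how to proceed.

Pulled out on its own, the fragment containing the link looks like this: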

g_img={url: "/az/hprichbg/rb/LoxodontaAfricana_ZH-CN10434704249_1920x1080.jpg"}
Extract the link from this fragment and prepend http://cn.bing.com to get the image URL we need. Written as a Java string literal, the regular expression is:

"g_img=\\{url: \"([\\w_\\-/]+?\\.jpg)\""
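To see what the capture group returns, here is a minimal sketch that runs the same pattern against the sample fragment shown earlier. The class name BingImageRegexDemo and the helper extractPath are hypothetical names made up for illustration, and the sample string is hand-written rather than fetched from Bing.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names here are illustrative only.
class BingImageRegexDemo {
    // Same pattern as in the article: group 1 captures the path
    // between the quotes that follow "g_img={url: ".
    private static final Pattern IMG_PATTERN =
            Pattern.compile("g_img=\\{url: \"([\\w_\\-/]+?\\.jpg)\"");

    // Returns the captured relative path, or an empty Optional if the fragment is absent.
    static Optional<String> extractPath(String html) {
        Matcher matcher = IMG_PATTERN.matcher(html);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hand-written sample based on the fragment quoted in the article.
        String sample = "g_img={url: \"/az/hprichbg/rb/"
                + "LoxodontaAfricana_ZH-CN10434704249_1920x1080.jpg\"}";
        // Prepend the host to turn the relative path into a full URL.
        System.out.println(extractPath(sample)
                .map(path -> "http://cn.bing.com" + path)
                .orElse("no match"));
    }
}
```

Returning an Optional instead of throwing keeps the "not found" case explicit for the caller; that is a design choice of this sketch, not something the original program does.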

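At first I included the trailing "}" in the pattern, but then the link was not found, so I settled on this version. Once the link is found, we can read the image's binary content and write it to a file from Java. Along the way I picked up some Java file handling: how to check whether a file already exists, how to create a new file, and how to write binary data.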
The full code is below. It is not long, only a few dozen lines.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.FileAlreadyExistsException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GetBingPicture {
    public static void main(String[] args) throws Exception {
        GetBingPicture getBingPicture = new GetBingPicture();
        String home = "http://cn.bing.com";
        // Get the relative image path from the homepage
        String url = getBingPicture.GetUrl(home);
        // Save the image
        getBingPicture.SavePicture(home + url);
    }

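    private String GetUrl(String home_url) throws Exception {
        StringBuilder builder = new StringBuilder();
        // Read the page into a string; try-with-resources closes the stream even on error
        try (InputStream is = new URL(home_url).openStream()) {
            byte[] buff = new byte[1024];
            int len;
            while ((len = is.read(buff, 0, buff.length)) != -1) {
                // Decode only the bytes actually read; decoding the whole buffer
                // would append stale bytes left over from an earlier read.
                // A multi-byte character split across two reads may be garbled,
                // but the ASCII link we are looking for is not affected.
                builder.append(new String(buff, 0, len, "UTF-8"));
            }
        }
        // Run the regex match
        Matcher matcher = Pattern.compile("g_img=\\{url: \"([\\w_\\-/]+?\\.jpg)\"").matcher(builder.toString());
        // Look for the link
        if (matcher.find()) {
            System.out.println("Find the url: " + matcher.group(1));
            return matcher.group(1);
        } else {
            throw new Exception("URL not found");
        }
    }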
    // Save the image at the given URL into the current directory
    private void SavePicture(String url) throws IOException {
        // Derive the file name from the link: the text between the last "/"
        // and the first "_" after it (searching from start, not from the scheme)
        int start = url.lastIndexOf("/") + 1;
        int end = url.indexOf("_", start);
        // Build the name; substring is start-inclusive, end-exclusive
        String name = url.substring(start, end) + ".jpg";
        File file = new File(name);
        // Check whether the file already exists before opening the connection
        if (file.exists()) {
            throw new FileAlreadyExistsException(name + " already exists");
        }
        // FileOutputStream creates the file; try-with-resources closes both
        // streams even if the download fails partway through
        try (InputStream is = new URL(url).openStream();
             FileOutputStream fileOutputStream = new FileOutputStream(file)) {
            byte[] buff = new byte[1024];
            int len;
            while ((len = is.read(buff, 0, buff.length)) != -1) {
                fileOutputStream.write(buff, 0, len);
            }
        }
        System.out.println(name + " was downloaded successfully");
    }
}
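The exists-check, file creation, and write loop above can also be handed to the standard library. The following is a rough sketch, not the approach used in this post: it relies on java.nio.file.Files.copy, which throws FileAlreadyExistsException on its own when the target exists and no REPLACE_EXISTING option is passed. The class name SaveStreamDemo, the helper save, and the fake byte array standing in for a downloaded image are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class; the names here are illustrative only.
class SaveStreamDemo {
    // Copies the whole stream into target and returns the number of bytes written.
    // Without StandardCopyOption.REPLACE_EXISTING, Files.copy refuses to
    // overwrite and throws FileAlreadyExistsException instead.
    static long save(InputStream in, Path target) throws IOException {
        try (InputStream source = in) {
            return Files.copy(source, target);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory and fake bytes stand in for the real download.
        Path dir = Files.createTempDirectory("bing-demo");
        Path target = dir.resolve("demo.jpg");
        byte[] fakeImage = {1, 2, 3, 4};

        System.out.println(save(new ByteArrayInputStream(fakeImage), target) + " bytes written");
        try {
            // The second attempt hits the existing file.
            save(new ByteArrayInputStream(fakeImage), target);
        } catch (FileAlreadyExistsException e) {
            System.out.println("Already exists: " + e.getFile());
        }
    }
}
```

In the real program the InputStream would come from new URL(url).openStream(), as in SavePicture.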

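Since the whole page contains only one occurrence of g_img={url: , I suspect a plain string search would work just as well. I implemented that approach in Python and the code came out much shorter. Here is the link: Python string-search implementation.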
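For comparison, the plain string-search idea might look like this in Java. This is only a sketch under the same assumption the paragraph makes, namely that the prefix g_img={url: " appears exactly once in the page; the class name BingImageIndexOfDemo and the helper extractPath are hypothetical, and this is not the Python version linked above.

```java
// Hypothetical demo class; the names here are illustrative only.
class BingImageIndexOfDemo {
    // The literal text that precedes the image path in the page source.
    private static final String PREFIX = "g_img={url: \"";

    // Returns the text between PREFIX and the next double quote,
    // or null if either delimiter is missing.
    static String extractPath(String html) {
        int start = html.indexOf(PREFIX);
        if (start < 0) {
            return null;
        }
        start += PREFIX.length();
        int end = html.indexOf('"', start);
        return end < 0 ? null : html.substring(start, end);
    }

    public static void main(String[] args) {
        // Hand-written sample based on the fragment quoted in the article.
        String sample = "g_img={url: \"/az/hprichbg/rb/"
                + "LoxodontaAfricana_ZH-CN10434704249_1920x1080.jpg\"}";
        System.out.println("http://cn.bing.com" + extractPath(sample));
    }
}
```

Unlike the regex, this version does not check that the captured text ends in .jpg or contains only path characters, so it trades validation for brevity.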
