Python | Scraping Toutiao Street-Snap Photos by Analyzing Ajax Requests

WeChat official account: 一個優秀的廢人
If you have questions or suggestions, leave a message through the account and I will do my best to help.

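Preface

It has been a while. I'm 阿狗 (A Gou). My day job is Java web development, and I'm learning Python simply because I think it is where the wind is blowing: I'm no pig, but I'd like to fly too. So I'm studying it in my spare time. Today's topic is scraping street-snap photo sets from Toutiao (今日頭條) by analyzing its Ajax requests.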

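Environment

This walkthrough uses Windows 10 + Python 3.6 + PyCharm. It also relies on four libraries: requests, urllib, hashlib, and multiprocessing. requests sends the HTTP requests; urlencode from urllib.parse builds the query string; md5 from hashlib produces a unique name for each image so that duplicates do not cause errors; and Pool from multiprocessing.pool runs a pool of worker processes (processes, not threads) to speed up the crawl.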

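Approach

Before writing a crawler, the first step is to pin down exactly what we want from the site. Here the target is the first twenty pages of search results: for every item on every page, the photo set and its title.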

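Start by opening the Toutiao home page, https://www.toutiao.com/. Type 街拍 (street snap) into the search box, then open the developer tools (press F12 on the page) to inspect the requests the page makes.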

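In the developer tools, the All tab under Network lists so many requests that it is hard to tell which one is the Ajax request behind the results. Switch to the XHR tab, which shows only Ajax requests. Then open a request and check whether its parameters and response match what the page displays.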

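Click the request and switch to the Preview tab, where Chrome's developer tools show the response as formatted JSON. The response has a data field that holds every photo set on the current page. Expanding the first entry shows that its title field matches the text rendered on the page, and that each entry has an image_list field containing all of its images. As the screenshot shows, title is the heading of the first item and image_list is its photo set. Next, expand image_list.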

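As the screenshot shows, each url inside the expanded image_list points to one of the images we want, one url per image in the set. So after requesting the search results we still have to collect the urls in image_list and request each of them to get the actual pictures. What, then, are the request parameters?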
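
Before turning to the parameters, the response structure described so far can be pictured with a small sketch. The payload below is a hypothetical, heavily trimmed stand-in, not a captured response: the titles and the example.com URLs are invented, and the protocol-relative "//" form is an assumption based on the code later in this article, which prepends 'http:' to each url. The helper name image_urls_by_title is also made up for illustration.

```python
# Hypothetical, trimmed payload shaped like the response described above:
# a "data" list whose entries carry a "title" and an "image_list".
sample_response = {
    "data": [
        {
            "title": "Street style, group A",
            "image_list": [
                {"url": "//example.com/a1.jpg"},
                {"url": "//example.com/a2.jpg"},
            ],
        },
        {
            "title": "Street style, group B",
            "image_list": [{"url": "//example.com/b1.jpg"}],
        },
    ]
}


def image_urls_by_title(payload):
    """Map each group's title to the list of image URLs it contains."""
    return {
        item["title"]: [image["url"] for image in item.get("image_list", [])]
        for item in payload.get("data", [])
    }


groups_by_title = image_urls_by_title(sample_response)
for title, urls in groups_by_title.items():
    print(title, len(urls))
```

Each key corresponds to one folder the crawler will eventually create, and each list to the files saved inside it.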

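Click the request https://www.toutiao.com/search_content/?offset=0&format=json&keyword=%E8%A1%97%E6%8B%8D&autoload=true&count=20&cur_tab=1&from=search_tab and switch to the Headers tab, where the request parameters are listed as shown in the screenshot. It is a GET request with the parameters offset, format, keyword, autoload, count, cur_tab, and from. Only offset changes as you scroll: each load adds 20 to it, so every scroll-triggered request brings in 20 more photo sets. That settles it: we have found the content to scrape and the pattern in the request parameters. Now for the hands-on part.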
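
The paging rule can be checked without touching the network. The following is a minimal sketch that only builds URLs; build_search_url, BASE_URL, and PAGE_SIZE are names chosen here for illustration, and the parameter values are copied from the request shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://www.toutiao.com/search_content/?"
PAGE_SIZE = 20  # each scroll loads 20 more photo sets


def build_search_url(page_index, keyword="街拍"):
    """Return the search URL for a zero-based page index."""
    params = {
        "offset": page_index * PAGE_SIZE,  # the only value that changes
        "format": "json",
        "keyword": keyword,
        "autoload": "true",
        "count": PAGE_SIZE,
        "cur_tab": 1,
        "from": "search_tab",
    }
    # urlencode percent-encodes the keyword as UTF-8
    return BASE_URL + urlencode(params)


urls = [build_search_url(page) for page in range(3)]
for url in urls:
    # Pull the offset back out of the query string to confirm the step of 20
    print(parse_qs(urlsplit(url).query)["offset"][0], url)
```

The URL for page index 0 comes out identical to the one observed in the developer tools, which is a quick way to confirm that the parameters were copied correctly.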

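Loading a Single Ajax Request

Implement a get_page function that loads a single Ajax request. Because offset is the only value that changes, it is passed in as a parameter. The code is as follows: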

def get_page(offset):
    # Build the query parameters
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
    }
    # Headers that make the request look like it comes from a browser
    headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/68.0.3440.106 Safari/537.36'
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

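Be sure to include the headers argument. Without it, the server may treat the request as illegitimate and respond with an error. You can copy the header values straight from the Headers pane of the request you are inspecting; a quick web search will fill in any gaps.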
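
The reason the header matters is that, left alone, requests announces itself with a User-Agent of the form python-requests/<version>, which a server can filter trivially. As a rough, network-free illustration of what attaching the header does, the sketch below builds a request object with the standard library's urllib.request instead of requests (a deliberate swap so that nothing is sent and no third-party package is needed) and reads the header back.

```python
from urllib.request import Request

# Same browser User-Agent string as in get_page above
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/68.0.3440.106 Safari/537.36"
)

# Constructing a Request does not open a connection
request = Request(
    "https://www.toutiao.com/search_content/?offset=0",
    headers={"User-Agent": USER_AGENT},
)

# urllib stores header names via str.capitalize(), hence "User-agent"
print(request.has_header("User-agent"))
print(request.get_header("User-agent"))
```

With requests, the equivalent step is simply the headers=headers argument already used in get_page.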

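Parsing the Response

Next, implement a parsing function that extracts every image link from each item's image_list field and returns it together with the title of the set it belongs to. A generator fits well here (if generators are new to you, see my earlier Python basics articles or the 菜鳥教程 (Runoob) tutorial). The code is as follows: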

def get_images(json):
    # get_page returns None on failure, so guard against that first
    if json and json.get('data'):
        for item in json.get('data'):
            title = item.get('title', "nasus")
            images = item.get('image_list', [])
            for image in images:
                yield{
                  'image': 'http:'+image.get('url'),
                  'title': title
                }

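The [] on the fifth line deserves a note. It is the default returned when an item has no image_list key, so the inner loop always receives something iterable instead of raising an error.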
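
How that default behaves is easy to verify in isolation. The sketch below uses two made-up items; safe_images is a hypothetical helper, not part of the crawler, showing one way to also cover the case where the key is present but holds None, which the default alone does not handle.

```python
# Two made-up items: one without the key, one with an explicit None
missing = {"title": "no image_list key"}
explicit_none = {"title": "image_list is null", "image_list": None}

# The default is used only when the key is absent
print(missing.get("image_list", []))

# The key exists here, so get() returns the stored None, not the default
print(explicit_none.get("image_list", []))


def safe_images(item):
    """Return a list in both cases by falling back on any falsy value."""
    return item.get("image_list") or []


for item in (missing, explicit_none):
    for image in safe_images(item):  # never raises; both loops are empty
        print(image)
```

Whether the extra `or []` is needed depends on whether the API ever returns a null image_list, which the article does not say.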

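Saving the Images

Implement a function that saves an image. Its item argument is one of the dictionaries yielded by get_images. The function creates a folder named after the item's title, requests the image link, and writes the binary response to a file. The file name is the MD5 digest of the image content, which prevents duplicates. The code is as follows: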

def save_image(item):
    # exist_ok avoids an error when another worker creates the folder first
    os.makedirs(item.get('title'), exist_ok=True)
    try:
        # Headers that make the request look like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/68.0.3440.106 Safari/537.36'
        }
        response = requests.get(item.get('image'), headers=headers)
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Download', file_path)
    except requests.ConnectionError:
        print('Fail to Save Image')
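
The naming scheme in save_image can be tried on its own. This is a minimal sketch with invented byte strings standing in for downloaded images; build_file_path is a hypothetical helper that mirrors the format() call above but joins the parts with os.path.join.

```python
import os
from hashlib import md5


def build_file_path(title, content, extension="jpg"):
    """Name the file after the MD5 digest of its bytes, inside the title folder."""
    digest = md5(content).hexdigest()  # 32 hexadecimal characters
    return os.path.join(title, "{0}.{1}".format(digest, extension))


first = build_file_path("demo", b"same bytes")
second = build_file_path("demo", b"same bytes")
third = build_file_path("demo", b"other bytes")

print(first == second)  # identical content maps to the same path
print(first == third)   # different content gets a different name
```

Because the name depends only on the bytes, downloading the same picture twice lands on the same path, which is what lets save_image skip files that already exist.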

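Entry Point

All that remains is to build a list of offsets and let a pool of worker processes iterate over it: each worker extracts the image links for its offset, requests them, and saves the files.
The code is as follows: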

def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    # Pages 1..20 correspond to offsets 0, 20, ..., 380
    groups = [(x - 1) * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()
    # for i in range(GROUP_END):
    #     main(i*GROUP_END)

That completes the program. After running it, each street-snap photo set is saved in its own folder named after the title.
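
Since downloading is I/O-bound, the same fan-out also works with threads. The sketch below is an alternative, not the code used in this article: it swaps Pool for ThreadPool from the same multiprocessing.pool module, and fake_main is a stand-in that returns a label instead of fetching anything, so it runs without network access. The worker count of 4 is an arbitrary choice.

```python
from multiprocessing.pool import ThreadPool

PAGE_SIZE = 20   # results per request
PAGE_COUNT = 20  # first twenty pages


def fake_main(offset):
    """Stand-in for main(offset): returns a label instead of downloading."""
    return "page at offset {0}".format(offset)


# Offsets 0, 20, ..., 380: one entry per page, starting at the first page
offsets = [page * PAGE_SIZE for page in range(PAGE_COUNT)]

# map() hands the offsets to the workers and returns results in input order
with ThreadPool(processes=4) as pool:
    results = pool.map(fake_main, offsets)

print(offsets[0], offsets[-1], len(results))
```

Replacing fake_main with the real main gives a thread-based variant of the crawler; whether it is faster than the process pool depends on the machine and the network.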

Finally, here is the complete code:

import requests
import os
from urllib.parse import urlencode
from hashlib import md5
from multiprocessing.pool import Pool

def get_page(offset):
    # Build the query parameters
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
    }
    # Headers that make the request look like it comes from a browser
    headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/68.0.3440.106 Safari/537.36'
    }
    url = 'https://www.toutiao.com/search_content/?' + urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError:
        return None

def get_images(json):
    # get_page returns None on failure, so guard against that first
    if json and json.get('data'):
        for item in json.get('data'):
            title = item.get('title', "nasus")
            images = item.get('image_list', [])
            for image in images:
                yield{
                  'image': 'http:'+image.get('url'),
                  'title': title
                }

def save_image(item):
    # exist_ok avoids an error when another worker creates the folder first
    os.makedirs(item.get('title'), exist_ok=True)
    try:
        # Headers that make the request look like it comes from a browser
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                          'AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/68.0.3440.106 Safari/537.36'
        }
        response = requests.get(item.get('image'), headers=headers)
        if response.status_code == 200:
            file_path = '{0}/{1}.{2}'.format(item.get('title'), md5(response.content).hexdigest(), 'jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    f.write(response.content)
            else:
                print('Already Download', file_path)
    except requests.ConnectionError:
        print('Fail to Save Image')

def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)

GROUP_START = 1
GROUP_END = 20

if __name__ == '__main__':
    pool = Pool()
    # Pages 1..20 correspond to offsets 0, 20, ..., 380
    groups = [(x - 1) * 20 for x in range(GROUP_START, GROUP_END + 1)]
    pool.map(main, groups)
    pool.close()
    pool.join()
    # for i in range(GROUP_END):
    #     main(i*GROUP_END)

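This article gave a first look at analyzing Ajax requests, simulating Ajax pagination, and downloading images. The code is very simple, but I still recommend that beginners work through it by hand. However simple it looks, do not assume that reading it is the same as getting it to run: you may hit pitfalls you did not expect, and the so-called experts got where they are by falling into pitfalls and climbing out of them over and over.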

