Scraping Dynamic Data with a Python Crawler

Original

2018-08-28 13:17

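Python's requests library does not execute JavaScript, so it only sees the HTML the server returns and cannot fetch content that a page loads dynamically. By analyzing the ajax requests a page makes, however, you can still scrape part of that dynamic content. This article uses the animal pictures on Baidu Images as the target and explains how to scrape content that is rendered dynamically by JavaScript.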

1. The first step is to capture the traffic. I used the Charles packet-capture tool. The Baidu animal pictures page is at
url="https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&oq=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&rsp=-1"

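So in the capture tool, look first at the requests under https://image.baidu.com. Under search there is a URL identical to the address we visited, but its response contains JavaScript code. Nothing useful can be read from that response, so we keep looking.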

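2. Scrolling down the animal pictures home page to see more results makes more requests appear in the capture. The capture shows that each scroll returns a batch of JSON data. Inside data there are fields such as thumbURL, whose value is a URL. A reasonable guess is that this URL is the link to an image.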
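The JSON structure described above can be explored offline before any real request is made. The following is a minimal sketch, assuming the response is an object with a "data" list whose entries may carry a thumbURL. SAMPLE_RESPONSE, its example.com URLs, and the helper name extract_thumb_urls are made up for illustration; the empty trailing object is there only to exercise the guard.

```python
import json

# Hypothetical, heavily trimmed response body; a real payload has many more fields.
SAMPLE_RESPONSE = """
{
  "data": [
    {"thumbURL": "https://example.com/a.jpg"},
    {"thumbURL": "https://example.com/b.jpg"},
    {}
  ]
}
"""


def extract_thumb_urls(body):
    """Return every thumbURL found in the "data" list of a JSON response body."""
    payload = json.loads(body)
    urls = []
    for item in payload.get("data", []):
        # Skip entries that are not objects or that carry no thumbURL.
        thumb = item.get("thumbURL") if isinstance(item, dict) else None
        if thumb:
            urls.append(thumb)
    return urls


print(extract_thumb_urls(SAMPLE_RESPONSE))
```

Using .get instead of direct indexing means an entry without a thumbURL is skipped rather than raising KeyError.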

3. Open a browser tab and visit thumbURL="https://ss1.bdstatic.com/70cFvXSh_Q1YnxGkpoWK1HF6hhy/it/u=1968180540,4118301545&fm=27&gp=0.jpg". It turns out to be the picture below (with a rather adorable dazed look).

We then found the same picture on the main page:

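4. From the analysis above, we know that requesting
URL="https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf8&oe=utf8&adpicid=&st=-1&z=&ic=0&word=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&pn=30&rn=30&gsm=1e&1531038037275=" returns the image links. If we can construct this URL, we can obtain the download addresses of these images. Before constructing it, though, there is one more thing to do: open a different browser and visit this address. The purpose is to check whether the address can be accessed directly, or whether it depends on cookies or something else. I had opened the Baidu animal pictures page in Sogou Browser, so I now visited this address in Firefox. Fortunately, we got all the data (the text is garbled here, but the data is the same). From this we can infer that the URL is public.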
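Before rebuilding the request URL it can help to split it into its individual parameters. The sketch below uses only the standard library; CAPTURED_URL is a shortened copy of the URL above (most of the constant parameters are left out to keep the listing readable), and query_params is a helper name invented here.

```python
from urllib.parse import parse_qsl, urlsplit

# Shortened copy of the captured request URL; most constant parameters are omitted.
CAPTURED_URL = (
    "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj"
    "&word=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87"
    "&pn=30&rn=30&gsm=1e&1531038037275="
)


def query_params(url):
    """Return the query string of url as an ordered list of (name, value) pairs."""
    # keep_blank_values=True keeps parameters such as "1531038037275=" that have no value.
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


for name, value in query_params(CAPTURED_URL):
    print(f"{name} = {value!r}")
```

Listed this way, the percent-encoded word value is decoded back to the search keyword, and the trailing number shows up as a parameter name with an empty value.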

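5. What remains is to construct this URL and obtain the image addresses. The parameters in the URL's query string make up the request, and from one request to the next only pn, rn, gsm and the trailing number change. By varying these four values one at a time, I found the following: pn is the number of images per ajax request, which defaults to 30, so I left it alone; rn is the total number of images that have appeared so far; gsm is rn in hexadecimal; and for the trailing number I use int(time.time()*1000). Construct the URL this way, parse the JSON it returns, and you get the image links.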
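The construction just described can be sketched without a long format string. This is only an illustration under the article's own reading of pn, rn and gsm: build_request_url and its arguments are hypothetical names, the constant parameters are copied from the captured URL, and the varying ones are appended at the end, so the parameter order differs from the capture. Whether the server cares about that order was not tested.

```python
import time
from urllib.parse import urlencode

BASE_URL = "https://image.baidu.com/search/acjson"

# Constant parameters copied from the captured request.
CONSTANT_QUERY = (
    "tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&cl=2&lm=-1"
    "&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&s=&se=&tab=&width=&height="
    "&face=0&istype=2&qc=&nc=1&fr="
)


def build_request_url(keyword, page, per_request=30, timestamp_ms=None):
    """Build the ajax URL for a 1-based page number."""
    if timestamp_ms is None:
        # Millisecond timestamp, like the 13-digit number in the captured URL.
        timestamp_ms = int(time.time() * 1000)
    shown = per_request * page  # rn in the article's reading: images shown so far
    varying = [
        ("queryWord", keyword),
        ("word", keyword),
        ("pn", per_request),
        ("rn", shown),
        ("gsm", format(shown, "x")),  # hexadecimal without the "0x" prefix
        (str(timestamp_ms), ""),      # trailing number is a name with an empty value
    ]
    # urlencode percent-encodes the keyword as UTF-8.
    return BASE_URL + "?" + CONSTANT_QUERY + "&" + urlencode(varying)


print(build_request_url("动物图片", 1, timestamp_ms=1531038037275))
```

Passing timestamp_ms explicitly makes the output reproducible, which is convenient when comparing a generated URL against a captured one.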

Here is the complete code:

import requests
import re
import time
import json
import os
from contextlib import closing
from datetime import datetime

session = requests.Session()
requests.packages.urllib3.disable_warnings()
headers = {"User-Agent":"Mozilla/5.0 (Windows NT 10.0; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0"}


def get_img_url():
    url_dict = []
    per_page_img_num = 30
    for i in range(1,2):
        all_img_num = per_page_img_num * i
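        # gsm is all_img_num in hexadecimal, without the "0x" prefix
        all_img_num_hex = format(all_img_num, "x")
        # millisecond timestamp, matching the 13-digit number in the captured URL
        time_s = int(time.time() * 1000)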
        url = "https://image.baidu.com/search/acjson?tn=resultjson_com&ipn=rj&ct=201326592&is=&fp=result&queryWord=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=-1&z=&ic=0&word=%E5%8A%A8%E7%89%A9%E5%9B%BE%E7%89%87&s=&se=&tab=&width=&height=&face=0&istype=2&qc=&nc=1&fr=&pn={}&rn={}&gsm={}&{}=".format(per_page_img_num,all_img_num,all_img_num_hex,time_s)

        response = session.get(url,headers=headers,verify=False)
        json_data = json.loads(response.text)
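        # take at most the first 20 entries, skipping any that have no thumbURL
        for item in json_data["data"][:20]:
            thumb_url = item.get("thumbURL")
            if thumb_url:
                url_dict.append(thumb_url)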
    return url_dict

def get_img(url_dict):
    file_path = "animal_images"
    chunk_size = 1024
    img_id = 0
    # create the output directory once, before downloading anything
    os.makedirs(file_path, exist_ok=True)
    for url in url_dict:
        download_img_url = url

        with closing(session.get(download_img_url, headers=headers, verify=False, stream=True)) as response:
            img_id += 1
            file = '{}/{}.jpg'.format(file_path, img_id)
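            if os.path.exists(file):
                print("Image {} already exists, skipping this download".format(file))
            else:
                try:
                    start_time = datetime.now()
                    # plain write mode is enough: the file does not exist at this point
                    with open(file, 'wb') as f:
                        for chunk in response.iter_content(chunk_size=chunk_size):
                            f.write(chunk)
                    end_time = datetime.now()
                    sec = (end_time - start_time).total_seconds()
                    print("Downloaded image {} in {:.2f}s".format(file, sec))
                except Exception:
                    # remove the partial file so the next run downloads it again
                    if os.path.exists(file):
                        os.remove(file)
                    print("Failed to download image {}".format(file))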


if __name__ == '__main__':

    url_dict = get_img_url()
    get_img(url_dict) 

Reference blog posts:
https://blog.csdn.net/qq_24076135/article/details/78077659
http://brucedone.com/archives/58
