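Python Web Scraping in Practice: Multithreaded Crawling of the Maoyan Movie Top 100

Preface

This walkthrough scrapes the Maoyan movie board using the requests library for HTTP requests and regular expressions to parse the HTML. The pages are fetched with multiple threads, and the results are serialized to JSON and saved to a file.

Target URL: http://maoyan.com/board/4
Fields scraped: rank, cover image, title, actors, release date, score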

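Analysis

URL analysis

Comparing the URLs of successive pages shows that they follow this pattern:

http://maoyan.com/board/4?offset=page*10

Here page is the page number, counted from 0, so the ten pages use the offsets 0, 10, ..., 90.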
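
The offset rule can be sketched as a small helper that builds all ten page URLs. The name `build_page_urls` and its parameters are illustrative assumptions and do not appear in the original code.

```python
BASE_URL = "http://maoyan.com/board/4"

def build_page_urls(pages=10, page_size=10):
    """Return the board URLs for pages 0..pages-1 (hypothetical helper)."""
    return [f"{BASE_URL}?offset={page * page_size}" for page in range(pages)]

urls = build_page_urls()
print(urls[0])   # http://maoyan.com/board/4?offset=0
print(urls[-1])  # http://maoyan.com/board/4?offset=90
```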

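HTML analysis

In the page source, each movie is wrapped in a single dd tag, and every field we want sits inside that tag. That leads to the following non-greedy regular expression, in which each .*? matches as little text as possible:

<dd>.*?>(\d+)</i>.*?data-src="(.*?)".*?"name"><a.*?>(.*?)</a>.*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>

Each pair of parentheses is a capture group that holds one of the fields we want to extract.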
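
To see why the quantifiers need to be non-greedy, here is a minimal sketch on a made-up fragment of score markup. The fragment is an assumption modeled on the pattern above, not a copy of the live page.

```python
import re

# Made-up fragment with two <i> elements, as in the score part of the pattern.
html = '<i class="integer">9.</i><i class="fraction">5</i>'

lazy = re.findall(r'<i.*?>(.*?)</i>', html)   # .*? stops at the first possible match
greedy = re.findall(r'<i.*>(.*)</i>', html)   # .* swallows as much as it can

print(lazy)    # ['9.', '5']
print(greedy)  # ['5']
```

With the greedy form, the first .* runs across the first element and only the last one is captured, which is why every gap in the movie pattern is written as .*?.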

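Code walkthrough

Import the Python libraries used below; the comments note what each one is for.

import threading  # multithreading
import requests, re, json  # HTTP requests, regular expressions, JSON serialization
from requests.exceptions import RequestException  # base exception for request errors
from multiprocessing import Process  # multiprocessing
from multiprocessing import Pool  # process pool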

Write a function that fetches the HTML of a page; the comments explain each step.

def get_one_page(url):  # url: address of the page to fetch
    try:
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36"
        }  # request headers; the User-Agent makes the request look like it comes from a browser
        response = requests.get(url, headers=headers)  # send the request
        with open('result.html', 'w', encoding='utf-8') as f:  # save the response body locally so the HTML is easy to inspect
            f.write(response.text)  # the with block closes the file on exit
        if response.status_code == 200:  # request succeeded: return the HTML
            return response.text
        return None  # any other status: return None
    except RequestException:  # catch request errors
        print("Request failed")
        return None

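Write a generator function that parses the HTML and yields each movie as a dictionary:

def parse_html(html):
    pattern = re.compile(
        r'<dd>.*?>(\d+)</i>.*?data-src="(.*?)".*?"name"><a.*?>(.*?)</a>.*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)  # compile the pattern; the raw string keeps \d intact and re.S lets . match newlines
    items = re.findall(pattern, html)  # one 7-tuple of captured groups per movie
    # print(items)
    for item in items:  # yield one dict per movie, which makes this function a generator
        yield {
            "index": item[0],
            "image": item[1],
            "title": item[2],
            "actor": item[3].strip()[3:],  # drop the leading 3-character label ("主演：")
            "time": item[4].strip()[5:],  # drop the leading 5-character label ("上映时间：")
            "score": item[5] + item[6]  # integer part plus fractional part
        }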
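
The parsing step can be tried offline. The sketch below runs the same pattern on a hand-written dd block; the sample markup, the example.com URL, and the name `parse_sample` are assumptions made for illustration, and the live page may differ.

```python
import re

# Hypothetical markup written to fit the pattern; not copied from the site.
SAMPLE_HTML = '''
<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" class="board-img" />
  <p class="name"><a href="/films/1" title="Sample Movie">Sample Movie</a></p>
  <p class="star">
      主演：Actor A,Actor B
  </p>
  <p class="releasetime">上映时间：1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">5</i></p>
</dd>
'''

PATTERN = re.compile(
    r'<dd>.*?>(\d+)</i>.*?data-src="(.*?)".*?"name"><a.*?>(.*?)</a>'
    r'.*?class="star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S)

def parse_sample(html):
    """Yield one dict per <dd> block, with the text labels sliced off."""
    for rank, image, title, star, release, integer, fraction in PATTERN.findall(html):
        yield {
            "index": rank,
            "image": image,
            "title": title,
            "actor": star.strip()[3:],    # removes the 3-character "主演：" label
            "time": release.strip()[5:],  # removes the 5-character "上映时间：" label
            "score": integer + fraction,
        }

movies = list(parse_sample(SAMPLE_HTML))
print(movies[0])
```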

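Write a function that saves the parsed data to a local file in serialized form:


def write_to_file(content):
    with open('maoyantop100.txt', 'a', encoding='utf-8') as f:  # append mode; the with block closes the file
        f.write(json.dumps(content, ensure_ascii=False) + '\n')  # ensure_ascii=False keeps Chinese text readable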
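
Because each call appends one JSON document followed by a newline, the file can be read back line by line. The following is a minimal sketch of that round trip; the helper names, the temporary file, and the sample records are assumptions, not part of the original project.

```python
import json
import os
import tempfile

def append_json_line(path, record):
    """Append one record as a single JSON line (same idea as write_to_file)."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')

def read_json_lines(path):
    """Parse every non-empty line of the file back into a Python object."""
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up records written to a throwaway file.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
append_json_line(path, {"index": "1", "title": "示例电影", "score": "9.5"})
append_json_line(path, {"index": "2", "title": "Sample Movie", "score": "9.0"})

records = read_json_lines(path)
print(len(records), records[0]["title"])
```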

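Define a main function that fetches and parses one page:

list = []  # collects one dict per movie; note that this name shadows the built-in list
def main(offset):
    url = "http://maoyan.com/board/4?offset=" + str(offset)
    html = get_one_page(url)
    if html is None:  # the request failed, so there is nothing to parse
        return

    for item in parse_html(html):
        list.append(item)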

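Finally, call the function from the script's entry point. The following snippets are alternative ways to do so:
- Crawl the ten pages with a plain for loop

# for-loop version
    # for i in range(10):
    #     main(i*10)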

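- Crawl the ten pages with multiple processes

# Process version: one process per page. Calling p.join() right after p.start() in the
# same loop would run the processes one at a time, so start them all first and join afterwards.
# Each child process appends to its own copy of list, so the parent's list stays empty
# unless the results are sent back explicitly (for example through a Queue or a Pool).
    # processes = [Process(target=main, args=(i*10,)) for i in range(10)]  # create 10 processes with a list comprehension
    # for p in processes:
    #     p.start()  # start all 10 processes
    # for p in processes:
    #     p.join()  # wait for every process to finish before the main process continues
    # print('master process finished...')
    #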

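- Crawl the ten pages with multiple processes and a process pool

# Process-pool version
    # def end(arg):  # callback run when a single task finishes
    #     print("processes finish")
    # pool = Pool(5)
    # for i in range(10):
    #     pool.apply(main, args=(i*10,))  # blocking call: the tasks run one after another
    #     # pool.apply_async(func=main, args=(i*10,), callback=end)  # non-blocking: tasks run in parallel; the callback runs in the main process when a task finishes
    # pool.close()  # close() must be called before join()
    # pool.join()  # wait for the worker processes to exit
    # print('main finished...')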

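- Crawl the ten pages with multiple threads

# Thread version: one thread per page, so all ten requests are in flight at once and the crawl finishes within seconds
if __name__ == '__main__':
    threads = [threading.Thread(target=main, args=(i*10,)) for i in range(10)]  # create 10 threads with a list comprehension
    for t in threads:  # start all 10 threads
        t.start()
    for t in threads:  # then wait for all of them; joining inside the start loop would run them one by one
        t.join()
    print(json.dumps(list, ensure_ascii=False))  # items arrive in completion order, not rank order
    write_to_file(list)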
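
If the pages should come back in rank order, one alternative is `concurrent.futures.ThreadPoolExecutor` from the standard library, a different tool from the `threading.Thread` objects used above. In this sketch `fake_fetch` is a hypothetical stand-in for `main(offset)` that sleeps instead of touching the network, so the data it returns is made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(offset):
    """Hypothetical stand-in for main(offset): pretend to download one page."""
    time.sleep(0.05)  # simulated network latency
    return [{"index": str(offset + n)} for n in range(1, 11)]

offsets = [i * 10 for i in range(10)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    # map() runs the calls concurrently but returns results in input order
    pages = list(pool.map(fake_fetch, offsets))
elapsed = time.perf_counter() - start

movies = [item for page in pages for item in page]  # flatten the pages
print(len(movies), movies[0]["index"], movies[-1]["index"])
print(f"elapsed: {elapsed:.2f}s")
```

Each worker returns its own list instead of appending to a shared one, so the combined result follows the order of `offsets` regardless of which request finishes first.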

Final result

Full code: https://github.com/domain9065/maoyantop100
