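A Beginner's Guide to Simple Web Scraping with Python Regular Expressions, with a Worked Example (Scraping the Maoyan Movies TOP Board)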

Original

2020-06-20 20:36

Build a simple little crawler using regular expressions

Common methods

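1. Import the required modules

import requests
# HTTP request library
import json
# standard-library JSON module (Flask is not needed for this)
from requests.exceptions import RequestException
# base class of the exceptions raised by requests
import re
# regular expression module
from multiprocessing import Pool
# process pool for running work in parallel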

2. Fetch the page

response = requests.get(url)
url: the link to scrape
requests.get(): sends a GET request and returns a Response object

3. if response.status_code == 200:

# check the status code
response.status_code: the HTTP status code of the response
200: OK, the request succeeded

4. response.text: the page content, decoded to a string

For example: html = requests.get(url).text

5. except RequestException: catch any error raised while making the request

try:
    ...
except RequestException:
    ...
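
Steps 2 to 5 fit naturally into one small helper. The following is a minimal sketch rather than the author's code: the name `fetch_html`, the `timeout` value, and the User-Agent header are assumptions added for illustration.

```python
import requests
from requests.exceptions import RequestException


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if anything goes wrong."""
    # Assumption: a browser-like User-Agent, since some sites refuse
    # requests that carry the library's default one.
    headers = {"User-Agent": "Mozilla/5.0"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
    except RequestException:
        # Malformed URLs, connection errors, and timeouts all end up here.
        return None
    if response.status_code == 200:
        return response.text
    return None
```

Returning None for every failure keeps the caller simple: it only has to check for None before parsing.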

6. pat = re.compile(): compile a regular expression

# the basics of regular expressions
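
Two details matter when compiling patterns for HTML: the non-greedy quantifier `.*?` and the `re.S` flag, which lets `.` match newlines. A small sketch on a made-up snippet (the markup below is hypothetical, not taken from any real page):

```python
import re

# Hypothetical markup: the second paragraph spans two lines.
html = '<p class="a">first</p>\n<p class="b">second\nline</p>'

# Without re.S, '.' cannot cross a newline, so the second <p> is missed.
single_line = re.compile(r'<p.*?>(.*?)</p>')
# With re.S, '.' also matches '\n', so both paragraphs are found.
multi_line = re.compile(r'<p.*?>(.*?)</p>', re.S)

print(single_line.findall(html))  # ['first']
print(multi_line.findall(html))   # ['first', 'second\nline']
```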

7. items = re.findall(pat, html)

pat: the compiled regular expression
html: the page content obtained from response.text
re.findall(): returns a list of all non-overlapping matches
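
When the pattern contains more than one capture group, `re.findall()` returns a list of tuples, one tuple per match. A short sketch, assuming a simplified, made-up version of a ranking page's markup:

```python
import re

# Hypothetical, simplified markup for two ranking entries.
html = '''
<dd><i class="board-index">1</i><a class="name">Movie A</a></dd>
<dd><i class="board-index">2</i><a class="name">Movie B</a></dd>
'''

# Two capture groups: the rank and the title.
pat = re.compile(r'board-index">(\d+)</i>.*?name">(.*?)</a>', re.S)
items = re.findall(pat, html)

for rank, title in items:
    print(rank, title)
# items == [('1', 'Movie A'), ('2', 'Movie B')]
```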

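8. Open a file

with open('result', 'a', encoding='utf-8') as f:
with ... as: opens the file, binds it to f, and closes it automatically when the block ends
result: the file name
a: append mode
encoding: the text encoding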

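9. Write to the file

f.write(json.dumps(content, ensure_ascii=False) + '\n')
json.dumps: serializes the object to a JSON string; ensure_ascii=False writes non-ASCII characters (such as Chinese) as-is instead of as \u escapes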
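
Steps 8 and 9 together produce a file with one JSON object per line. The sketch below shows the round trip; the helper names `append_record` and `read_records`, the temporary path, and the sample records are all made up for illustration.

```python
import json
import os
import tempfile


def append_record(path, record):
    # 'a' appends, so earlier records are kept; the with block closes the file.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(record, ensure_ascii=False) + '\n')


def read_records(path):
    # Each non-empty line is an independent JSON document.
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), 'result')
append_record(path, {'index': '1', 'title': '電影甲'})
append_record(path, {'index': '2', 'title': 'Movie B'})
records = read_records(path)
print(records)
```

With the default `ensure_ascii=True`, the Chinese title would be stored as `\u` escape sequences, which is still valid JSON but harder to read in a text editor.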

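10. A simple process pool

pool = Pool()
# create a process pool
pool.map(func, [i*10 for i in range(10)])
[i*10 for i in range(10)]: a list comprehension (not a generator) that multiplies each number from 0 to 9 by 10, giving the list [0, 10, 20, ..., 90]
func: the function to run
map: applies the function to every element of the list, spreading the calls across the worker processes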
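
A minimal sketch of the same pattern with a harmless worker function. The name `describe` is made up; in the crawler, the worker would be the function that fetches one page. The worker has to be defined at module level so the pool can send it to the child processes, and the pool itself should be created under the `__main__` guard.

```python
from multiprocessing import Pool

# Ten offsets, one per page: [0, 10, 20, ..., 90]
offsets = [i * 10 for i in range(10)]


def describe(offset):
    # Stand-in for real work such as downloading one page.
    return 'page starting at offset {}'.format(offset)


if __name__ == '__main__':
    # The with block shuts the pool down once map() has returned.
    with Pool() as pool:
        results = pool.map(describe, offsets)
    print(results[:2])
```

`pool.map` blocks until every call has finished and returns the results in the same order as the input list.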

11. yield: turns a function into a generator, which produces its values one at a time
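
A small sketch of a generator function; the name `numbered` and the sample titles are made up. Nothing in the function body runs until the caller asks for a value.

```python
def numbered(titles):
    # Each yield hands one dict to the caller and pauses here
    # until the next value is requested.
    for index, title in enumerate(titles, start=1):
        yield {'index': index, 'title': title}


gen = numbered(['A', 'B'])
first = next(gen)   # runs the body up to the first yield
rest = list(gen)    # consumes whatever is left
print(first, rest)
```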

Example: use the tools above to scrape the Maoyan Movies TOP board

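#__author:PL.Li
# Import the modules we need
import json
import re
from multiprocessing import Pool

import requests
from requests.exceptions import RequestException

# Try to fetch the page; return None on any failure
def get_response(url):
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None

# Extract the fields we need with a regular expression
def re_one_page(html):
    # One very long pattern; re.S lets '.' match newlines.
    # It depends on the page markup, so it needs updating if the site changes.
    pat = re.compile(r'<dd>.*?board-index.*?">(\d+?)</i>.*?data-src="(.*?)".*?name"><a.*?">(.*?)</a>.*?class="star">'
                     r'(.*?)</p>.*?releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>', re.S)
    # findall returns a list of tuples, one tuple per movie
    items = re.findall(pat, html)
    # yield makes this function a generator: one dict per movie, produced lazily
    for item in items:
        yield {
            'index': item[0],
            'image': item[1],
            'title': item[2],
            # drop the three-character label in front of the cast list
            'actor': item[3].strip()[3:],
            'time': item[4].strip(),
            # the score is split into integer and fraction parts on the page
            'score': item[5] + item[6]
        }

# Append one record to the output file
def write_file(content):
    # the with block closes the file, so no explicit f.close() is needed
    with open('result', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

# Scrape one page of the board
def main(offset):
    # Assumed URL form: the TOP board is paginated through an offset query parameter
    url = 'http://maoyan.com/board/4?offset=' + str(offset)
    html = get_response(url)
    # skip this page if the request failed
    if html is None:
        return
    for item in re_one_page(html):
        write_file(item)

# Use multiple processes to speed things up; the whole run takes about a second
if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i*10 for i in range(10)])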
