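I. Requirements:
Write a program that does the following:
Prompt the user for the Baidu Tieba topic to crawl
Prompt the user for the first and last page numbers to crawl
Save the web page for each requested page number to a specified directory on the local disk
Crawling and file saving must each be done in a user-defined function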
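II. Problem analysis: finding the patterns
Take a Baidu Tieba search for "喬丹" (Jordan) as an example.
Comparing the two screenshots reveals some patterns:
1. The URL prefix of a Baidu Tieba keyword search page is: https://tieba.baidu.com/f?
2. The search keyword is passed as the parameter kw=key_word
3. The page offset parameter pn increases by 100-50=50 per page (pn=0 is page 1, pn=50 is page 2, pn=100 is page 3, and so on)
With these patterns in hand, we can write the code that crawls the pages we need.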
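The three patterns can be combined in a few lines. The following is only a minimal sketch and not the crawler itself: the helper name `build_tieba_url` and the constant `POSTS_PER_PAGE` are hypothetical names introduced for illustration, and the code builds URLs without sending any request.

```python
from urllib.parse import urlencode

BASE_URL = "https://tieba.baidu.com/f?"  # URL prefix from pattern 1
POSTS_PER_PAGE = 50  # hypothetical name for the pn step from pattern 3


def build_tieba_url(key_word, page):
    """Return the list-page URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    offset = (page - 1) * POSTS_PER_PAGE  # page 1 -> 0, page 2 -> 50, page 3 -> 100
    # urlencode percent-encodes non-ASCII keywords as UTF-8
    return BASE_URL + urlencode({"kw": key_word, "pn": offset})


for page in range(1, 4):
    print(build_tieba_url("喬丹", page))
```

Letting `urlencode` handle the keyword avoids hand-escaping Chinese characters in the query string.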
III. Python crawler source code
# -*- coding: utf-8 -*-
"""
Created on Mon Jun 24 22:21:12 2019
@author: UnderMask
"""
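# Crawl the HTML of Baidu Tieba list pages for a given search topic and page range
import datetime
import requests
import random  # used to pick a random User-Agent for each request
ualist = [  # a pool of browser User-Agent strings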
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24" ]
def save_txt(filePath, content):  # write the page content to the file at filePath
    with open(filePath, "w", encoding="utf-8") as file_write:
        file_write.write(content)


def climb(theme, start_page, end_page):  # Tieba search topic, first page, last page
    url = "https://tieba.baidu.com/f?"  # Baidu Tieba URL prefix before any search topic is given
    for i in range(start_page, end_page + 1):  # crawl from the first page through the last page
        ua = random.choice(ualist)  # pick a random User-Agent from the pool to mimic a real browser
        headers = {"Connection": "Keep-alive", "User-Agent": ua}  # request headers
        offset = (i - 1) * 50  # pn advances by 50 per page (page 1: pn=0, page 2: pn=50, page 3: pn=100, ...)
        # Send the search topic and the pn offset as query parameters.
        # The response is named resp so it does not shadow the standard-library module name re.
        resp = requests.get(url, params={"kw": theme, "pn": str(offset)}, headers=headers)
        resp.encoding = "utf-8"  # decode as UTF-8 to avoid garbled text
        savePath = "C:\\Users\\UnderMask\\Desktop\\新建文件夾\\" + "BaiDuTieBa&&theme=" + str(theme) + "&&pageNum=" + str(i) + "&&datetime=" + datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S') + ".txt"
        save_txt(savePath, resp.text)


if __name__ == "__main__":
    theme = input("Enter the Baidu Tieba topic to crawl: ")
    start_page = int(input("Enter the first page number to crawl: "))
    end_page = int(input("Enter the last page number to crawl: "))
    start_time = datetime.datetime.now()
    print("[" + start_time.strftime('%Y-%m-%d %H:%M:%S') + "]>>>" + "Crawl started!")
    climb(theme, start_page, end_page)
    end_time = datetime.datetime.now()
    print("[" + end_time.strftime('%Y-%m-%d %H:%M:%S') + "]>>>" + "Crawl finished!")
    print("Total time:", (end_time - start_time))
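You need to change the save path yourself (mine is an absolute path on my machine). The file naming scheme is improvised; each page is stored as its own .txt file.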
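One way to avoid editing an absolute path is to build the file path from a folder of your choice. This is a sketch under assumptions, not part of the original script: the helper name `build_save_path` and the folder name `tieba_pages` are hypothetical, while the file naming scheme is the one used in the script.

```python
import datetime
from pathlib import Path


def build_save_path(out_dir, theme, page_num, now=None):
    """Build the per-page .txt path using the script's naming scheme."""
    now = now or datetime.datetime.now()
    stamp = now.strftime("%Y-%m-%d %H-%M-%S")  # no colons, which Windows file names forbid
    name = f"BaiDuTieBa&&theme={theme}&&pageNum={page_num}&&datetime={stamp}.txt"
    return Path(out_dir) / name


# "tieba_pages" is a placeholder folder relative to the current directory
path = build_save_path("tieba_pages", "喬丹", 1)
print(path)
```

Before writing, the folder can be created with `path.parent.mkdir(parents=True, exist_ok=True)`, and the resulting path can then be passed to `save_txt` unchanged.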
IV. Results
1. Console
2. Output folder
3. For example, the saved .txt file for page 1