Preface
In the scraping write-ups I've seen shared online recently, most people targeting Taobao fetch product-listing data. I'm personally more interested in the customer reviews, so I wrote a simple crawler to fetch Taobao comments. Be aware that Taobao's anti-scraping measures are strict: you need to be logged in, and request frequency and speed are limited, so for larger crawls it is advisable to use a proxy pool and multiple cookies.
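One simple way to rotate a proxy pool and multiple cookies is to cycle through them round-robin with `itertools.cycle`. This is only a minimal sketch: the proxy addresses and cookie strings below are placeholders, and real values would come from a proxy service and from logged-in sessions.

```python
import itertools

# Placeholder pools -- substitute real proxies and logged-in cookie strings
PROXIES = [
    {'https': 'https://127.0.0.1:8001'},
    {'https': 'https://127.0.0.1:8002'},
]
COOKIES = ['cookie_a', 'cookie_b', 'cookie_c']

proxy_pool = itertools.cycle(PROXIES)
cookie_pool = itertools.cycle(COOKIES)

def next_request_config():
    """Return the (proxy, cookie) pair to use for the next request."""
    return next(proxy_pool), next(cookie_pool)
```

Each call to `requests.get` would then pass `proxies=proxy` and put the cookie string into the `Cookie` header, so consecutive requests look like they come from different clients.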
Workflow
Here we take a random product as an example. The workflow is:
- Extract the real comment-request URL from the product detail-page link
- Request the comment URL and receive the response
- Parse the data to get the total comment count and the comment data
- Save the data locally
- Use the total comment count to page through the results, repeating steps 2, 3, and 4
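The paging in the last step can be sketched as follows, assuming the endpoint serves 20 comments per page (the page size the full code below also assumes):

```python
import math

PAGE_SIZE = 20  # comments returned per page

def page_numbers(total_comments):
    """Return the page numbers still to fetch after page 1."""
    last_page = math.ceil(total_comments / PAGE_SIZE)
    return list(range(2, last_page + 1))
```

For example, a product with 95 comments spans 5 pages, so after the first request pages 2 through 5 remain.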
Taobao's comments are rendered dynamically with JavaScript and do not appear in the initial page source, so we need to find the URL of the separate request that fetches them, which is not hard to do.
Right-click and press F12 to open the browser's DevTools. At this point no comment request has been sent and no comments are loaded; when you click the comments tab on the page, a new request fires, and the one whose name starts with "feedRateList" is the one we want. The Preview tab shows it is JSON containing the comments along with other data. You could save the whole JSON, but it holds many other keys whose meaning I don't know, so I only extract the fields I'm interested in.
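The response body is not bare JSON: it comes back with leading newlines and wrapped in parentheses (JSONP style), so it cannot be fed to `json.loads` directly. A minimal cleanup, using a made-up, heavily shortened stand-in for the real response (which has many more keys):

```python
import json
import re

# Shortened stand-in for a feedRateList response body
raw = '\n\n({"total": 95, "comments": [{"content": "不錯", "date": "2018年11月"}]})\n'

# Strip leading/trailing whitespace and the wrapping parentheses,
# then parse the remaining JSON text
cleaned = re.sub(r'^[\s(]+|[\s)]+$', '', raw)
data = json.loads(cleaned)
```

Trimming only the leading and trailing wrapper is slightly safer than deleting every parenthesis in the body, since a comment's text could itself contain parentheses.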
Full code
```python
import re
import requests
import json
import math
import time
# import pymongo


class TaoBaoComment:
    def __init__(self):
        self.target_url = 'https://item.taobao.com/item.htm?spm=a219r.lm874.14.58.7cd87156tmSUG2&id=579824132873&ns=1&abbucket=18#detail'
        self.raw_url = 'https://rate.taobao.com/feedRateList.htm?'
        self.post_url = 'https://login.taobao.com/member/login.jhtml?'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
            'Cookie': 'cookie'  # obtained after logging in; a session login also works
        }
        self.item_id = self.parse_url(self.target_url)
        # self.session = requests.session()

    def parse_url(self, url):
        # Pull the numeric item id out of the detail-page URL
        pattern = re.compile('.*?id=([0-9]+)&.*?', re.S)
        result = re.findall(pattern, url)
        return result[0]

    # def login_in(self):
    #     data = {'TPL_username': 'xxx', 'TPL_password': 'xxx'}
    #     post_resp = self.session.post(self.post_url, headers=self.headers, data=data)
    #     print(post_resp.status_code)

    def get_page(self, pagenum):
        params = {
            'auctionNumId': self.item_id,
            'currentPageNum': pagenum
        }
        resp = requests.get(self.raw_url, params=params, headers=self.headers)
        print(resp.status_code)
        # resp.encoding = resp.apparent_encoding
        # Strip the newlines and the JSONP-style parentheses around the JSON body
        content = re.sub(r'[\n\r()]', '', resp.content.decode())
        return content

    def get_detail(self, content):
        if content:
            page = json.loads(content)
            if page and ('comments' in page.keys()):
                total = page['total']
                comments = self.get_comment(page)
                return comments, total
        return [], 0

    def get_comment(self, page):
        detail_list = []
        if page and ('comments' in page.keys()):
            for comment in page['comments']:
                details = {'date': comment['date']}
                details['num'] = comment['buyAmount']
                if comment['bidPriceMoney']:
                    details['amount'] = comment['bidPriceMoney']['amount']
                if comment['auction']['sku']:
                    details['sku'] = comment['auction']['sku'].replace(' ', '')
                details['comment'] = comment['content']
                if comment['photos']:
                    details['photos'] = [i['url'].replace('_400x400.jpg', '') for i in comment['photos']]
                if comment['append']:  # follow-up review, if the buyer left one
                    details['extra_comment'] = comment['append']['content']
                    if comment['append']['photos']:
                        details['extra_photos'] = [i['url'].replace('_400x400.jpg', '') for i in comment['append']['photos']]
                    details['dayAfterConfirm'] = comment['append']['dayAfterConfirm']
                detail_list.append(details)
        return detail_list

    def on_save(self, content):
        if content:
            with open('E:/spiders/taobao_comment/comment.txt', 'a', encoding='utf-8') as f:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')

    def run(self):
        # self.login_in()
        content = self.get_page(1)
        comments, total = self.get_detail(content)
        for comment in comments:
            self.on_save(comment)
        pagenum = math.ceil(total / 20)  # 20 comments per page
        n = 2
        while pagenum >= n:
            content = self.get_page(n)  # request the current page, not page 2 every time
            time.sleep(5)
            comments, _ = self.get_detail(content)
            for comment in comments:
                self.on_save(comment)
            print('page {} saved'.format(n))
            n += 1


if __name__ == '__main__':
    comment = TaoBaoComment()
    comment.run()
```