Preface
In the scraping write-ups I've seen shared online recently, most people targeting Taobao fetch product-listing data. I'm personally more interested in the customer reviews, so I wrote a simple crawler to fetch Taobao comments. Be aware that Taobao's anti-scraping measures are strict: you need to be logged in, and request frequency and speed are limited, so for larger crawls it is advisable to use a proxy pool and multiple cookies.
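One simple way to rotate a proxy pool and multiple cookies is to cycle through them round-robin with `itertools.cycle`. This is only a minimal sketch: the proxy addresses and cookie strings below are placeholders, and real values would come from a proxy service and from logged-in sessions.

```python
import itertools

# Placeholder pools -- substitute real proxies and logged-in cookie strings
PROXIES = [
    {'https': 'https://127.0.0.1:8001'},
    {'https': 'https://127.0.0.1:8002'},
]
COOKIES = ['cookie_a', 'cookie_b', 'cookie_c']

proxy_pool = itertools.cycle(PROXIES)
cookie_pool = itertools.cycle(COOKIES)

def next_request_config():
    """Return the (proxy, cookie) pair to use for the next request."""
    return next(proxy_pool), next(cookie_pool)
```

Each call to `requests.get` would then pass `proxies=proxy` and put the cookie string into the `Cookie` header, so consecutive requests look like they come from different clients.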
Workflow
Here we take a random product as an example. The workflow is:
- Extract the real comment-request URL from the product detail-page link
- Request the comment URL and receive the response
- Parse the data to get the total comment count and the comment data
- Save the data locally
- Use the total comment count to page through the results, repeating steps 2, 3, and 4
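The paging in the last step can be sketched as follows, assuming the endpoint serves 20 comments per page (the page size the full code below also assumes):

```python
import math

PAGE_SIZE = 20  # comments returned per page

def page_numbers(total_comments):
    """Return the page numbers still to fetch after page 1."""
    last_page = math.ceil(total_comments / PAGE_SIZE)
    return list(range(2, last_page + 1))
```

For example, a product with 95 comments spans 5 pages, so after the first request pages 2 through 5 remain.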
Taobao's comments are rendered dynamically with JavaScript and do not appear in the initial page source, so we need to find the URL of the separate request that fetches them, which is not hard to do.
Right-click and press F12 to open the browser's DevTools. At this point no comment request has been sent and no comments are loaded; when you click the comments tab on the page, a new request fires, and the one whose name starts with "feedRateList" is the one we want. The Preview tab shows it is JSON containing the comments along with other data. You could save the whole JSON, but it holds many other keys whose meaning I don't know, so I only extract the fields I'm interested in.
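The response body is not bare JSON: it comes back with leading newlines and wrapped in parentheses (JSONP style), so it cannot be fed to `json.loads` directly. A minimal cleanup, using a made-up, heavily shortened stand-in for the real response (which has many more keys):

```python
import json
import re

# Shortened stand-in for a feedRateList response body
raw = '\n\n({"total": 95, "comments": [{"content": "不錯", "date": "2018年11月"}]})\n'

# Strip leading/trailing whitespace and the wrapping parentheses,
# then parse the remaining JSON text
cleaned = re.sub(r'^[\s(]+|[\s)]+$', '', raw)
data = json.loads(cleaned)
```

Trimming only the leading and trailing wrapper is slightly safer than deleting every parenthesis in the body, since a comment's text could itself contain parentheses.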
Full code
```python
import re
import requests
import json
import math
import time
# import pymongo


class TaoBaoComment:
    def __init__(self):
        self.target_url = 'https://item.taobao.com/item.htm?spm=a219r.lm874.14.58.7cd87156tmSUG2&id=579824132873&ns=1&abbucket=18#detail'
        self.raw_url = 'https://rate.taobao.com/feedRateList.htm?'
        self.post_url = 'https://login.taobao.com/member/login.jhtml?'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36',
            'Cookie': 'cookie'  # obtained after logging in; a session login also works
        }
        self.item_id = self.parse_url(self.target_url)
        # self.session = requests.session()

    def parse_url(self, url):
        # Pull the numeric item id out of the detail-page URL
        pattern = re.compile('.*?id=([0-9]+)&.*?', re.S)
        result = re.findall(pattern, url)
        return result[0]

    # def login_in(self):
    #     data = {'TPL_username': 'xxx', 'TPL_password': 'xxx'}
    #     post_resp = self.session.post(self.post_url, headers=self.headers, data=data)
    #     print(post_resp.status_code)

    def get_page(self, pagenum):
        params = {
            'auctionNumId': self.item_id,
            'currentPageNum': pagenum
        }
        resp = requests.get(self.raw_url, params=params, headers=self.headers)
        print(resp.status_code)
        # resp.encoding = resp.apparent_encoding
        # Strip the newlines and the JSONP-style parentheses around the JSON body
        content = re.sub(r'[\n\r()]', '', resp.content.decode())
        return content

    def get_detail(self, content):
        if content:
            page = json.loads(content)
            if page and ('comments' in page.keys()):
                total = page['total']
                comments = self.get_comment(page)
                return comments, total
        return [], 0

    def get_comment(self, page):
        detail_list = []
        if page and ('comments' in page.keys()):
            for comment in page['comments']:
                details = {'date': comment['date']}
                details['num'] = comment['buyAmount']
                if comment['bidPriceMoney']:
                    details['amount'] = comment['bidPriceMoney']['amount']
                if comment['auction']['sku']:
                    details['sku'] = comment['auction']['sku'].replace(' ', '')
                details['comment'] = comment['content']
                if comment['photos']:
                    details['photos'] = [i['url'].replace('_400x400.jpg', '') for i in comment['photos']]
                if comment['append']:  # follow-up review, if the buyer left one
                    details['extra_comment'] = comment['append']['content']
                    if comment['append']['photos']:
                        details['extra_photos'] = [i['url'].replace('_400x400.jpg', '') for i in comment['append']['photos']]
                    details['dayAfterConfirm'] = comment['append']['dayAfterConfirm']
                detail_list.append(details)
        return detail_list

    def on_save(self, content):
        if content:
            with open('E:/spiders/taobao_comment/comment.txt', 'a', encoding='utf-8') as f:
                f.write(json.dumps(content, ensure_ascii=False))
                f.write('\n')

    def run(self):
        # self.login_in()
        content = self.get_page(1)
        comments, total = self.get_detail(content)
        for comment in comments:
            self.on_save(comment)
        pagenum = math.ceil(total / 20)  # 20 comments per page
        n = 2
        while pagenum >= n:
            content = self.get_page(n)  # request the current page, not page 2 every time
            time.sleep(5)
            comments, _ = self.get_detail(content)
            for comment in comments:
                self.on_save(comment)
            print('page {} saved'.format(n))
            n += 1


if __name__ == '__main__':
    comment = TaoBaoComment()
    comment.run()
```