Web Crawler | Site-Wide JD.com Data Collection (Categories, Shops, Products, Comments) — Based on Python's Scrapy Framework

1. Define the storage structures for the collected data

  • 【Storage structure notes】
  • class CategoriesItem(Item): stores JD category information
  • class ProductsItem(Item): stores JD product information
  • class ShopItem(Item): stores JD shop information
  • class CommentSummaryItem(Item): stores the comment summary of each JD product
  • class CommentItem(Item): stores the basic information of each comment on a JD product
  • class CommentImageItem(Item): stores the image information of each comment on a JD product
  • Note: the fields defined in each class can be adjusted to match the actual collection requirements or the response content

【items.py】

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class CategoriesItem(Item):
    """
    Stores JD category information
    """
    name = Field()    # name of the third-level product category
    url = Field()     # url of the third-level product category
    _id = Field()     # category id [level-1 id, level-2 id, level-3 id]


class ProductsItem(Item):
    """
    Stores JD product information
    """
    name = Field()                  # product name
    url = Field()                   # product url [used to extract the main product image]
    _id = Field()                   # product sku
    category = Field()              # third-level product category
    description = Field()           # product description
    shopId = Field()                # id (name) of the shop selling the product
    commentCount = Field()          # total number of product reviews = CommentSummaryItem.commentCount
    # goodComment = Field()           # number of positive reviews
    # generalComment = Field()        # number of neutral reviews
    # poorComment = Field()           # number of negative reviews
    # favourableDesc1 = Field()       # promotion description 1
    # favourableDesc2 = Field()       # promotion description 2
    # venderId = Field()              # vendor id
    # reallyPrice = Field()           # current price
    # originalPrice = Field()         # original price


class ShopItem(Item):
    _id = Field()                   # shop url
    shopName = Field()              # shop name
    shopItemScore = Field()         # shop score [product reviews]
    shopLgcScore = Field()          # shop score [logistics fulfilment]
    shopAfterSale = Field()         # shop score [after-sales service]


class CommentItem(Item):
    _id = Field()                   # comment id
    productId = Field()             # product id = sku
    guid = Field()                  # globally unique identifier of the comment
    firstCategory = Field()         # first-level product category
    secondCategory = Field()        # second-level product category
    thirdCategory = Field()         # third-level product category
    score = Field()                 # user rating
    nickname = Field()              # user nickname
    plusAvailable = Field()         # user account level (201: PLUS member, 103: regular user, 0: low-value user)
    content = Field()               # comment content
    creationTime = Field()          # comment time
    replyCount = Field()            # number of replies to the comment
    usefulVoteCount = Field()       # number of upvotes the comment received
    imageCount = Field()            # number of images in the comment


class CommentImageItem(Item):
    _id = Field()                   # id of the posted image (one id per image)
    commentGuid = Field()           # guid of the comment the image belongs to
    imgId = Field()                 # image id
    imgUrl = Field()                # image url
    imgTitle = Field()              # image title
    imgStatus = Field()             # image status


class CommentSummaryItem(Item):
    """Product comment summary"""
    _id = Field()                   # product sku
    productId = Field()             # product pid
    commentCount = Field()          # cumulative number of product comments
    score1Count = Field()           # number of 1-star ratings
    score2Count = Field()           # number of 2-star ratings
    score3Count = Field()           # number of 3-star ratings
    score4Count = Field()           # number of 4-star ratings
    score5Count = Field()           # number of 5-star ratings

2. Define the item pipeline

  • 【Pipeline notes】
  • Database: MongoDB
  • Database name: JD
  • Database collections: Categories, Products, Shop, CommentSummary, Comment, and CommentImage
  • Processing flow: first match the item type to its target collection, then insert it; inserting duplicate data raises an exception (see the note after the pipeline code below)

【pipelines.py】

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from JDSpider.items import *


class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["JD"]
        self.Categories = db["Categories"]
        self.Products = db["Products"]
        self.Shop = db["Shop"]
        self.Comment = db["Comment"]
        self.CommentImage = db["CommentImage"]
        self.CommentSummary = db["CommentSummary"]

    def process_item(self, item, spider):
        """ Check the item's type, route it to the matching collection, then insert it into the database """
        if isinstance(item, CategoriesItem):
            try:
                self.Categories.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, ProductsItem):
            try:
                self.Products.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, ShopItem):
            try:
                self.Shop.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentItem):
            try:
                self.Comment.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentImageItem):
            try:
                self.CommentImage.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentSummaryItem):
            try:
                self.CommentSummary.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        return item
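
Because every item reuses a JD identifier (category id, product sku, comment id, etc.) as MongoDB's primary key _id, inserting a record that already exists raises pymongo.errors.DuplicateKeyError, which the except branches above catch and log. A minimal standalone illustration, assuming a local MongoDB on the default port (the sku value below is made up):

import pymongo
from pymongo.errors import DuplicateKeyError

client = pymongo.MongoClient("localhost", 27017)
products = client["JD"]["Products"]

products.insert_one({"_id": "100012345678", "name": "demo product"})      # first insert succeeds
try:
    products.insert_one({"_id": "100012345678", "name": "demo product"})  # same sku again
except DuplicateKeyError as e:
    print("duplicate sku skipped:", e)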

3. Define the middlewares file

  • 【Middleware notes】
  • Includes a "User-Agent rotation middleware" and a "Cookie/retry middleware"
  • User-Agent rotation middleware: prevents consecutive requests from being detected and blacklisted by JD's backend
  • Cookie/retry middleware: checks how JD's backend server responds and handles each case accordingly

【middlewares.py】

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

import os
import logging
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.response import response_status_message
from scrapy.downloadermiddlewares.retry import RetryMiddleware
import random

logger = logging.getLogger(__name__)


class UserAgentMiddleware(object):
    """ Rotate the User-Agent """

    def process_request(self, request, spider):
        """Pick a random line from the local file and set it as the request's User-Agent"""
        with open("E://proxy.txt", "r") as f:
            PROXIES = f.readlines()
            agent = random.choice(PROXIES)
            agent = agent.strip()
            request.headers["User-Agent"] = agent


class CookiesMiddleware(RetryMiddleware):
    """ Maintain Cookies """

    def process_request(self, request, spider):
        pass

    def process_response(self, request, response, spider):
        if response.status in [300, 301, 302, 303]:
            try:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response  # retry the request
            except Exception as e:
                raise IgnoreRequest
        elif response.status in [403, 414]:
            logger.error("%s! Stopping..." % response.status)
            os.system("pause")  # Windows-only: wait for manual intervention before continuing
            return response
        else:
            return response

4. Modify the Scrapy settings file

  • 【Modification notes】
  • robots.txt: set ROBOTSTXT_OBEY to False, so that JD's robots rules do not stop the crawler from fetching data
  • Maximum concurrent requests: set according to the actual performance of your machine
  • Downloader middleware priority: the smaller the value, the higher the priority
  • Item pipeline priority: the smaller the value, the higher the priority
  • Note: the settings file is too long to show in full; a minimal sketch of the entries above follows below
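
The full settings.py from the original project is not shown, so the sketch below only covers the entries described above. The module paths assume the project package is named JDSpider (as in the imports used elsewhere in this post), and the priority numbers and concurrency values are illustrative rather than the author's actual settings.

# settings.py (excerpt, illustrative values)
BOT_NAME = 'JDSpider'

ROBOTSTXT_OBEY = False              # ignore robots.txt so JD pages can be fetched

CONCURRENT_REQUESTS = 16            # tune to your machine's capacity
DOWNLOAD_DELAY = 0.5                # small delay to lower the chance of being blocked

DOWNLOADER_MIDDLEWARES = {
    # the smaller the value, the higher the priority
    'JDSpider.middlewares.UserAgentMiddleware': 400,
    'JDSpider.middlewares.CookiesMiddleware': 700,
}

ITEM_PIPELINES = {
    # the smaller the value, the higher the priority
    'JDSpider.pipelines.MongoDBPipeline': 300,
}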

5. Scraping the product categories

  • 【Category scraping notes】
  • Some categories contain many subcategories, so for those urls another request has to be yielded and crawled in turn (a worked example of the link parsing follows the code below)
texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
for text in texts:
    # get every third-level category link + third-level category name
    items = re.findall(r'<a href="(.*?)" target="_blank">(.*?)</a>', text)
    for item in items:
        # decide whether this category link needs a further request
        if item[0].split('.')[0][2:] in key_word:
            if item[0].split('.')[0][2:] != 'list':
                yield Request(url='https:' + item[0], callback=self.parse_category)
            else:
                # record the category: name / crawlable URL / id code
                categoriesItem = CategoriesItem()
                categoriesItem['name'] = item[1]
                categoriesItem['url'] = 'https:' + item[0]
                categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                yield categoriesItem
                meta = dict()
                meta["category"] = item[0].split("=")[1]
                yield Request(url='https:' + item[0], callback=self.parse_list, meta=meta)
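
As a worked example of the string handling above (the href value is hypothetical, but follows the usual shape of JD category links), the first split checks which host the link points to, and the second extracts the three category ids:

href = "//list.jd.com/list.html?cat=1318,1466,1694"   # hypothetical third-level category link

print(href.split('.')[0][2:])              # 'list'            -> a leaf list page, so a CategoriesItem is emitted
print(href.split('=')[1].split('&')[0])    # '1318,1466,1694'  -> [level-1 id, level-2 id, level-3 id]

# A different prefix that is still listed in key_word (e.g. 'channel' from "//channel.jd.com/...")
# is instead re-requested with callback=self.parse_category to reach its subcategories.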

6. Scraping the product information

  • 【Product scraping notes】
  • Flow: visit the url of each category, collect the url of every product on the listing page, then follow each one to the detail page and scrape the product details
  • Note: the URL that the pagination request actually hits has to be found by analysing the page, and its pattern used to page through the listing (a hedged guess at the list_url template used below follows the code)

【Extracting the product links】

selector = Selector(response)
texts = selector.xpath('//*[@id="J_goodsList"]/ul/li/div/div[@class="p-img"]/a').extract()
for text in texts:
    items = text.split("=")[3].split('"')[1]
    yield Request(url='https:' + items, callback=self.parse_product, meta=meta)

# pagination [only the first 50 pages]
maxPage = int(response.xpath('//div[@id="J_filter"]/div/div/span/i/text()').extract()[0])
if maxPage > 1:
    if maxPage > 50:
        maxPage = 50
    for i in range(2, maxPage + 1):
        num = 2*i - 1
        category = meta["category"].split(",")[0] + '%2C' + meta["category"].split(",")[1] + '%2C' + meta["category"].split(",")[2]
        url = list_url % (category, num, 30*num)
        print('products next page:', url)
        yield Request(url=url, callback=self.parse_list2, meta=meta)
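
list_url is not defined in the excerpt above. Judging from the three values it is formatted with (the %2C-encoded category ids, the page number num, and the item offset 30*num), it is presumably a template along the following lines; the exact query string is an assumption and should be taken from the real pagination request observed in the browser's developer tools:

# assumed list-page template: %s -> encoded category ids, %s -> page number, %s -> item offset
list_url = 'https://list.jd.com/list.html?cat=%s&page=%s&s=%s&click=0'

# e.g. UI page 2 of category 1318,1466,1694 maps to request page 3 with offset 90:
print(list_url % ('1318%2C1466%2C1694', 3, 90))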

7. Scraping the shop information

  • 【Shop scraping notes】
  • The shop information can be extracted from the same page used to scrape the product information
  • However, self-operated (京東自營) listings have to be distinguished from third-party shops, because self-operated listings lack some of the fields
# get the id of the shop selling the product + the shop information
shopItem["shopName"] = response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/text()').extract()[0]
shopItem["_id"] = "https:" + response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/@href').extract()[0]
productsItem['shopId'] = shopItem["_id"]
# distinguish self-operated (京東自營) listings from third-party shops
res = response.xpath('//div[@class="score-parts"]/div/span/em/@title').extract()
if len(res) == 0:
    shopItem["shopItemScore"] = "京東自營"
    shopItem["shopLgcScore"] = "京東自營"
    shopItem["shopAfterSale"] = "京東自營"
else:
    shopItem["shopItemScore"] = res[0]
    shopItem["shopLgcScore"] = res[1]
    shopItem["shopAfterSale"] = res[2]
    # shopItem["_id"] = response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/@href').extract()[0].split("-")[1].split(".")[0]
yield shopItem

8. Scraping the comment information

  • 【Comment scraping notes】
  • The comment data is also loaded dynamically; it comes back as json and is updated from time to time. The request format is as follows:
  • comment_url = 'https://club.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=5&page=%s&pageSize=10'


    def parse_comments(self, response):
        """
        Get the product comments
        :param response: json response returned by the comment endpoint
        :return:
        """
        try:
            data = json.loads(response.text)
        except Exception as e:
            print('get comment failed:', e)
            return None

        product_id = response.meta['product_id']

        # fetch the product comment summary [imported only once per product]
        commentSummaryItem = CommentSummaryItem()
        commentSummary = data.get('productCommentSummary')
        commentSummaryItem['_id'] = commentSummary.get('skuId')
        commentSummaryItem['productId'] = commentSummary.get('productId')
        commentSummaryItem['commentCount'] = commentSummary.get('commentCount')
        commentSummaryItem['score1Count'] = commentSummary.get('score1Count')
        commentSummaryItem['score2Count'] = commentSummary.get('score2Count')
        commentSummaryItem['score3Count'] = commentSummary.get('score3Count')
        commentSummaryItem['score4Count'] = commentSummary.get('score4Count')
        commentSummaryItem['score5Count'] = commentSummary.get('score5Count')

        # yield the summary item; the pipeline routes it to the right collection by type
        yield commentSummaryItem

        # product comments [first page; remaining pages are handled by parse_comments2, sketched after this function]
        for comment_item in data['comments']:
            comment = CommentItem()
            comment['_id'] = str(product_id)+","+str(comment_item.get("id"))
            comment['productId'] = product_id
            comment["guid"] = comment_item.get('guid')
            comment['firstCategory'] = comment_item.get('firstCategory')
            comment['secondCategory'] = comment_item.get('secondCategory')
            comment['thirdCategory'] = comment_item.get('thirdCategory')
            comment['score'] = comment_item.get('score')
            comment['nickname'] = comment_item.get('nickname')
            comment['plusAvailable'] = comment_item.get('plusAvailable')
            comment['content'] = comment_item.get('content')
            comment['creationTime'] = comment_item.get('creationTime')
            comment['replyCount'] = comment_item.get('replyCount')
            comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')
            comment['imageCount'] = comment_item.get('imageCount')
            yield comment

            # store the images attached to the current comment
            if 'images' in comment_item:
                for image in comment_item['images']:
                    commentImageItem = CommentImageItem()
                    commentImageItem['commentGuid'] = comment_item.get('guid')
                    commentImageItem['imgId'] = image.get('id')
                    commentImageItem['_id'] = str(product_id)+","+str(comment_item.get('id'))+","+str(image.get('id'))
                    commentImageItem['imgUrl'] = 'http:' + image.get('imgUrl')
                    commentImageItem['imgTitle'] = image.get('imgTitle')
                    commentImageItem['imgStatus'] = image.get('status')
                    yield commentImageItem

        # comment pagination [to collect enough ratings for each product]
        max_page = int(data.get('maxPage', '1'))
        # if max_page > 60:
        #     # cap the maximum number of comment pages
        #     max_page = 60
        for i in range(1, max_page):
            url = comment_url % (product_id, str(i))
            meta = dict()
            meta['product_id'] = product_id
            yield Request(url=url, callback=self.parse_comments2, meta=meta)
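
parse_comments2 is referenced above but not shown in the original post. Presumably it parses comment pages 2 onwards exactly like parse_comments, while skipping the summary and the pagination that were already handled on the first page; a minimal sketch under that assumption:

    def parse_comments2(self, response):
        """Comment pages 2..maxPage: per-comment parsing only (sketch; fills the same
        CommentItem/CommentImageItem fields as parse_comments, summary and paging omitted)."""
        try:
            data = json.loads(response.text)
        except Exception as e:
            print('get comment failed:', e)
            return None

        product_id = response.meta['product_id']
        for comment_item in data['comments']:
            comment = CommentItem()
            comment['_id'] = str(product_id) + "," + str(comment_item.get("id"))
            comment['productId'] = product_id
            comment['guid'] = comment_item.get('guid')
            comment['score'] = comment_item.get('score')
            comment['nickname'] = comment_item.get('nickname')
            comment['content'] = comment_item.get('content')
            comment['creationTime'] = comment_item.get('creationTime')
            # ...remaining CommentItem fields and CommentImageItem handling as in parse_comments
            yield comment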

9. The crawling process

10. Sample of the collected data

If you need the data, feel free to get in touch; the dataset is very large.

 
