Web Crawler | Site-Wide JD.com Data Collection (Categories, Shops, Products, Comments) — Based on Python's Scrapy Framework

1. Define the storage structures for the collected data

  • 【Storage structure notes】
  • class CategoriesItem(Item): stores JD category information
  • class ProductsItem(Item): stores JD product information
  • class ShopItem(Item): stores JD shop information
  • class CommentSummaryItem(Item): stores the comment summary of each JD product
  • class CommentItem(Item): stores the basic information of each comment on a JD product
  • class CommentImageItem(Item): stores the image information of each comment on a JD product
  • Note: the fields defined in each class can be adjusted to match the actual collection requirements or the response content

【items.py】

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class CategoriesItem(Item):
    """
    Stores JD category information
    """
    name = Field()    # name of the third-level product category
    url = Field()     # url of the third-level product category
    _id = Field()     # category id [level-1 id, level-2 id, level-3 id]


class ProductsItem(Item):
    """
    Stores JD product information
    """
    name = Field()                  # product name
    url = Field()                   # product url [used to extract the main product image]
    _id = Field()                   # product sku
    category = Field()              # third-level product category
    description = Field()           # product description
    shopId = Field()                # id (name) of the shop selling the product
    commentCount = Field()          # total number of product reviews = CommentSummaryItem.commentCount
    # goodComment = Field()           # number of positive reviews
    # generalComment = Field()        # number of neutral reviews
    # poorComment = Field()           # number of negative reviews
    # favourableDesc1 = Field()       # promotion description 1
    # favourableDesc2 = Field()       # promotion description 2
    # venderId = Field()              # vendor id
    # reallyPrice = Field()           # current price
    # originalPrice = Field()         # original price


class ShopItem(Item):
    _id = Field()                   # shop url
    shopName = Field()              # shop name
    shopItemScore = Field()         # shop score [product reviews]
    shopLgcScore = Field()          # shop score [logistics fulfilment]
    shopAfterSale = Field()         # shop score [after-sales service]


class CommentItem(Item):
    _id = Field()                   # comment id
    productId = Field()             # product id = sku
    guid = Field()                  # globally unique identifier of the comment
    firstCategory = Field()         # first-level product category
    secondCategory = Field()        # second-level product category
    thirdCategory = Field()         # third-level product category
    score = Field()                 # user rating
    nickname = Field()              # user nickname
    plusAvailable = Field()         # user account level (201: PLUS member, 103: regular user, 0: low-value user)
    content = Field()               # comment content
    creationTime = Field()          # comment time
    replyCount = Field()            # number of replies to the comment
    usefulVoteCount = Field()       # number of upvotes the comment received
    imageCount = Field()            # number of images in the comment


class CommentImageItem(Item):
    _id = Field()                   # id of the posted image (one id per image)
    commentGuid = Field()           # guid of the comment the image belongs to
    imgId = Field()                 # image id
    imgUrl = Field()                # image url
    imgTitle = Field()              # image title
    imgStatus = Field()             # image status


class CommentSummaryItem(Item):
    """Product comment summary"""
    _id = Field()                   # product sku
    productId = Field()             # product pid
    commentCount = Field()          # cumulative number of product comments
    score1Count = Field()           # number of 1-star ratings
    score2Count = Field()           # number of 2-star ratings
    score3Count = Field()           # number of 3-star ratings
    score4Count = Field()           # number of 4-star ratings
    score5Count = Field()           # number of 5-star ratings

2. Define the item pipeline

  • 【Pipeline notes】
  • Database: MongoDB
  • Database name: JD
  • Database collections: Categories, Products, Shop, CommentSummary, Comment, and CommentImage
  • Processing flow: first match the item type to its target collection, then insert it; inserting duplicate data raises an exception (see the note after the pipeline code below)

【pipelines.py】

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from JDSpider.items import *


class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["JD"]
        self.Categories = db["Categories"]
        self.Products = db["Products"]
        self.Shop = db["Shop"]
        self.Comment = db["Comment"]
        self.CommentImage = db["CommentImage"]
        self.CommentSummary = db["CommentSummary"]

    def process_item(self, item, spider):
        """ Check the item's type, route it to the matching collection, then insert it into the database """
        if isinstance(item, CategoriesItem):
            try:
                self.Categories.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, ProductsItem):
            try:
                self.Products.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, ShopItem):
            try:
                self.Shop.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentItem):
            try:
                self.Comment.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentImageItem):
            try:
                self.CommentImage.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentSummaryItem):
            try:
                self.CommentSummary.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        return item
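
Because every item reuses a JD identifier (category id, product sku, comment id, etc.) as MongoDB's primary key _id, inserting a record that already exists raises pymongo.errors.DuplicateKeyError, which the except branches above catch and log. A minimal standalone illustration, assuming a local MongoDB on the default port (the sku value below is made up):

import pymongo
from pymongo.errors import DuplicateKeyError

client = pymongo.MongoClient("localhost", 27017)
products = client["JD"]["Products"]

products.insert_one({"_id": "100012345678", "name": "demo product"})      # first insert succeeds
try:
    products.insert_one({"_id": "100012345678", "name": "demo product"})  # same sku again
except DuplicateKeyError as e:
    print("duplicate sku skipped:", e)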

3. Define the middlewares file

  • 【Middleware notes】
  • Includes a "User-Agent rotation middleware" and a "Cookie/retry middleware"
  • User-Agent rotation middleware: prevents consecutive requests from being detected and blacklisted by JD's backend
  • Cookie/retry middleware: checks how JD's backend server responds and handles each case accordingly

【middlewares.py】

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

import os
import logging
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.response import response_status_message
from scrapy.downloadermiddlewares.retry import RetryMiddleware
import random

logger = logging.getLogger(__name__)


class UserAgentMiddleware(object):
    """ Rotate the User-Agent """

    def process_request(self, request, spider):
        """Pick a random line from the local file and set it as the request's User-Agent"""
        with open("E://proxy.txt", "r") as f:
            PROXIES = f.readlines()
            agent = random.choice(PROXIES)
            agent = agent.strip()
            request.headers["User-Agent"] = agent


class CookiesMiddleware(RetryMiddleware):
    """ Maintain Cookies """

    def process_request(self, request, spider):
        pass

    def process_response(self, request, response, spider):
        if response.status in [300, 301, 302, 303]:
            try:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response  # retry the request
            except Exception as e:
                raise IgnoreRequest
        elif response.status in [403, 414]:
            logger.error("%s! Stopping..." % response.status)
            os.system("pause")  # Windows-only: wait for manual intervention before continuing
            return response
        else:
            return response

4. Modify the Scrapy settings file

  • 【Modification notes】
  • robots.txt: set ROBOTSTXT_OBEY to False, so that JD's robots rules do not stop the crawler from fetching data
  • Maximum concurrent requests: set according to the actual performance of your machine
  • Downloader middleware priority: the smaller the value, the higher the priority
  • Item pipeline priority: the smaller the value, the higher the priority
  • Note: the settings file is too long to show in full; a minimal sketch of the entries above follows below
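
The full settings.py from the original project is not shown, so the sketch below only covers the entries described above. The module paths assume the project package is named JDSpider (as in the imports used elsewhere in this post), and the priority numbers and concurrency values are illustrative rather than the author's actual settings.

# settings.py (excerpt, illustrative values)
BOT_NAME = 'JDSpider'

ROBOTSTXT_OBEY = False              # ignore robots.txt so JD pages can be fetched

CONCURRENT_REQUESTS = 16            # tune to your machine's capacity
DOWNLOAD_DELAY = 0.5                # small delay to lower the chance of being blocked

DOWNLOADER_MIDDLEWARES = {
    # the smaller the value, the higher the priority
    'JDSpider.middlewares.UserAgentMiddleware': 400,
    'JDSpider.middlewares.CookiesMiddleware': 700,
}

ITEM_PIPELINES = {
    # the smaller the value, the higher the priority
    'JDSpider.pipelines.MongoDBPipeline': 300,
}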

5. Scraping the product categories

  • 【Category scraping notes】
  • Some categories contain many subcategories, so for those urls another request has to be yielded and crawled in turn (a worked example of the link parsing follows the code below)
texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
for text in texts:
    # get every third-level category link + third-level category name
    items = re.findall(r'<a href="(.*?)" target="_blank">(.*?)</a>', text)
    for item in items:
        # decide whether this category link needs a further request
        if item[0].split('.')[0][2:] in key_word:
            if item[0].split('.')[0][2:] != 'list':
                yield Request(url='https:' + item[0], callback=self.parse_category)
            else:
                # record the category: name / crawlable URL / id code
                categoriesItem = CategoriesItem()
                categoriesItem['name'] = item[1]
                categoriesItem['url'] = 'https:' + item[0]
                categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                yield categoriesItem
                meta = dict()
                meta["category"] = item[0].split("=")[1]
                yield Request(url='https:' + item[0], callback=self.parse_list, meta=meta)
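
As a worked example of the string handling above (the href value is hypothetical, but follows the usual shape of JD category links), the first split checks which host the link points to, and the second extracts the three category ids:

href = "//list.jd.com/list.html?cat=1318,1466,1694"   # hypothetical third-level category link

print(href.split('.')[0][2:])              # 'list'            -> a leaf list page, so a CategoriesItem is emitted
print(href.split('=')[1].split('&')[0])    # '1318,1466,1694'  -> [level-1 id, level-2 id, level-3 id]

# A different prefix that is still listed in key_word (e.g. 'channel' from "//channel.jd.com/...")
# is instead re-requested with callback=self.parse_category to reach its subcategories.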

6. Scraping the product information

  • 【Product scraping notes】
  • Flow: visit the url of each category, collect the url of every product on the listing page, then follow each one to the detail page and scrape the product details
  • Note: the URL that the pagination request actually hits has to be found by analysing the page, and its pattern used to page through the listing (a hedged guess at the list_url template used below follows the code)

【Extracting the product links】

selector = Selector(response)
texts = selector.xpath('//*[@id="J_goodsList"]/ul/li/div/div[@class="p-img"]/a').extract()
for text in texts:
    items = text.split("=")[3].split('"')[1]
    yield Request(url='https:' + items, callback=self.parse_product, meta=meta)

# pagination [only the first 50 pages]
maxPage = int(response.xpath('//div[@id="J_filter"]/div/div/span/i/text()').extract()[0])
if maxPage > 1:
    if maxPage > 50:
        maxPage = 50
    for i in range(2, maxPage + 1):
        num = 2*i - 1
        category = meta["category"].split(",")[0] + '%2C' + meta["category"].split(",")[1] + '%2C' + meta["category"].split(",")[2]
        url = list_url % (category, num, 30*num)
        print('products next page:', url)
        yield Request(url=url, callback=self.parse_list2, meta=meta)
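
list_url is not defined in the excerpt above. Judging from the three values it is formatted with (the %2C-encoded category ids, the page number num, and the item offset 30*num), it is presumably a template along the following lines; the exact query string is an assumption and should be taken from the real pagination request observed in the browser's developer tools:

# assumed list-page template: %s -> encoded category ids, %s -> page number, %s -> item offset
list_url = 'https://list.jd.com/list.html?cat=%s&page=%s&s=%s&click=0'

# e.g. UI page 2 of category 1318,1466,1694 maps to request page 3 with offset 90:
print(list_url % ('1318%2C1466%2C1694', 3, 90))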

7. Scraping the shop information

  • 【Shop scraping notes】
  • The shop information can be extracted from the same page used to scrape the product information
  • However, self-operated (京東自營) listings have to be distinguished from third-party shops, because self-operated listings lack some of the fields
# get the id of the shop selling the product + the shop information
shopItem["shopName"] = response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/text()').extract()[0]
shopItem["_id"] = "https:" + response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/@href').extract()[0]
productsItem['shopId'] = shopItem["_id"]
# distinguish self-operated (京東自營) listings from third-party shops
res = response.xpath('//div[@class="score-parts"]/div/span/em/@title').extract()
if len(res) == 0:
    shopItem["shopItemScore"] = "京東自營"
    shopItem["shopLgcScore"] = "京東自營"
    shopItem["shopAfterSale"] = "京東自營"
else:
    shopItem["shopItemScore"] = res[0]
    shopItem["shopLgcScore"] = res[1]
    shopItem["shopAfterSale"] = res[2]
    # shopItem["_id"] = response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/@href').extract()[0].split("-")[1].split(".")[0]
yield shopItem

8. Scraping the comment information

  • 【Comment scraping notes】
  • The comment data is also loaded dynamically; it comes back as json and is updated from time to time. The request format is as follows:
  • comment_url = 'https://club.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=5&page=%s&pageSize=10'


    def parse_comments(self, response):
        """
        Get the product comments
        :param response: json response returned by the comment endpoint
        :return:
        """
        try:
            data = json.loads(response.text)
        except Exception as e:
            print('get comment failed:', e)
            return None

        product_id = response.meta['product_id']

        # fetch the product comment summary [imported only once per product]
        commentSummaryItem = CommentSummaryItem()
        commentSummary = data.get('productCommentSummary')
        commentSummaryItem['_id'] = commentSummary.get('skuId')
        commentSummaryItem['productId'] = commentSummary.get('productId')
        commentSummaryItem['commentCount'] = commentSummary.get('commentCount')
        commentSummaryItem['score1Count'] = commentSummary.get('score1Count')
        commentSummaryItem['score2Count'] = commentSummary.get('score2Count')
        commentSummaryItem['score3Count'] = commentSummary.get('score3Count')
        commentSummaryItem['score4Count'] = commentSummary.get('score4Count')
        commentSummaryItem['score5Count'] = commentSummary.get('score5Count')

        # yield the summary item; the pipeline routes it to the right collection by type
        yield commentSummaryItem

        # product comments [first page; remaining pages are handled by parse_comments2, sketched after this function]
        for comment_item in data['comments']:
            comment = CommentItem()
            comment['_id'] = str(product_id)+","+str(comment_item.get("id"))
            comment['productId'] = product_id
            comment["guid"] = comment_item.get('guid')
            comment['firstCategory'] = comment_item.get('firstCategory')
            comment['secondCategory'] = comment_item.get('secondCategory')
            comment['thirdCategory'] = comment_item.get('thirdCategory')
            comment['score'] = comment_item.get('score')
            comment['nickname'] = comment_item.get('nickname')
            comment['plusAvailable'] = comment_item.get('plusAvailable')
            comment['content'] = comment_item.get('content')
            comment['creationTime'] = comment_item.get('creationTime')
            comment['replyCount'] = comment_item.get('replyCount')
            comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')
            comment['imageCount'] = comment_item.get('imageCount')
            yield comment

            # store the images attached to the current comment
            if 'images' in comment_item:
                for image in comment_item['images']:
                    commentImageItem = CommentImageItem()
                    commentImageItem['commentGuid'] = comment_item.get('guid')
                    commentImageItem['imgId'] = image.get('id')
                    commentImageItem['_id'] = str(product_id)+","+str(comment_item.get('id'))+","+str(image.get('id'))
                    commentImageItem['imgUrl'] = 'http:' + image.get('imgUrl')
                    commentImageItem['imgTitle'] = image.get('imgTitle')
                    commentImageItem['imgStatus'] = image.get('status')
                    yield commentImageItem

        # comment pagination [to collect enough ratings for each product]
        max_page = int(data.get('maxPage', '1'))
        # if max_page > 60:
        #     # cap the maximum number of comment pages
        #     max_page = 60
        for i in range(1, max_page):
            url = comment_url % (product_id, str(i))
            meta = dict()
            meta['product_id'] = product_id
            yield Request(url=url, callback=self.parse_comments2, meta=meta)
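
parse_comments2 is referenced above but not shown in the original post. Presumably it parses comment pages 2 onwards exactly like parse_comments, while skipping the summary and the pagination that were already handled on the first page; a minimal sketch under that assumption:

    def parse_comments2(self, response):
        """Comment pages 2..maxPage: per-comment parsing only (sketch; fills the same
        CommentItem/CommentImageItem fields as parse_comments, summary and paging omitted)."""
        try:
            data = json.loads(response.text)
        except Exception as e:
            print('get comment failed:', e)
            return None

        product_id = response.meta['product_id']
        for comment_item in data['comments']:
            comment = CommentItem()
            comment['_id'] = str(product_id) + "," + str(comment_item.get("id"))
            comment['productId'] = product_id
            comment['guid'] = comment_item.get('guid')
            comment['score'] = comment_item.get('score')
            comment['nickname'] = comment_item.get('nickname')
            comment['content'] = comment_item.get('content')
            comment['creationTime'] = comment_item.get('creationTime')
            # ...remaining CommentItem fields and CommentImageItem handling as in parse_comments
            yield comment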

9. The crawling process

10. Sample of the collected data

If you need the data, feel free to get in touch; the dataset is very large.

 
