1. Define the storage structures for the scraped data
- 【Item notes】
- class CategoriesItem(Item): JD category information
- class ProductsItem(Item): JD product information
- class ShopItem(Item): JD shop information
- class CommentSummaryItem(Item): review summary of each JD product
- class CommentItem(Item): basic information of each review of each JD product
- class CommentImageItem(Item): images attached to each review of each JD product
- Note: the fields defined in each class can be adjusted to the actual crawling requirements or to the response content
【items.py】
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field


class CategoriesItem(Item):
    """JD category information"""
    name = Field()            # third-level category name
    url = Field()             # url of the third-level category
    _id = Field()             # category id [level-1 id, level-2 id, level-3 id]


class ProductsItem(Item):
    """JD product information"""
    name = Field()            # product name
    url = Field()             # product url (also used to extract the main image)
    _id = Field()             # product sku
    category = Field()        # third-level category of the product
    description = Field()     # product description
    shopId = Field()          # id (name) of the shop selling the product
    commentCount = Field()    # total number of reviews = CommentSummaryItem.commentCount
    # goodComment = Field()      # number of positive reviews
    # generalComment = Field()   # number of neutral reviews
    # poolComment = Field()      # number of negative reviews
    # favourableDesc1 = Field()  # promotion description 1
    # favourableDesc2 = Field()  # promotion description 2
    # venderId = Field()         # vendor id
    # reallyPrice = Field()      # current price
    # originalPrice = Field()    # original price


class ShopItem(Item):
    """JD shop information"""
    _id = Field()             # shop url
    shopName = Field()        # shop name
    shopItemScore = Field()   # shop score: product ratings
    shopLgcScore = Field()    # shop score: logistics fulfilment
    shopAfterSale = Field()   # shop score: after-sales service


class CommentItem(Item):
    """Basic information of each product review"""
    _id = Field()             # review id
    productId = Field()       # product id = sku
    guid = Field()            # globally unique identifier of the review
    firstCategory = Field()   # first-level category of the product
    secondCategory = Field()  # second-level category of the product
    thirdCategory = Field()   # third-level category of the product
    score = Field()           # user rating
    nickname = Field()        # user nickname
    plusAvailable = Field()   # account level (201: PLUS member, 103: ordinary user, 0: low-value account)
    content = Field()         # review text
    creationTime = Field()    # review time
    replyCount = Field()      # number of replies to the review
    usefulVoteCount = Field() # number of upvotes the review received
    imageCount = Field()      # number of images in the review


class CommentImageItem(Item):
    """Images attached to each review of each product"""
    _id = Field()             # image id (one id per image)
    commentGuid = Field()     # guid of the review the image belongs to
    imgId = Field()           # image id
    imgUrl = Field()          # image url
    imgTitle = Field()        # image title
    imgStatus = Field()       # image status


class CommentSummaryItem(Item):
    """Review summary of a product"""
    _id = Field()             # product sku
    productId = Field()       # product pid
    commentCount = Field()    # cumulative number of reviews
    score1Count = Field()     # number of 1-star ratings
    score2Count = Field()     # number of 2-star ratings
    score3Count = Field()     # number of 3-star ratings
    score4Count = Field()     # number of 4-star ratings
    score5Count = Field()     # number of 5-star ratings
2. Define the pipeline
- 【Pipeline notes】
- Database: MongoDB
- Database name: JD
- Collections: Categories, Products, Shop, CommentSummary, Comment and CommentImage
- Processing: first match the item type against the target collection, then insert; inserting a duplicate raises an exception
【pipelines.py】
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from JDSpider.items import *


class MongoDBPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["JD"]
        self.Categories = db["Categories"]
        self.Products = db["Products"]
        self.Shop = db["Shop"]
        self.Comment = db["Comment"]
        self.CommentImage = db["CommentImage"]
        self.CommentSummary = db["CommentSummary"]

    def process_item(self, item, spider):
        """Match the item type and insert into the corresponding collection;
        a duplicate _id makes insert_one raise, which is caught and logged."""
        if isinstance(item, CategoriesItem):
            try:
                self.Categories.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, ProductsItem):
            try:
                self.Products.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, ShopItem):
            try:
                self.Shop.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentItem):
            try:
                self.Comment.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentImageItem):
            try:
                self.CommentImage.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        elif isinstance(item, CommentSummaryItem):
            try:
                self.CommentSummary.insert_one(dict(item))
            except Exception as e:
                print('insert failed:', e)
        return item
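Every branch of process_item above performs the same try/insert, and each collection is simply the item's class name minus the `Item` suffix. As an illustration (not the author's code), the dispatch can be collapsed into one naming rule; dict subclasses stand in for the Scrapy Items so the rule can be checked without a database:

```python
def collection_name(item):
    """Map an item instance to its MongoDB collection name:
    CategoriesItem -> 'Categories', CommentImageItem -> 'CommentImage'."""
    cls = type(item).__name__
    return cls[:-len('Item')] if cls.endswith('Item') else cls

# stand-ins for the Scrapy Items defined in items.py (a dict subclass
# is enough to demonstrate the naming rule)
class CategoriesItem(dict):
    pass

class CommentImageItem(dict):
    pass
```

With this helper, the whole chain reduces to one lookup, e.g. `getattr(self, collection_name(item)).insert_one(dict(item))` wrapped in the same try/except.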
3. Define the middlewares
- 【Middleware notes】
- Two middlewares are defined: a request-identity middleware and a cookie/retry middleware
- UserAgentMiddleware: varies the identity of each request so that consecutive requests are not detected and blacklisted by JD's backend
- CookiesMiddleware: inspects the status of JD's server response and reacts accordingly (retry, ignore, or pause)
【middlewares.py】
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html
import os
import random
import logging
from scrapy.exceptions import IgnoreRequest
from scrapy.utils.response import response_status_message
from scrapy.downloadermiddlewares.retry import RetryMiddleware

logger = logging.getLogger(__name__)


class UserAgentMiddleware(object):
    """Rotate the User-Agent header"""
    def process_request(self, request, spider):
        """Pick a random line from a local file and use it as the User-Agent"""
        with open("E://proxy.txt", "r") as f:
            PROXIES = f.readlines()
        agent = random.choice(PROXIES).strip()
        request.headers["User-Agent"] = agent


class CookiesMiddleware(RetryMiddleware):
    """Maintain cookies / react to the server's response status"""
    def process_request(self, request, spider):
        pass

    def process_response(self, request, response, spider):
        if response.status in [300, 301, 302, 303]:
            try:
                reason = response_status_message(response.status)
                return self._retry(request, reason, spider) or response  # retry
            except Exception:
                raise IgnoreRequest
        elif response.status in [403, 414]:
            # blocked (403) or URL too long (414): pause for manual intervention
            logger.error("%s! Stopping..." % response.status)
            os.system("pause")
            return response
        else:
            return response
4. Adjust the Scrapy settings
- 【Settings notes】
- robots.txt: set ROBOTSTXT_OBEY to False, so that JD's robots.txt does not stop the crawl
- Maximum concurrent requests: set according to the machine's actual capacity
- Downloader middleware priority: the smaller the value, the higher the priority
- Pipeline priority: the smaller the value, the higher the priority
- Note: the settings file is long, so it is not reproduced in full here
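The notes above correspond to a handful of settings.py entries. A sketch of what they might look like (class paths follow the JDSpider package layout used earlier; the priority and concurrency values are illustrative, not the author's):

```python
# settings.py (excerpt; values are illustrative)
ROBOTSTXT_OBEY = False        # do not let JD's robots.txt stop the crawl

CONCURRENT_REQUESTS = 16      # tune to the machine's actual capacity

DOWNLOADER_MIDDLEWARES = {    # smaller value = higher priority
    'JDSpider.middlewares.UserAgentMiddleware': 400,
    'JDSpider.middlewares.CookiesMiddleware': 450,
}

ITEM_PIPELINES = {
    'JDSpider.pipelines.MongoDBPipeline': 300,
}
```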
5. Crawl the product categories
- 【Category crawling notes】
- Some categories contain many subcategories, so those URLs have to be yielded and crawled a second time
texts = selector.xpath('//div[@class="category-item m"]/div[@class="mc"]/div[@class="items"]/dl/dd/a').extract()
for text in texts:
    # collect every third-level category link and name
    items = re.findall(r'<a href="(.*?)" target="_blank">(.*?)</a>', text)
    for item in items:
        # decide whether this link needs a further request
        if item[0].split('.')[0][2:] in key_word:
            if item[0].split('.')[0][2:] != 'list':
                # a channel page that still contains subcategories: crawl it again
                yield Request(url='https:' + item[0], callback=self.parse_category)
            else:
                # record the third-level category: name / crawlable url / id
                categoriesItem = CategoriesItem()
                categoriesItem['name'] = item[1]
                categoriesItem['url'] = 'https:' + item[0]
                categoriesItem['_id'] = item[0].split('=')[1].split('&')[0]
                yield categoriesItem
                meta = dict()
                meta["category"] = item[0].split("=")[1]
                yield Request(url='https:' + item[0], callback=self.parse_list, meta=meta)
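The regex and the id extraction above can be exercised in isolation; the anchor tag below is a made-up example of the pattern found on the category page:

```python
import re

# hypothetical third-level category link as it appears in the page source
text = '<a href="//list.jd.com/list.html?cat=1713,3258,3297" target="_blank">科幻小說</a>'
items = re.findall(r'<a href="(.*?)" target="_blank">(.*?)</a>', text)
url, name = items[0]

prefix = url.split('.')[0][2:]            # '//list' -> 'list' (the key_word check)
cat_id = url.split('=')[1].split('&')[0]  # '1713,3258,3297' (the _id field)
```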
6. Crawl the product information
- 【Product crawling notes】
- Flow: visit each category url, collect the url of every product on the listing page, then follow it to the detail page and scrape the product details
- Note: the response address behind the "next page" request has to be worked out by inspecting the page, and its pattern used to paginate
【Collecting the product links】
selector = Selector(response)
texts = selector.xpath('//*[@id="J_goodsList"]/ul/li/div/div[@class="p-img"]/a').extract()
for text in texts:
    items = text.split("=")[3].split('"')[1]
    yield Request(url='https:' + items, callback=self.parse_product, meta=meta)
# pagination (only the first 50 pages)
maxPage = int(response.xpath('//div[@id="J_filter"]/div/div/span/i/text()').extract()[0])
if maxPage > 1:
    if maxPage > 50:
        maxPage = 50
    for i in range(2, maxPage):
        num = 2 * i - 1
        category = meta["category"].split(",")[0] + '%2C' + meta["category"].split(",")[1] + '%2C' + meta["category"].split(",")[2]
        url = list_url % (category, num, 30 * num)
        print('products next page:', url)
        yield Request(url=url, callback=self.parse_list2, meta=meta)
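`list_url` is not shown in the excerpt above; assuming a template of the shape below (cat / page / s parameters as used by JD listing pages), the page arithmetic can be packaged and checked on its own:

```python
# assumed template; the real list_url is defined elsewhere in the spider
list_url = 'https://list.jd.com/list.html?cat=%s&page=%s&s=%s'

def list_page_url(category, i):
    """Build the request url for loop index i; JD serves half-pages,
    hence page = 2*i - 1 and s = 30 * page."""
    num = 2 * i - 1
    cat = '%2C'.join(category.split(','))  # same encoding as the snippet above
    return list_url % (cat, num, 30 * num)
```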
7. Crawl the shop information
- 【Shop crawling notes】
- The shop information can be collected on the same page as the product information
- However, self-operated (京東自營) shops must be told apart from third-party shops, because self-operated pages lack some of the content
# shop id + shop information, taken from the product page
shopItem["shopName"] = response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/text()').extract()[0]
shopItem["_id"] = "https:" + response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/@href').extract()[0]
productsItem['shopId'] = shopItem["_id"]
# distinguish JD self-operated shops: they have no score block
res = response.xpath('//div[@class="score-parts"]/div/span/em/@title').extract()
if len(res) == 0:
    shopItem["shopItemScore"] = "京東自營"   # JD self-operated
    shopItem["shopLgcScore"] = "京東自營"
    shopItem["shopAfterSale"] = "京東自營"
else:
    shopItem["shopItemScore"] = res[0]
    shopItem["shopLgcScore"] = res[1]
    shopItem["shopAfterSale"] = res[2]
# shopItem["_id"] = response.xpath('//div[@class="m m-aside popbox"]/div/div/h3/a/@href').extract()[0].split("-")[1].split(".")[0]
yield shopItem
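The commented-out `_id` line pulls the numeric shop id out of the shop href; on a typical href of that shape (the value below is invented), it works like this:

```python
# hypothetical shop link as found in the product page's aside box
href = '//mall.jd.com/index-1000004123.html'
shop_id = href.split('-')[1].split('.')[0]  # keep only the digits between '-' and '.html'
```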
8. Crawl the review information
- 【Review crawling notes】
- The reviews are also loaded dynamically and returned as JSON, which JD updates from time to time; the request format is:
comment_url = 'https://club.jd.com/comment/productPageComments.action?productId=%s&score=0&sortType=5&page=%s&pageSize=10'
def parse_comments(self, response):
    """
    Parse the product reviews.
    :param response: the JSON response of the comment request
    :return:
    """
    try:
        data = json.loads(response.text)
    except Exception as e:
        print('get comment failed:', e)
        return None
    product_id = response.meta['product_id']
    # review summary of the product (inserted only once)
    commentSummaryItem = CommentSummaryItem()
    commentSummary = data.get('productCommentSummary')
    commentSummaryItem['_id'] = commentSummary.get('skuId')
    commentSummaryItem['productId'] = commentSummary.get('productId')
    commentSummaryItem['commentCount'] = commentSummary.get('commentCount')
    commentSummaryItem['score1Count'] = commentSummary.get('score1Count')
    commentSummaryItem['score2Count'] = commentSummary.get('score2Count')
    commentSummaryItem['score3Count'] = commentSummary.get('score3Count')
    commentSummaryItem['score4Count'] = commentSummary.get('score4Count')
    commentSummaryItem['score5Count'] = commentSummary.get('score5Count')
    # the pipeline dispatches on the item type
    yield commentSummaryItem
    # reviews on the first page (remaining pages are handled by parse_comments2)
    for comment_item in data['comments']:
        comment = CommentItem()
        comment['_id'] = str(product_id) + "," + str(comment_item.get("id"))
        comment['productId'] = product_id
        comment["guid"] = comment_item.get('guid')
        comment['firstCategory'] = comment_item.get('firstCategory')
        comment['secondCategory'] = comment_item.get('secondCategory')
        comment['thirdCategory'] = comment_item.get('thirdCategory')
        comment['score'] = comment_item.get('score')
        comment['nickname'] = comment_item.get('nickname')
        comment['plusAvailable'] = comment_item.get('plusAvailable')
        comment['content'] = comment_item.get('content')
        comment['creationTime'] = comment_item.get('creationTime')
        comment['replyCount'] = comment_item.get('replyCount')
        comment['usefulVoteCount'] = comment_item.get('usefulVoteCount')
        comment['imageCount'] = comment_item.get('imageCount')
        yield comment
        # store the images attached to the current review
        if 'images' in comment_item:
            for image in comment_item['images']:
                commentImageItem = CommentImageItem()
                commentImageItem['commentGuid'] = comment_item.get('guid')
                commentImageItem['imgId'] = image.get('id')
                commentImageItem['_id'] = str(product_id) + "," + str(comment_item.get('id')) + "," + str(image.get('id'))
                commentImageItem['imgUrl'] = 'http:' + image.get('imgUrl')
                commentImageItem['imgTitle'] = image.get('imgTitle')
                commentImageItem['imgStatus'] = image.get('status')
                yield commentImageItem
    # comment pagination (try to collect enough ratings)
    max_page = int(data.get('maxPage', '1'))
    # if max_page > 60:
    #     # cap the number of comment pages
    #     max_page = 60
    for i in range(1, max_page):
        url = comment_url % (product_id, str(i))
        meta = dict()
        meta['product_id'] = product_id
        yield Request(url=url, callback=self.parse_comments2, meta=meta)
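To make the JSON shape that parse_comments consumes concrete, here is a stand-alone reduction of the same field lookups, fed with a tiny hand-made payload (real responses carry many more keys than this):

```python
import json

def summarize(raw):
    """Pull the summary counts and the per-review fields that the
    spider stores out of a productPageComments.action payload."""
    data = json.loads(raw)
    s = data.get('productCommentSummary', {})
    counts = {k: s.get(k) for k in ('commentCount', 'score1Count', 'score5Count')}
    reviews = [{'id': c.get('id'), 'score': c.get('score'), 'content': c.get('content')}
               for c in data.get('comments', [])]
    return counts, reviews

# hand-made sample payload mimicking the real response structure
sample = json.dumps({
    'productCommentSummary': {'commentCount': 3, 'score1Count': 1, 'score5Count': 2},
    'comments': [{'id': 101, 'score': 5, 'content': 'good'}],
})
counts, reviews = summarize(sample)
```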
9. The crawl in action
10. A look at the collected data
If you need the data, feel free to get in touch; the dataset is very large.