scrapy-redis: resumable crawling, persistent spiders and URL deduplication, applied to crawling JD books

scrapy

Scrapy is a framework designed specifically for Python crawlers. It handles concurrency and asynchronous requests for you. You can of course write a concurrent crawler without Scrapy, but it is tedious and the result tends to be clunky, whereas Scrapy makes concurrent crawling straightforward and adds many other powerful features. If you are new to it, the Chinese translation of the documentation is a good place to start: https://yiyibooks.cn/zomin/Scrapy15/index.html

scrapy-redis

scrapy-redis is a component built on top of Scrapy. It provides resumable crawling, URL deduplication, persistent spiders and distributed crawling, and it is very easy to use. First download the source code from GitHub: https://github.com/rmax/scrapy-redis. Only three main files in it really matter.
The part we need this time is the scrapy_redis package itself. Below I demonstrate using scrapy-redis to crawl JD book information.

Creating the Scrapy project

1. Open a terminal (cmd on Windows) in the directory where you want the project to live, then create the project by running scrapy startproject jingdong, where jingdong is the project name.

2. cd into the project and generate the spider file by running scrapy genspider jdbook jd.com, where jdbook is the spider name and jd.com is the allowed domain. Always give the spider a domain restriction so it does not wander off to unrelated URLs.

3. Copy the scrapy_redis package from the source you downloaded on GitHub into the project, because the settings below refer to classes inside it (a possible layout is shown right after this list).
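One layout that makes the scrapy_redis.* paths in the settings importable is to put the copied package next to scrapy.cfg, i.e. in the directory you run the scrapy commands from (this is an assumption about where you keep it; adjust if yours differs):

jingdong/                  <- created by "scrapy startproject jingdong"
    scrapy.cfg
    scrapy_redis/          <- package copied from the downloaded scrapy-redis source
    jingdong/
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            jdbook.py      <- created by "scrapy genspider jdbook jd.com"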

Configuring settings.py

1. The first thing to change is ROBOTSTXT_OBEY: set it to False (the generated settings has it as True). If it stays True, Scrapy fetches robots.txt from the site's root before crawling and obeys its rules, which can keep the spider from ever requesting the pages we actually want (the value is included in the snippet under step 2).
2. Mimic a real browser by setting USER_AGENT and some default request headers.
These settings are commented out in the generated settings.py; uncomment them and fill in values copied from your browser (see the snippet below). We do not need cookies here. If you ever need cookies or proxy IPs, they belong in the downloader middleware, i.e. middlewares.py. For JD books none of that is necessary, because JD's anti-crawling measures on these pages are not particularly strict, but you can add cookies and proxies yourself if you want.
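For reference, here is what those lines look like once uncommented; the values are the same ones used in the full settings.py at the end of this post, and the UA string is only an example, so copy the one from your own browser:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
  'Accept-Language': 'en',
}

ROBOTSTXT_OBEY = False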

3. Configure the Redis connection and the persistence-related settings:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  #定義一個去重的類,用來將url去重
SCHEDULER = "scrapy_redis.scheduler.Scheduler"   #指定隊列
SCHEDULER_PERSIST = True  #將程序持久化保存
REDIS_URL = "redis://127.0.0.1:6379"

DUPEFILTER_CLASS is the deduplication class: it records every URL that has been requested so the same URL is never fetched twice. The rule is roughly this: the URL is hashed into a fixed-length string, its fingerprint, which is stored in a Redis set. When a new URL comes along, its fingerprint is computed the same way; if that fingerprint is already in the set the URL is skipped, otherwise it is requested and its fingerprint is added to the set.
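A much simplified sketch of that idea, assuming a local Redis and the redis-py package (the real scrapy_redis RFPDupeFilter fingerprints the whole request, not just the URL, but the principle is the same):

import hashlib
import redis

r = redis.from_url("redis://127.0.0.1:6379")

def seen_before(url):
    # hash the URL into a fixed-length fingerprint
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # SADD returns 1 if the fingerprint was new, 0 if it was already in the set
    return r.sadd("jdbook:dupefilter", fp) == 0

print(seen_before("https://book.jd.com/booksort.html"))  # False the first time
print(seen_before("https://book.jd.com/booksort.html"))  # True the second time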

SCHEDULER replaces the default scheduler with a Redis-backed queue. Every URL to be visited becomes a request object that is pushed into the queue, and the engine then pops requests back out one by one and downloads them concurrently; this queue is what drives the whole crawl.
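As an optional extra (not something this project changes), the copied scrapy_redis package also reads a SCHEDULER_QUEUE_CLASS setting that controls the order in which requests are popped; to my knowledge the default is the priority queue:

# optional: pick the queue type used by the scrapy-redis scheduler
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"   # default
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"     # breadth-first style
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.LifoQueue"     # depth-first style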

SCHEDULER_PERSIST makes the crawl persistent. Without this setting, the request queue and the fingerprint set in Redis are flushed when the spider finishes, so persistence and the ability to resume an interrupted crawl are lost. With SCHEDULER_PERSIST = True they are kept.

REDIS_URL is simply the address of the Redis server.
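If your Redis is not a plain local instance, the URL can also carry a password and a database number; the host and password below are placeholders:

REDIS_URL = "redis://127.0.0.1:6379"                      # local Redis, default database
# REDIS_URL = "redis://:mypassword@192.168.1.20:6379/0"   # placeholder password/host, database 0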

4. To actually save the data, the item pipeline has to be enabled as well:

ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 300,
}

Analysing the site

First open the book category page:
https://book.jd.com/booksort.html

Each top-level book category has many sub-categories under it. Press F12 and check whether the data we need is returned by this URL.
It is: the response contains the name and URL of every top-level category and of every sub-category.

Extracting the top-level and sub-categories

First extract the names of the top-level categories:

all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # the dt nodes are the top-level categories
for i in all_list:
    item = JingdongItem()
    item['all_class'] = i.xpath("./a/text()").extract_first()  # top-level category name

The sub-categories sit in the dd node that immediately follows each dt, i.e. in a sibling node, so they can be reached with the following-sibling axis.
Now extract the sub-category names and URLs:

class_list = i.xpath("./following-sibling::dd[1]/em")  # following-sibling axis: the dd right after this dt holds the sub-categories
for j in class_list:
    item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
    item["next_class_url"] = "https:" + "".join(j.xpath("./a/@href").extract_first())  # sub-category URL

Following the sub-category URLs and extracting part of the book info

Open one of the sub-category listing pages and see how much of the data we need is on it.
Then look at the page source to make sure the data is really in the HTML.
It is: the listing contains the book name and URL, the author, the publisher name, the store name and URL, and the publication date. All of these are wanted, so extract them first; the price and the comment count are not here and will be fetched separately below.

    def next_parse(self,response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            try:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
            except:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] ="https:"  + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            try:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
            except:
                item["publish_time"] = i.xpath(
                    "./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()

Getting the book price and comment count

First we have to work out where the price and the comment count come from. They are not in the HTML of the book pages; they are loaded as JSON generated by JavaScript, so we need to find the request that produces them. Digging through the network requests shows the price comes from this URL: https://p.3.cn/prices/mgets?type=1&skuIds=J_11892005346&pdtk=&pduid=1551774170386597393748&pdpin=&pdbp=0&callback=jQuery6622062&_=1560704913535
Most of those parameters turn out to be unnecessary, and the URL can be trimmed down to https://p.3.cn/prices/mgets?skuIds=J_11892005346
To get the price of any book, just replace the number after J_ with that book's sku id. The sku id is available on the sub-category listing page (the data-sku attribute the spider extracts), so once it is pulled out we can request the corresponding price. The response contains both the original price and the current selling price.
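You can sanity-check the endpoint outside Scrapy with a short script. This is only a sketch and assumes the endpoint still returns a JSON list with one object per sku, where 'p' is the selling price and 'm' the original price, which is how the callback below reads it:

import json
import requests

sku = "11892005346"   # the data-sku attribute of one book on the listing page
resp = requests.get("https://p.3.cn/prices/mgets?skuIds=J_{}".format(sku))
data = json.loads(resp.text)        # a list with one dict per requested sku
print(data[0]["m"], data[0]["p"])   # original price, selling price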

    def parse_dateli(self,response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']

Extracting the comment count

The comment count is served as JSON in the same way, so its endpoint can be located exactly as we located the price endpoint; I will not walk through it again (feel free to ask me if it is unclear). The spider below ends up calling https://club.jd.com/comment/productCommentSummaries.action?referenceIds=<sku id>.
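For a quick standalone check of that endpoint (the sku id is a placeholder; the JSON is read the same way the parse_comment callback below does):

import json
import requests

sku = "11892005346"   # placeholder sku id
url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(sku)
data = json.loads(requests.get(url).text)
print(data["CommentsCount"][0]["CommentCount"])   # total number of comments for this book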

    def parse_comment(self,response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']

Setting up items

items.py defines the fields each scraped item can carry, and it is very simple to use: declare one scrapy.Field() for every piece of data you want to collect.

import scrapy


class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class=scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()

Then import the JingdongItem class from this file inside the spider and create an instance to hold the data, for example:

item = JingdongItem()

Remember that this is done inside the spider file; the full source code below shows exactly where.

Setting up pipelines

pipelines.py is where the scraped data gets saved. It could be written to MySQL, MongoDB or a CSV file; here I write it to a CSV file.

import pandas as pd

class JingdongPipeline(object):
    def open_spider(self,spider):
        print("spider started")
    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))   # one item dict -> one DataFrame row
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append to the CSV file; header=0 and index=False keep repeated headers and index columns out of the file
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=0, encoding="utf_8_sig")
        print("row written")
        return item
    def close_spider(self,spider):
        print("spider finished")

Notes

Two things to watch out for:
1. Domain restrictions: the sub-category pages, the price endpoint and the comment endpoint live on different domains, so all of them have to be added to allowed_domains (in this project: 'jd.com', 'p.3.cn' and 'club.jd.com', as in the spider source below).

2. Using yield scrapy.Request: the meta argument is what passes the partially-filled item on to the next callback, for example:

yield scrapy.Request(url=item["next_class_url"],callback=self.next_parse,meta={"item":deepcopy(item)})

This uses deepcopy from the copy module. The item must be deep-copied before being passed on, otherwise the data gets mixed up: Scrapy schedules requests asynchronously, so the loop may already be filling the item for the third book while the request carrying the first one has not been processed yet, and every request would end up pointing at the same, constantly-overwritten item. Copying the item first avoids this; the small demonstration below shows the effect.
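A tiny pure-Python illustration of the problem (nothing Scrapy-specific, just shared mutable state):

from copy import deepcopy

item = {}
collected = []
for name in ["novel", "history", "biography"]:
    item["next_class_name"] = name
    collected.append(item)              # every entry points to the SAME dict
    # collected.append(deepcopy(item))  # with deepcopy each entry keeps its own value

print(collected)
# without deepcopy all three entries end up showing "biography";
# with deepcopy they show "novel", "history" and "biography" as intended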

Spider source code

import scrapy
from copy import deepcopy
import re
import json
from jingdong.items import JingdongItem
class JdbookSpider(scrapy.Spider):
    name = 'jdbook'
    allowed_domains = ['jd.com','p.3.cn','club.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # top-level categories
        for i in all_list:
            item = JingdongItem()
            item['all_class'] = i.xpath("./a/text()").extract_first()
            class_list = i.xpath("./following-sibling::dd[1]/em")  # following-sibling axis: the dd right after this dt holds the sub-categories
            for j in class_list:
                item["next_class_name"] = j.xpath("./a/text()").extract_first()  #獲取小分類的名字
                item["next_class_url"] = "https:" + "".join(j.xpath("./a/@href").extract_first())
                yield scrapy.Request(url=item["next_class_url"],callback=self.next_parse,meta={"item":deepcopy(item)})

    def next_parse(self,response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            try:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
            except:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] ="https:"  + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            try:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
            except:
                item["publish_time"] = i.xpath(
                    "./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku),callback=self.parse_dateli,meta={"item":deepcopy(item)})
        next_page_url = "https://list.jd.com"+ "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        judge = "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        if judge is not None:
            yield scrapy.Request(url=next_page_url,callback=self.next_parse,meta={"item":deepcopy(item)})
    def parse_dateli(self,response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']
        id = js[0]["id"]
        id = "".join(re.findall(r'\d\.*\d*',id))
        yield scrapy.Request(url="https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(id),callback=self.parse_comment,meta={"item":deepcopy(item)})

    def parse_comment(self,response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']
        yield deepcopy(item)
        # print(deepcopy(item))

settings.py source code

# -*- coding: utf-8 -*-

# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jingdong'

SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  #定義一個去重的類,用來將url去重
SCHEDULER = "scrapy_redis.scheduler.Scheduler"   #指定隊列
SCHEDULER_PERSIST = True  #將程序持久化保存

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = "DEBUG"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

REDIS_URL = "redis://127.0.0.1:6379"   #要把數據寫入redis數據庫還要添加這個參數

items.py source code

import scrapy


class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class=scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()

pipelines.py source code

import pandas as pd

class JingdongPipeline(object):
    def open_spider(self,spider):
        print("spider started")
    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))   # one item dict -> one DataFrame row
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append to the CSV file; header=0 and index=False keep repeated headers and index columns out of the file
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=0, encoding="utf_8_sig")
        print("row written")
        return item
    def close_spider(self,spider):
        print("spider finished")

If anything is unclear you can reach me on QQ: 1693490575
