scrapy-redis: resumable crawling, persistent spiders and URL deduplication, applied to crawling JD books

scrapy

Scrapy is a framework designed specifically for Python crawlers. It handles concurrency and asynchronous requests for you. You can of course write a concurrent crawler without Scrapy, but it is tedious and the result tends to be clunky, whereas Scrapy makes concurrent crawling straightforward and adds many other powerful features. If you are new to it, the Chinese translation of the documentation is a good place to start: https://yiyibooks.cn/zomin/Scrapy15/index.html

scrapy-redis

scrapy-redis is a component built on top of Scrapy. It provides resumable crawling, URL deduplication, persistent spiders and distributed crawling, and it is very easy to use. First download the source code from GitHub: https://github.com/rmax/scrapy-redis. Only three main files in it really matter.
The part we need this time is the scrapy_redis package itself. Below I demonstrate using scrapy-redis to crawl JD book information.

Creating the Scrapy project

1. Open a terminal (cmd on Windows) in the directory where you want the project to live, then create the project by running scrapy startproject jingdong, where jingdong is the project name.

2. cd into the project and generate the spider file by running scrapy genspider jdbook jd.com, where jdbook is the spider name and jd.com is the allowed domain. Always give the spider a domain restriction so it does not wander off to unrelated URLs.

3. Copy the scrapy_redis package from the source you downloaded on GitHub into the project, because the settings below refer to classes inside it (a possible layout is shown right after this list).
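One layout that makes the scrapy_redis.* paths in the settings importable is to put the copied package next to scrapy.cfg, i.e. in the directory you run the scrapy commands from (this is an assumption about where you keep it; adjust if yours differs):

jingdong/                  <- created by "scrapy startproject jingdong"
    scrapy.cfg
    scrapy_redis/          <- package copied from the downloaded scrapy-redis source
    jingdong/
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            jdbook.py      <- created by "scrapy genspider jdbook jd.com"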

Configuring settings.py

1. The first thing to change is ROBOTSTXT_OBEY: set it to False (the generated settings has it as True). If it stays True, Scrapy fetches robots.txt from the site's root before crawling and obeys its rules, which can keep the spider from ever requesting the pages we actually want (the value is included in the snippet under step 2).
2. Mimic a real browser by setting USER_AGENT and some default request headers.
These settings are commented out in the generated settings.py; uncomment them and fill in values copied from your browser (see the snippet below). We do not need cookies here. If you ever need cookies or proxy IPs, they belong in the downloader middleware, i.e. middlewares.py. For JD books none of that is necessary, because JD's anti-crawling measures on these pages are not particularly strict, but you can add cookies and proxies yourself if you want.
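For reference, here is what those lines look like once uncommented; the values are the same ones used in the full settings.py at the end of this post, and the UA string is only an example, so copy the one from your own browser:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'

DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
  'Accept-Language': 'en',
}

ROBOTSTXT_OBEY = False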

3. Configure the Redis connection and the persistence-related settings:

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  #定義一個去重的類,用來將url去重
SCHEDULER = "scrapy_redis.scheduler.Scheduler"   #指定隊列
SCHEDULER_PERSIST = True  #將程序持久化保存
REDIS_URL = "redis://127.0.0.1:6379"

DUPEFILTER_CLASS is the deduplication class: it records every URL that has been requested so the same URL is never fetched twice. The rule is roughly this: the URL is hashed into a fixed-length string, its fingerprint, which is stored in a Redis set. When a new URL comes along, its fingerprint is computed the same way; if that fingerprint is already in the set the URL is skipped, otherwise it is requested and its fingerprint is added to the set.
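A much simplified sketch of that idea, assuming a local Redis and the redis-py package (the real scrapy_redis RFPDupeFilter fingerprints the whole request, not just the URL, but the principle is the same):

import hashlib
import redis

r = redis.from_url("redis://127.0.0.1:6379")

def seen_before(url):
    # hash the URL into a fixed-length fingerprint
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
    # SADD returns 1 if the fingerprint was new, 0 if it was already in the set
    return r.sadd("jdbook:dupefilter", fp) == 0

print(seen_before("https://book.jd.com/booksort.html"))  # False the first time
print(seen_before("https://book.jd.com/booksort.html"))  # True the second time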

SCHEDULER replaces the default scheduler with a Redis-backed queue. Every URL to be visited becomes a request object that is pushed into the queue, and the engine then pops requests back out one by one and downloads them concurrently; this queue is what drives the whole crawl.
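As an optional extra (not something this project changes), the copied scrapy_redis package also reads a SCHEDULER_QUEUE_CLASS setting that controls the order in which requests are popped; to my knowledge the default is the priority queue:

# optional: pick the queue type used by the scrapy-redis scheduler
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.PriorityQueue"   # default
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.FifoQueue"     # breadth-first style
# SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.LifoQueue"     # depth-first style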

SCHEDULER_PERSIST makes the crawl persistent. Without this setting, the request queue and the fingerprint set in Redis are flushed when the spider finishes, so persistence and the ability to resume an interrupted crawl are lost. With SCHEDULER_PERSIST = True they are kept.

REDIS_URL is simply the address of the Redis server.
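If your Redis is not a plain local instance, the URL can also carry a password and a database number; the host and password below are placeholders:

REDIS_URL = "redis://127.0.0.1:6379"                      # local Redis, default database
# REDIS_URL = "redis://:mypassword@192.168.1.20:6379/0"   # placeholder password/host, database 0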

4. To actually save the data, the item pipeline has to be enabled as well:

ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 300,
}

Analysing the site

First open the book category page:
https://book.jd.com/booksort.html

Each top-level book category has many sub-categories under it. Press F12 and check whether the data we need is returned by this URL.
It is: the response contains the name and URL of every top-level category and of every sub-category.

Extracting the top-level and sub-categories

First extract the names of the top-level categories:

all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # the dt nodes are the top-level categories
for i in all_list:
    item = JingdongItem()
    item['all_class'] = i.xpath("./a/text()").extract_first()  # top-level category name

The sub-categories sit in the dd node that immediately follows each dt, i.e. in a sibling node, so they can be reached with the following-sibling axis.
Now extract the sub-category names and URLs:

class_list = i.xpath("./following-sibling::dd[1]/em")  # following-sibling axis: the dd right after this dt holds the sub-categories
for j in class_list:
    item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
    item["next_class_url"] = "https:" + "".join(j.xpath("./a/@href").extract_first())  # sub-category URL

Following the sub-category URLs and extracting part of the book info

Open one of the sub-category listing pages and see how much of the data we need is on it.
Then look at the page source to make sure the data is really in the HTML.
It is: the listing contains the book name and URL, the author, the publisher name, the store name and URL, and the publication date. All of these are wanted, so extract them first; the price and the comment count are not here and will be fetched separately below.

    def next_parse(self,response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            try:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
            except:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] ="https:"  + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            try:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
            except:
                item["publish_time"] = i.xpath(
                    "./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()

Getting the book price and comment count

First we have to work out where the price and the comment count come from. They are not in the HTML of the book pages; they are loaded as JSON generated by JavaScript, so we need to find the request that produces them. Digging through the network requests shows the price comes from this URL: https://p.3.cn/prices/mgets?type=1&skuIds=J_11892005346&pdtk=&pduid=1551774170386597393748&pdpin=&pdbp=0&callback=jQuery6622062&_=1560704913535
Most of those parameters turn out to be unnecessary, and the URL can be trimmed down to https://p.3.cn/prices/mgets?skuIds=J_11892005346
To get the price of any book, just replace the number after J_ with that book's sku id. The sku id is available on the sub-category listing page (the data-sku attribute the spider extracts), so once it is pulled out we can request the corresponding price. The response contains both the original price and the current selling price.
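You can sanity-check the endpoint outside Scrapy with a short script. This is only a sketch and assumes the endpoint still returns a JSON list with one object per sku, where 'p' is the selling price and 'm' the original price, which is how the callback below reads it:

import json
import requests

sku = "11892005346"   # the data-sku attribute of one book on the listing page
resp = requests.get("https://p.3.cn/prices/mgets?skuIds=J_{}".format(sku))
data = json.loads(resp.text)        # a list with one dict per requested sku
print(data[0]["m"], data[0]["p"])   # original price, selling price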

    def parse_dateli(self,response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']

Extracting the comment count

The comment count is served as JSON in the same way, so its endpoint can be located exactly as we located the price endpoint; I will not walk through it again (feel free to ask me if it is unclear). The spider below ends up calling https://club.jd.com/comment/productCommentSummaries.action?referenceIds=<sku id>.
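For a quick standalone check of that endpoint (the sku id is a placeholder; the JSON is read the same way the parse_comment callback below does):

import json
import requests

sku = "11892005346"   # placeholder sku id
url = "https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(sku)
data = json.loads(requests.get(url).text)
print(data["CommentsCount"][0]["CommentCount"])   # total number of comments for this book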

    def parse_comment(self,response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']

Setting up items

items.py defines the fields each scraped item can carry, and it is very simple to use: declare one scrapy.Field() for every piece of data you want to collect.

import scrapy


class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class=scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()

Then import the JingdongItem class from this file inside the spider and create an instance to hold the data, for example:

item = JingdongItem()

Remember that this is done inside the spider file; the full source code below shows exactly where.

Setting up pipelines

pipelines.py is where the scraped data gets saved. It could be written to MySQL, MongoDB or a CSV file; here I write it to a CSV file.

import pandas as pd

class JingdongPipeline(object):
    def open_spider(self,spider):
        print("spider started")
    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))   # one item dict -> one DataFrame row
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append to the CSV file; header=0 and index=False keep repeated headers and index columns out of the file
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=0, encoding="utf_8_sig")
        print("row written")
        return item
    def close_spider(self,spider):
        print("spider finished")

Notes

Two things to watch out for:
1. Domain restrictions: the sub-category pages, the price endpoint and the comment endpoint live on different domains, so all of them have to be added to allowed_domains (in this project: 'jd.com', 'p.3.cn' and 'club.jd.com', as in the spider source below).

2. Using yield scrapy.Request: the meta argument is what passes the partially-filled item on to the next callback, for example:

yield scrapy.Request(url=item["next_class_url"],callback=self.next_parse,meta={"item":deepcopy(item)})

This uses deepcopy from the copy module. The item must be deep-copied before being passed on, otherwise the data gets mixed up: Scrapy schedules requests asynchronously, so the loop may already be filling the item for the third book while the request carrying the first one has not been processed yet, and every request would end up pointing at the same, constantly-overwritten item. Copying the item first avoids this; the small demonstration below shows the effect.
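A tiny pure-Python illustration of the problem (nothing Scrapy-specific, just shared mutable state):

from copy import deepcopy

item = {}
collected = []
for name in ["novel", "history", "biography"]:
    item["next_class_name"] = name
    collected.append(item)              # every entry points to the SAME dict
    # collected.append(deepcopy(item))  # with deepcopy each entry keeps its own value

print(collected)
# without deepcopy all three entries end up showing "biography";
# with deepcopy they show "novel", "history" and "biography" as intended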

Spider source code

import scrapy
from copy import deepcopy
import re
import json
from jingdong.items import JingdongItem
class JdbookSpider(scrapy.Spider):
    name = 'jdbook'
    allowed_domains = ['jd.com','p.3.cn','club.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # top-level categories
        for i in all_list:
            item = JingdongItem()
            item['all_class'] = i.xpath("./a/text()").extract_first()
            class_list = i.xpath("./following-sibling::dd[1]/em")  # following-sibling axis: the dd right after this dt holds the sub-categories
            for j in class_list:
                item["next_class_name"] = j.xpath("./a/text()").extract_first()  #獲取小分類的名字
                item["next_class_url"] = "https:" + "".join(j.xpath("./a/@href").extract_first())
                yield scrapy.Request(url=item["next_class_url"],callback=self.next_parse,meta={"item":deepcopy(item)})

    def next_parse(self,response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            try:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first().strip()  # strip spaces and newlines
            except:
                item['book_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] ="https:"  + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            try:
                item["publish_time"] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first().strip()
            except:
                item["publish_time"] = i.xpath(
                    "./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku),callback=self.parse_dateli,meta={"item":deepcopy(item)})
        next_page_url = "https://list.jd.com"+ "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        judge = "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        if judge is not None:
            yield scrapy.Request(url=next_page_url,callback=self.next_parse,meta={"item":deepcopy(item)})
    def parse_dateli(self,response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']
        id = js[0]["id"]
        id = "".join(re.findall(r'\d\.*\d*',id))
        yield scrapy.Request(url="https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(id),callback=self.parse_comment,meta={"item":deepcopy(item)})

    def parse_comment(self,response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']
        yield deepcopy(item)
        # print(deepcopy(item))

settings.py source code

# -*- coding: utf-8 -*-

# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jingdong'

SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'

DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  #定義一個去重的類,用來將url去重
SCHEDULER = "scrapy_redis.scheduler.Scheduler"   #指定隊列
SCHEDULER_PERSIST = True  #將程序持久化保存

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

LOG_LEVEL = "DEBUG"

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
  'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'jingdong.pipelines.JingdongPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

REDIS_URL = "redis://127.0.0.1:6379"   #要把數據寫入redis數據庫還要添加這個參數

items.py source code

import scrapy


class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class=scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()

pipelines.py source code

import pandas as pd

class JingdongPipeline(object):
    def open_spider(self,spider):
        print("spider started")
    def process_item(self, item, spider):
        print(item)
        data_list = []
        data_list.append(dict(item))   # one item dict -> one DataFrame row
        data_frame = pd.DataFrame(data_list)
        data_frame.index.name = "id"
        # append to the CSV file; header=0 and index=False keep repeated headers and index columns out of the file
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=0, encoding="utf_8_sig")
        print("row written")
        return item
    def close_spider(self,spider):
        print("spider finished")

If anything is unclear you can reach me on QQ: 1693490575
