Scrapy
Scrapy is a framework built specifically for Python crawlers. It schedules requests asynchronously, so many pages download concurrently without you managing threads yourself. You can get a concurrent crawler without Scrapy, but it is clumsy and takes far more work; Scrapy makes concurrent crawling simple and adds many other powerful features on top. If anything is unclear, the Chinese translation of the Scrapy docs is a good reference: https://yiyibooks.cn/zomin/Scrapy15/index.html
scrapy-redis
scrapy-redis is a component built on top of Scrapy. It adds resumable crawls (continue after an interruption), URL deduplication, persistent crawl state, and distributed crawling, and it is very simple to use. First download the source from the official repository at https://github.com/rmax/scrapy-redis; there are only three main files in it.
The one that matters here is the scrapy_redis package itself. The rest of this post walks through using scrapy-redis to crawl JD book listings.
Creating the Scrapy project
1. Open a terminal (cmd) in the directory where you want the project and create it: scrapy startproject jingdong, where the trailing jingdong is the project name.
2. cd into the project and create a spider file: scrapy genspider jdbook jd.com, where jdbook is the spider name and jd.com is the allowed domain. Always give a domain restriction when creating a spider, so it cannot wander off onto unrelated URLs.
3. Copy the scrapy_redis directory from the downloaded source into the project, because much of what follows calls functions from it.
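After these three steps the project should look roughly like this (a sketch of the standard Scrapy layout; jdbook.py and the copied scrapy_redis package are the only additions):
jingdong/
    scrapy.cfg
    jingdong/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        scrapy_redis/        <- copied from the scrapy-redis source
        spiders/
            __init__.py
            jdbook.py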
Configuring settings.py
1. The first thing to change is ROBOTSTXT_OBEY: set it to False (the generated template sets it to True). When it is True, Scrapy fetches robots.txt from the site root before crawling and refuses to request any URL those rules disallow, which would block most of this crawl.
2. Impersonate a browser by setting User-Agent and some default request headers.
These settings ship commented out; uncomment them and fill in values copied from a real browser. We do not need cookies here. If you do want cookies or proxy IPs, they belong in the downloader middleware, i.e. in middlewares.py. JD's anti-scraping on its book pages is weak enough that this crawl works without either, but you can add them yourself.
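For reference, a minimal downloader middleware that injects a proxy and cookies might look like the sketch below; the class name, proxy address, and cookie values are placeholders, not part of this project:
import random  # handy if you later rotate over a pool of proxies

# middlewares.py -- hypothetical sketch, not used in this crawl
class ProxyCookieMiddleware(object):
    def process_request(self, request, spider):
        # Route the request through a proxy (address is a placeholder)
        request.meta['proxy'] = 'http://127.0.0.1:8888'
        # Attach cookies copied from a logged-in browser session
        request.cookies = {'example_cookie': 'example_value'}
Registering it would be one line in settings.py: DOWNLOADER_MIDDLEWARES = {'jingdong.middlewares.ProxyCookieMiddleware': 543}.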
3. Configure Redis and the persistence parameters:
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup class that filters out URLs already seen
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use the Redis-backed scheduler queue
SCHEDULER_PERSIST = True  # keep the crawl state in Redis when the spider stops
REDIS_URL = "redis://127.0.0.1:6379"
DUPEFILTER_CLASS names the deduplication class, which makes sure a URL that has been visited is never visited again. The rule is simple: each URL is hashed into a fixed-length fingerprint string, and the fingerprints are stored in Redis. When a new URL arrives it is hashed the same way; if its fingerprint is already in the database the URL is skipped, otherwise it is fetched and its fingerprint is stored.
SCHEDULER names the request queue. Each URL to visit becomes a Request object that is pushed into the Redis-backed queue, and Requests are popped off one by one to be downloaded; because the queue lives in Redis, several crawler processes can share it, which is what enables concurrent and distributed crawling.
SCHEDULER_PERSIST makes the crawl state persistent. Without this flag, the queue and the fingerprint set are flushed from Redis when the program ends, so you lose both persistence and the ability to resume from a breakpoint.
REDIS_URL is the address of the Redis server.
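A stripped-down illustration of the fingerprint idea (not the real scrapy_redis code, which hashes the request method and body as well as the URL, but the same mechanism; the key name is made up):
import hashlib
import redis

r = redis.Redis.from_url("redis://127.0.0.1:6379")

def seen_before(url, key="jdbook:dupefilter"):
    fp = hashlib.sha1(url.encode("utf-8")).hexdigest()  # fixed-length fingerprint
    return r.sadd(key, fp) == 0  # SADD returns 0 when the member already exists

print(seen_before("https://book.jd.com/booksort.html"))  # False: first visit, fetch it
print(seen_before("https://book.jd.com/booksort.html"))  # True: duplicate, skip it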
4. To actually save data, the item pipeline must be enabled as well:
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
Analyzing the site
Start from the book category page:
https://book.jd.com/booksort.html
Each top-level category has many sub-categories beneath it. Press F12 and check whether the data we want is served at this URL.
It is: the response contains the names and URLs of both the top-level categories and the sub-categories.
Extracting the top-level and sub-categories
First extract the names of the top-level categories:
all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # top-level categories
for i in all_list:
    item = JingdongItem()
    item['all_class'] = i.xpath("./a/text()").extract_first()
The sub-categories sit in the dd element that follows each dt, i.e. in the next sibling node, so they can be reached with the following-sibling axis.
Then extract each sub-category's name and URL (still inside the loop above):
    class_list = i.xpath("./following-sibling::dd[1]/em")  # sibling axis: the dd right after this dt
    for j in class_list:
        item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
        item["next_class_url"] = "https:" + j.xpath("./a/@href").extract_first()
Following a sub-category URL for partial book info
Next, open a sub-category detail page and see how much of the data we need is on it.
Then check the page source to confirm the data is really in the HTML.
It is: the book name, book URL, author, publisher name, store name and URL, and publication date are all there, so extract those first. The price and comment count are not in this page; they come from the book detail endpoints covered below.
def next_parse(self, response):
    item = response.meta["item"]
    book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
    for i in book_list:
        book_name = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
        item['book_name'] = book_name.strip() if book_name else None  # strip spaces and newlines
        item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
        item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
        item['publisher_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
        publish_time = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
        item['publish_time'] = publish_time.strip() if publish_time else None
        item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
Getting the book price and comment count
Where do the price and comment count live? They are not in the book page HTML itself; they arrive as JSON generated by JavaScript. Searching the network traffic shows the price comes from this URL: https://p.3.cn/prices/mgets?type=1&skuIds=J_11892005346&pdtk=&pduid=1551774170386597393748&pdpin=&pdbp=0&callback=jQuery6622062&_=1560704913535
Most of those parameters turn out to be unnecessary; the URL trims down to https://p.3.cn/prices/mgets?skuIds=J_11892005346
To get any book's price, just change the number after J_. That sku id is present in the sub-category listing page, so extract it there and request this endpoint; the response carries both the original price and the selling price.
def parse_dateli(self, response):
    item = response.meta["item"]
    js = json.loads(response.body)
    item['original_price'] = js[0]['m']
    item['price'] = js[0]['p']
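For reference, the endpoint returns a JSON array; judging from the fields read above it is shaped roughly like this (the values here are made up and the real payload carries extra keys):
[{"id": "J_11892005346", "m": "99.00", "p": "69.30"}]
where m is the original price and p is the selling price.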
Extracting the comment count
The comment count is served as JSON in exactly the same way, so it can be located with the same network-inspection approach as the price; the details are omitted here. Feel free to ask if anything is unclear.
def parse_comment(self, response):
    item = response.meta["item"]
    js = json.loads(response.text)
    item['comment'] = js['CommentsCount'][0]['CommentCount']
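The response (from the club.jd.com productCommentSummaries endpoint used in the full spider below) is shaped roughly like this, inferred from the keys read above; the count is made up and the other fields of the real payload are omitted:
{"CommentsCount": [{"CommentCount": 12345}]}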
Setting up items
items.py defines the data container (not the pipeline itself): declare one Field for every value you want to collect.
import scrapy

class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class = scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()
Then import JingdongItem into the spider file and instantiate it, for example:
item = JingdongItem()
Note that this instance is created inside the spider; the full source below shows exactly where.
Setting up pipelines
The pipeline is what actually saves the data. It could write to MySQL, MongoDB, or a CSV file; here it goes to a CSV file.
import pandas as pd

class JingdongPipeline(object):
    def open_spider(self, spider):
        print("spider opened")

    def process_item(self, item, spider):
        print(item)
        data_frame = pd.DataFrame([dict(item)])
        # Append one row per item; header=False so the header line is not repeated on every append
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=False, encoding="utf_8_sig")
        print("row written")
        return item

    def close_spider(self, spider):
        print("spider closed")
Notes
Two things to watch:
1. Domain scope: the sub-category pages, the price endpoint, and the comment endpoint sit on different domains, so every one of them has to be added to allowed_domains.
2. The meta parameter of yield scrapy.Request: the item has to be passed on to the next callback, so it is handed over through meta. For example:
yield scrapy.Request(url=item["next_class_url"], callback=self.next_parse, meta={"item": deepcopy(item)})
This uses deepcopy from the copy module to snapshot the item before handing it on. Without the copy the data gets scrambled: Scrapy runs requests asynchronously, so the loop may already be filling the item for the third book before the callback for the first one has run, and every callback would see the same constantly-mutated object. Copying first keeps the rows from bleeding into one another.
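A tiny stand-alone illustration of the problem, using plain dicts instead of Scrapy items:
from copy import deepcopy

item = {"name": "book-1"}
queued = [item]            # hand the same object on, like meta={"item": item}
item["name"] = "book-3"    # the loop keeps mutating the shared item
print(queued[0]["name"])   # book-3 -- the "queued" data was overwritten

item = {"name": "book-1"}
queued = [deepcopy(item)]  # snapshot instead
item["name"] = "book-3"
print(queued[0]["name"])   # book-1 -- the snapshot is safe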
Spider source
import scrapy
from copy import deepcopy
import re
import json
from jingdong.items import JingdongItem

class JdbookSpider(scrapy.Spider):
    name = 'jdbook'
    allowed_domains = ['jd.com', 'p.3.cn', 'club.jd.com']
    start_urls = ['https://book.jd.com/booksort.html']

    def parse(self, response):
        all_list = response.xpath("//*[@id='booksort']/div[2]/dl/dt")  # top-level categories
        for i in all_list:
            item = JingdongItem()
            item['all_class'] = i.xpath("./a/text()").extract_first()
            class_list = i.xpath("./following-sibling::dd[1]/em")  # sub-categories live in the next sibling dd
            for j in class_list:
                item["next_class_name"] = j.xpath("./a/text()").extract_first()  # sub-category name
                item["next_class_url"] = "https:" + j.xpath("./a/@href").extract_first()
                yield scrapy.Request(url=item["next_class_url"], callback=self.next_parse, meta={"item": deepcopy(item)})

    def next_parse(self, response):
        item = response.meta["item"]
        book_list = response.xpath("//div[@id='plist']/ul[@class='gl-warp clearfix']/li")
        for i in book_list:
            book_name = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/em/text()").extract_first()
            item['book_name'] = book_name.strip() if book_name else None  # strip spaces and newlines
            item['book_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-name']/a/@href").extract())
            item['publisher_name'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/text()").extract_first()
            item['publisher_url'] = "https:" + "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-store']/a/@href").extract()) if item['publisher_name'] is not None else None
            publish_time = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-date']/text()").extract_first()
            item['publish_time'] = publish_time.strip() if publish_time else None
            item['author'] = i.xpath("./div[@class='gl-i-wrap j-sku-item']/div[@class='p-bookdetails']/span[@class='p-bi-name']/span/a/text()").extract()
            data_sku = "".join(i.xpath("./div[@class='gl-i-wrap j-sku-item']/@data-sku").extract())
            yield scrapy.Request(url="https://p.3.cn/prices/mgets?skuIds=J_{}".format(data_sku), callback=self.parse_dateli, meta={"item": deepcopy(item)})
        # Follow the next-page link; the href is empty on the last page
        next_href = "".join(response.xpath("//*[@id='J_bottomPage']/span[@class='p-num']/a[@class='pn-next']/@href").extract())
        if next_href:
            yield scrapy.Request(url="https://list.jd.com" + next_href, callback=self.next_parse, meta={"item": deepcopy(item)})

    def parse_dateli(self, response):
        item = response.meta["item"]
        js = json.loads(response.body)
        item['original_price'] = js[0]['m']
        item['price'] = js[0]['p']
        sku_id = "".join(re.findall(r'\d+', js[0]["id"]))  # "J_11892005346" -> "11892005346"
        yield scrapy.Request(url="https://club.jd.com/comment/productCommentSummaries.action?referenceIds={}".format(sku_id), callback=self.parse_comment, meta={"item": deepcopy(item)})

    def parse_comment(self, response):
        item = response.meta["item"]
        js = json.loads(response.text)
        item['comment'] = js['CommentsCount'][0]['CommentCount']
        yield deepcopy(item)
settings.py source
# -*- coding: utf-8 -*-
# Scrapy settings for jingdong project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jingdong'
SPIDER_MODULES = ['jingdong.spiders']
NEWSPIDER_MODULE = 'jingdong.spiders'
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # dedup class that filters out URLs already seen
SCHEDULER = "scrapy_redis.scheduler.Scheduler"  # use the Redis-backed scheduler queue
SCHEDULER_PERSIST = True  # keep the crawl state in Redis when the spider stops
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
LOG_LEVEL = "DEBUG"
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Language': 'en',
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'jingdong.middlewares.JingdongDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'jingdong.pipelines.JingdongPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
REDIS_URL = "redis://127.0.0.1:6379"  # required for scrapy-redis to write to the Redis database
items.py source
import scrapy

class JingdongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    all_class = scrapy.Field()
    next_class_name = scrapy.Field()
    next_class_url = scrapy.Field()
    book_name = scrapy.Field()
    book_url = scrapy.Field()
    comment = scrapy.Field()
    price = scrapy.Field()
    publisher_name = scrapy.Field()
    publisher_url = scrapy.Field()
    publish_time = scrapy.Field()
    author = scrapy.Field()
    original_price = scrapy.Field()
pipelines.py source
import pandas as pd

class JingdongPipeline(object):
    def open_spider(self, spider):
        print("spider opened")

    def process_item(self, item, spider):
        print(item)
        data_frame = pd.DataFrame([dict(item)])
        # Append one row per item; header=False so the header line is not repeated on every append
        data_frame.to_csv("I:/crack/DATA/jdbook.csv", mode='a', index=False, header=False, encoding="utf_8_sig")
        print("row written")
        return item

    def close_spider(self, spider):
        print("spider closed")