Python Crawler from Novice to Master, Part 6: Scrapy (Crawling Dangdang.com Data)

This post uses Scrapy to crawl Dangdang.com: you enter a search keyword (e.g. python, C++, java) and the number of result pages to fetch, and the spider extracts each book's title, author, price, comment count and so on, downloads the corresponding cover images, and draws a horizontal bar chart that shows the most-commented books at a glance.

Covered in this post:

1. Basic Scrapy usage

2. Submitting form data with scrapy.FormRequest()

3. Saving the data to MongoDB and writing it to an .xlsx spreadsheet

4. Setting the Referer header to get around basic anti-crawling checks

5. Downloading images with ImagesPipeline

6. Picking the 10 most-commented books and drawing a horizontal bar chart

 

Full source code:
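
The files below assume the standard layout created by scrapy startproject DangDang_Spider, with entrypoint.py placed at the project root next to scrapy.cfg (an assumption; adjust to wherever your scrapy.cfg actually lives):

DangDang_Spider/
├── scrapy.cfg
├── entrypoint.py
└── DangDang_Spider/
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders/
        └── dangdang.py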

entrypoint.py

from scrapy.cmdline import execute

execute(["scrapy","crawl","dangdang"])
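
Running entrypoint.py is equivalent to running scrapy crawl dangdang from the project root; the script form just makes it easy to start or debug the spider from an IDE.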

items.py

import scrapy


class DangdangSpiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # book title
    book_name = scrapy.Field()
    # author
    author = scrapy.Field()
    # publisher
    publisher = scrapy.Field()
    # price
    price = scrapy.Field()
    # number of comments
    comments_num = scrapy.Field()
    # cover image URL
    image_url = scrapy.Field()
    # search keyword
    book_key = scrapy.Field()
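
Note that parse() below fills each field with a whole page's worth of values at once, so every Field here ends up holding a list rather than a single value; the pipelines rely on that.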

dangdang.py

# -*- coding: utf-8 -*-
import scrapy
from lxml import etree
from DangDang_Spider.items import DangdangSpiderItem
class DangdangSpider(scrapy.Spider):
    name = 'dangdang'
    allowed_domains = ['dangdang.com']
    # used only by start_requests() below, so a plain string URL is fine here
    start_urls = 'http://search.dangdang.com/'

    total_comments_num_list = []
    total_book_name_list = []
    # Issue the search requests; paging only changes the page_index value
    def start_requests(self):
        self.key = input("Enter the book keyword to search for: ")
        pages = input("Enter the number of result pages to fetch: ")
        # keep asking until we get an integer between 1 and 100
        while not pages.isdigit() or int(pages) <= 0 or int(pages) > 100:
            pages = input("Invalid input, please enter an integer between 1 and 100: ")
        form_data = {
            'key': self.key,
            'act': 'input',
            'page_index': '1'
        }
        for i in range(int(pages)):
            form_data['page_index'] = str(i + 1)
            # scrapy.FormRequest carries the form data; its method defaults to POST,
            # but Dangdang's search uses GET, so the data is sent as the query string
            yield scrapy.FormRequest(self.start_urls, formdata=form_data, method='GET', callback=self.parse)

    # extract the data with XPath (one parallel list per field, covering the whole page)
    def parse(self, response):
        xml=etree.HTML(response.text)
        book_name_list=xml.xpath('//div[@id="search_nature_rg"]/ul//li/a/@title')
        author_list=xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="search_book_author"]/span[1]/a/@title')
        publisher_list=xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="search_book_author"]/span[3]/a/@title')
        price_list=xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="price"]/span[1]/text()')
        comments_num_list=xml.xpath('//div[@id="search_nature_rg"]/ul//li/p[@class="search_star_line"]/a/text()')
        image_url_list=xml.xpath('//div[@id="search_nature_rg"]/ul//li/a/img/@data-original')
        item = DangdangSpiderItem()
        item["book_name"] = book_name_list
        item['author'] = author_list
        item['publisher'] = publisher_list
        item['price'] = price_list
        item['comments_num'] = comments_num_list
        item['image_url']=image_url_list
        item['book_key']=self.key

        return item
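
parse() above collects one parallel list per field for the whole page, so if a book lacks a field (for example, no comments yet) the lists can drift out of alignment. As a hedged alternative sketch, the same extraction can be done per li node with Scrapy's built-in selectors, so every value stays attached to its own book (XPath expressions reused from above, not re-verified against the live page; note the pipelines above expect whole-page lists and would need matching changes):

    # alternative parse(): yields one item per book instead of one item per page
    def parse(self, response):
        for li in response.xpath('//div[@id="search_nature_rg"]/ul//li'):
            item = DangdangSpiderItem()
            item['book_name'] = li.xpath('./a/@title').get()
            item['author'] = li.xpath('./p[@class="search_book_author"]/span[1]/a/@title').get()
            item['publisher'] = li.xpath('./p[@class="search_book_author"]/span[3]/a/@title').get()
            item['price'] = li.xpath('./p[@class="price"]/span[1]/text()').get()
            item['comments_num'] = li.xpath('./p[@class="search_star_line"]/a/text()').get()
            item['image_url'] = li.xpath('./a/img/@data-original').get()
            item['book_key'] = self.key
            yield item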



settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for DangDang_Spider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'DangDang_Spider'

SPIDER_MODULES = ['DangDang_Spider.spiders']
NEWSPIDER_MODULE = 'DangDang_Spider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'DangDang_Spider (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'DangDang_Spider.middlewares.DangdangSpiderSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# Enable the custom downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'DangDang_Spider.middlewares.DangdangSpiderDownloaderMiddleware': 423,
    'DangDang_Spider.middlewares.DangdangSpiderRefererMiddleware':1
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'DangDang_Spider.pipelines.MongoPipeline': 300,     # save the data to MongoDB
    'DangDang_Spider.pipelines.FilePipeline': 400,      # save the data to an Excel sheet
    'DangDang_Spider.pipelines.SaveImagePipeline': 450, # download images via Scrapy's built-in ImagesPipeline
    'DangDang_Spider.pipelines.PicturePipeline': 500    # pick the 10 most-commented books and plot them
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# With the settings below, Scrapy caches every request; when a request is repeated, the cached response is returned instead of hitting the site again, which speeds up local debugging and reduces load on the site.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

# MongoDB settings: host / port / database name / collection name
MONGODB_HOST = '127.0.0.1'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'dangdang'
MONGODB_DOCNAME = 'dangdang_collection'

# root directory for downloaded images
IMAGES_STORE='./book_image'

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.utils.project import get_project_settings  # read the project's settings.py
import pymongo
from DangDang_Spider.items import DangdangSpiderItem

import openpyxl
import os

from scrapy.pipelines.images import ImagesPipeline
import scrapy
from scrapy.exceptions import DropItem
import matplotlib.pyplot as plt

# Save the data to MongoDB
class MongoPipeline(object):
    settings=get_project_settings()
    host = settings['MONGODB_HOST']
    port = settings['MONGODB_PORT']
    dbName = settings['MONGODB_DBNAME']
    collectionName = settings['MONGODB_DOCNAME']

    # connect to the database before any items are processed
    def open_spider(self, spider):
        # create the client connection
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        # select the database (created lazily on first write)
        self.db = self.client[self.dbName]
        # select the collection
        self.collection = self.db[self.collectionName]

    def process_item(self, item, spider):
        if isinstance(item,DangdangSpiderItem):
            # regroup the parallel lists so every record carries its own fields
            book_name=item["book_name"]
            author=item['author']
            publisher=item['publisher']
            price=item['price']
            comments_num=item['comments_num']
            for book,au,pu,pr,co in zip(book_name,author,publisher,price,comments_num):
                data = {}
                data['book_name']=book
                data['author']=au
                data['publisher']=pu
                data['price']=pr
                data['comments_num']=co
                self.collection.insert_one(data)
            return item

    # close the database connection once all items have been processed
    def close_spider(self,spider):
        self.client.close()
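
To spot-check what was written, a quick query from a Python shell could look like this (a minimal sketch, assuming the local MongoDB settings above):

import pymongo

client = pymongo.MongoClient('127.0.0.1', 27017)
collection = client['dangdang']['dangdang_collection']
print(collection.count_documents({}))   # how many books were stored
for doc in collection.find().limit(3):  # peek at a few records
    print(doc['book_name'], doc['price'], doc['comments_num'])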


# Save the data to an .xlsx spreadsheet
class FilePipeline(object):
    def __init__(self):
        if os.path.exists("噹噹.xlsx"):
            self.wb = openpyxl.load_workbook("噹噹.xlsx")  # open the existing file
            # a new sheet could be created instead: ws = wb.create_sheet()
            self.ws = self.wb["Sheet"]  # select the sheet by name
        else:
            self.wb = openpyxl.Workbook()  # create a new workbook
            self.ws = self.wb.active  # use the active worksheet
        self.ws.append(['書名', '作者', '出版社', '價格', '評論數'])  # header row
        self.ws.column_dimensions['A'].width = 55  # column widths
        self.ws.column_dimensions['B'].width = 55
        self.ws.column_dimensions['C'].width = 25
        self.ws.column_dimensions['D'].width = 10
        self.ws.column_dimensions['E'].width = 15

    def process_item(self,item,spider):
        # take the length of the shortest list so indexing never goes out of range
        data_count = [len(item['book_name']), len(item['author']), len(item['publisher']), len(item['price']),
                      len(item['comments_num']), ]
        # sorted(): key decides what to sort by; reverse=True means descending, False ascending
        # (taking element [0] of the ascending sort is just min(data_count))
        data_count_least = sorted(data_count, key=lambda data_num: int(data_num), reverse=False)[0]
        for i in range(data_count_least):
            line = [str(item['book_name'][i]), str(item['author'][i]), str(item['publisher'][i]), str(item['price'][i]), str(item['comments_num'][i])]
            self.ws.append(line)
        self.wb.save("噹噹.xlsx")
        return item
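
As a design note, clamping the loop to the shortest list is what keeps the indexing safe; the same row-writing loop can be expressed a little more directly with zip(), which stops at the shortest list by itself (a sketch equivalent to the loop above):

        for book, au, pu, pr, co in zip(item['book_name'], item['author'],
                                        item['publisher'], item['price'],
                                        item['comments_num']):
            self.ws.append([str(book), str(au), str(pu), str(pr), str(co)])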

# Download images with Scrapy's built-in ImagesPipeline
class SaveImagePipeline(ImagesPipeline):
    # request every cover image
    def get_media_requests(self, item, info):
        # meta carries the search keyword, the book title and the file suffix
        # (taken from the URL so the saved file keeps its real type)
        for i in range(len(item['image_url'])):
            yield scrapy.Request(url=item['image_url'][i],meta={'book_key':item['book_key'],'name':item['book_name'][i],'name_suffix':item['image_url'][i].split('.')[-1]})

    # check whether the downloads succeeded
    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; success is True on success, False on failure
        if not results or not results[0][0]:
            raise DropItem('image download failed')   # drop the item if the first download failed
        return item

    # decide where each image is stored and what it is named
    def file_path(self, request, response=None, info=None):
        # build the file name from the meta data, e.g. 'xxx.jpg' or 'xxx.png';
        # replace('/', '_') keeps a '/' in the title from being treated as a directory
        book_name = request.meta['name'].replace('/', '_') + '.' + request.meta['name_suffix']
        # store the image in a folder named after the search keyword
        file_name=u'{0}/{1}'.format(request.meta['book_key'],book_name)
        return file_name
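
With IMAGES_STORE = './book_image' in settings.py, ImagesPipeline joins that root with the relative path returned here, so images end up under paths like ./book_image/<search key>/<book title>.<suffix>.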

# Pick the 10 most-commented books and draw a horizontal bar chart
class PicturePipeline(object):
    comments_num=[]
    book_name=[]
    book_name_sorted=[]
    comments_num_ten=[]
    def process_item(self,item,spider):
        self.get_plot(item['book_name'],item['comments_num'])
        return item

    def get_plot(self, name_list, comments_num_list):
        # accumulate every book seen so far
        for comment, name in zip(comments_num_list, name_list):
            self.comments_num.append(comment)
            self.book_name.append(name)
        # map comment count -> book title
        book_dict = dict(zip(self.comments_num, self.book_name))
        # sort the comment counts in descending numeric order
        comments_num_sorted_list = sorted(book_dict.keys(), key=lambda num: int(num.split('條')[0]), reverse=True)
        # take the 10 books with the most comments
        for i in range(10):
            for key in book_dict.keys():
                if comments_num_sorted_list[i] == key:
                    self.book_name_sorted.append(book_dict[key])
                    break  # found the matching title, no need to keep scanning

        # draw the horizontal bar chart with matplotlib.pyplot
        plt.rcParams['font.sans-serif'] = ['SimHei']  # use the SimHei font so Chinese titles render
        plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
        # default figure size is 6.0 x 4.0 inches at 100 dpi (600*400 px); with the settings below it becomes 2000*800 px
        plt.rcParams['figure.figsize'] = (10.0, 4.0)   # figure size in inches
        plt.rcParams['figure.dpi'] = 200  # resolution
        for i in range(10):
            self.comments_num_ten.append(int(comments_num_sorted_list[i].split('條')[0]))
        # the bar widths must be numeric, not str, hence the int(...split('條')[0]) conversion above
        plt.barh(range(10), width=self.comments_num_ten, label='評論數', color='red', alpha=0.8, height=0.7)  # bars are drawn bottom-up
        # print the value next to each bar; ha/va control horizontal/vertical alignment
        for y, x in enumerate(self.comments_num_ten):
            plt.text(x + 1500, y - 0.2, '%s' % x, ha='center', va='bottom')
        # label the y-axis ticks with the book titles
        plt.yticks(range(10), self.book_name_sorted, size=8)
        # axis label
        plt.ylabel('書名')
        # chart title
        plt.title('評論數前10的書籍')
        # show the legend
        plt.legend()
        plt.show()

middlewares.py

from scrapy import signals

# Set the Referer header to get around basic anti-crawling checks
class DangdangSpiderRefererMiddleware(object):
    def process_request(self, request, spider):
        # use the request's own URL as its Referer
        referer = request.url
        if referer:
            request.headers['referer'] = referer
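
Since process_request only tweaks a header and returns None, Scrapy carries on with the request as usual. A simpler static variant (a sketch, not what this project uses) would be to set a fixed Referer for every request via Scrapy's DEFAULT_REQUEST_HEADERS setting:

DEFAULT_REQUEST_HEADERS = {
    'Referer': 'http://search.dangdang.com/',
}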

Tips:

1. Custom pipelines have to be registered in settings.py:

ITEM_PIPELINES = {
    'DangDang_Spider.pipelines.MongoPipeline': 300,     # save the data to MongoDB
    'DangDang_Spider.pipelines.FilePipeline': 400,      # save the data to an Excel sheet
    'DangDang_Spider.pipelines.SaveImagePipeline': 450, # download images via Scrapy's built-in ImagesPipeline
    'DangDang_Spider.pipelines.PicturePipeline': 500    # pick the 10 most-commented books and plot them
}

2. When downloading images with ImagesPipeline, the storage directory must also be set in settings.py:

# root directory for downloaded images
IMAGES_STORE = './book_image'

3. The Referer middleware must be enabled in settings.py; giving it priority 1 makes it run before the other downloader middlewares:

# Enable the custom downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'DangDang_Spider.middlewares.DangdangSpiderDownloaderMiddleware': 423,
    'DangDang_Spider.middlewares.DangdangSpiderRefererMiddleware': 1
}

4. Horizontal bar charts are drawn with matplotlib.pyplot.barh(y, width, label=..., height=0.8, color='red', align='center').

width holds the length of each bar, i.e. the actual value it represents; passing strings raises an error, so the comment counts have to be converted to integers first.

5. When saving an image, keep its real file extension (.jpg, .png, etc., taken from the URL); otherwise the saved file has no recognizable type and is awkward to open later.

6. While saving images, the files ended up scattered across several unexpected folders (they should all have been under the C++ folder). Debugging showed that some titles contain '/', which the filesystem treats as a directory separator, so the '/' is simply replaced with '_' to avoid this.

Results:

1. The project structure

2. Data written to the spreadsheet

3. Downloaded images

4. The horizontal bar chart

Open issue:

When PicturePipeline draws the chart, it raises: ValueError: shape mismatch: objects cannot be broadcast to a single shape

I haven't tracked down the cause yet; if anyone knows, please let me know. Thanks!
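
A likely cause (my own reading of the code above, not verified against this exact run): comments_num_ten and book_name_sorted are class-level lists, so they keep growing across process_item calls. After the second page they hold 20 entries while plt.barh and plt.yticks are given only range(10) positions, which is exactly the shape mismatch matplotlib complains about. A minimal fix sketch, assuming that diagnosis, rebuilds the top-10 lists from scratch on every call (and, as a side effect, avoids the key collisions of dict(zip(...)) when two books have the same comment count):

        # hypothetical replacement for the top-10 selection and plotting inputs
        pairs = sorted(zip(self.comments_num, self.book_name),
                       key=lambda p: int(p[0].split('條')[0]), reverse=True)[:10]
        top_counts = [int(c.split('條')[0]) for c, _ in pairs]
        top_names = [n for _, n in pairs]
        plt.barh(range(len(top_counts)), width=top_counts, label='評論數', color='red', alpha=0.8, height=0.7)
        plt.yticks(range(len(top_names)), top_names, size=8)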
