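[Scrapy Study Notes] Crawler Practice 2 (Saving Data to a Database Asynchronously)

Disclaimer: this post is for technical exchange only. Do not use it for illegal purposes; this blog is not responsible for any loss caused by such use.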

1. Environment

Python 3.7
PyCharm
Scrapy 1.7.3
Windows 10
pymysql

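2. Preparation

In a cmd prompt, change to the directory where the project should live and run scrapy startproject hehe
Once that succeeds, run cd hehe
Then run scrapy genspider jd jd.com
The newly created spider file jd.py now appears in the spiders folder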

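3. Analyzing the page

All the top-level categories sit together on the page. Right-click and choose Inspect, and their names and URLs are easy to find. The names and URLs of the subcategories under each top-level category are just as easy to locate.

Inspecting a subcategory link gives its URL in the same way. Next comes the product data itself; here I only want each book's title and price.

While crawling, though, you will hit a problem: the price cannot be scraped. The reason, as I found out later, is that the price is loaded by a separate AJAX request instead of being part of the page's HTML, so it never appears in the response Scrapy receives. The analysis below shows where the prices actually come from.

Right-click and open Inspect, then take the price of any book, type it into the panel and press Enter. Two results show up; the first is the request that returned the price (first screenshot).
That request goes to https://p.3.cn/prices/mgets?callback=jQuery9186544&type=1&area=19_1643_36176_0&skuIds=J_54816170278%2CJ_55178298366%2CJ_52359934836%2CJ_56652847646%2CJ_27852197042&pdbp=0&pdtk=&pdpin=&pduid=1565946264342744483359&source=list_pc_front&_=1569742616673 to fetch the prices. After removing the parameters that are not needed, the URL for a single product's price is: https://p.3.cn/prices/mgets?skuIds=J_54816170278. Between products only the trailing number changes, so the question is where to get that number.
It turns out to be written in an attribute of the product's own tag (second screenshot), so we only need to read that value first and then build the request URL.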
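
The URL construction and the price lookup described above can be sketched in a few lines. This is only an illustration: the helper names build_price_url and extract_price are made up here, and the sample response body is a hypothetical payload shaped like a one-element JSON list with an 'op' field (the field the spider reads later). The 'id' and 'p' keys and all values in it are assumptions.

```python
import json

# Endpoint taken from the trimmed price URL in the text above.
PRICE_ENDPOINT = "https://p.3.cn/prices/mgets"


def build_price_url(sku):
    """Return the price URL for one SKU id, e.g. '54816170278'."""
    return "{}?skuIds=J_{}".format(PRICE_ENDPOINT, sku)


def extract_price(body, field="op"):
    """Read one price field from the JSON list in the response body."""
    records = json.loads(body)
    return records[0][field]


# Hypothetical response body; the real payload may carry other fields.
sample_body = '[{"id": "J_54816170278", "p": "35.00", "op": "45.00"}]'

price_url = build_price_url("54816170278")
price = extract_price(sample_body)
print(price_url)
print(price)
```

In the spider, the SKU would come from the data-sku attribute of each book's tag instead of a hard-coded string.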

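In total, then, what I want to scrape is:

Top-level category names and their URLs
Subcategory names and their URLs
Book titles and prices

I won't show the element-locating steps, since they are not difficult. Anyone who has come to learn Scrapy has, like me, probably already used requests, urllib, selenium and the like to the point of boredom, right?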

4. Scraping the data

Here is the code for jd.py:

# -*- coding: utf-8 -*-
import scrapy
from copy import deepcopy
import json
import urllib.parse  # urljoin lives in the urllib.parse submodule, so import it explicitly

class JdSpider(scrapy.Spider):
    name = 'jd'
    allowed_domains = ['jd.com','p.3.cn']  # allowed domains; the domain of the price endpoint must be listed too
    start_urls = ['https://list.jd.com/list.html?cat=1713,3258']

    def parse(self, response):
        type_name_list=response.xpath('//ul[@class="menu-drop-list"]/li')
        for type_name in type_name_list:
            item={}  # a plain dict is used as the item; items.py is not used this time
            item['type_name']=type_name.xpath('./a/text()').get()   # top-level category name
            item['type_name_url'] = type_name.xpath('./a/@href').get()  # top-level category URL
            if item['type_name_url'] is None:
                continue  # no link to follow, so skip this entry
            # make the URL absolute
            item['type_name_url']=urllib.parse.urljoin(response.url,item['type_name_url'])
            yield scrapy.Request(
                item['type_name_url'],
                callback=self.parse_type_list,
                meta={'item':deepcopy(item)}  # deepcopy so later loop iterations cannot modify this request's item
            )

    def parse_type_list(self,response):
        item=response.meta['item']  # the item passed along from the previous request
        fenlei_list=response.xpath('//div[@id="J_selectorCategory"]//ul/li')
        for fenlei in fenlei_list:
            item['fenlei_name']=fenlei.xpath('./a/@title').get() # subcategory name
            item['fenlei_url']=fenlei.xpath('./a/@href').get()   # subcategory URL
            if item['fenlei_url'] is None:
                continue  # no link to follow, so skip this entry
            # make the URL absolute
            item['fenlei_url']=urllib.parse.urljoin(response.url,item['fenlei_url'])
            yield scrapy.Request(
                item['fenlei_url'],
                callback=self.parse_fenlei_list,
                meta={'item':deepcopy(item)}  # deepcopy is needed here as well
            )

    def parse_fenlei_list(self,response):
        item=response.meta['item']
        book_list=response.xpath('//ul[@class="gl-warp clearfix"]/li')
        for book in book_list:
            book_name=book.xpath('.//div[@class="p-name"]/a/em/text()').get()
            item['book_name']=book_name.strip() if book_name else None  # book title (None if the node is missing)
            item['book_author']=book.xpath('.//span[@class="author_type_1"]/a/text()').getall() # book authors
            sku=book.xpath('./div[1]/@data-sku').get()  # SKU id stored on each book's tag
            # build the price request URL
            if sku is not None:
                item['book_price_url']="https://p.3.cn/prices/mgets?skuIds=J_{}".format(sku)
                yield scrapy.Request(
                    item['book_price_url'],
                    callback=self.parse_book_price,
                    meta={'item':deepcopy(item)}  # deepcopy again, for the same reason
                )

    def parse_book_price(self,response):
        item=response.meta['item']
        item['book_price']=json.loads(response.body.decode())[0]['op'] # price
        # print(item)
        yield item  # hand the item to the pipeline to be saved
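
Why the deepcopy calls matter is easier to see outside Scrapy. A request's callback runs later than the loop that created the request, so if every request's meta pointed at the same dict, each callback would see whatever the last loop iteration wrote. The sketch below imitates that with a plain list standing in for the scheduler; collect_shared, collect_copied and the category names are illustrative only.

```python
from copy import deepcopy


def collect_shared(names):
    """Queue the same dict object for every name (the bug deepcopy avoids)."""
    item = {}
    queued = []
    for name in names:
        item["fenlei_name"] = name
        queued.append({"item": item})  # every meta points at one shared dict
    # The "callbacks" only read the metas after the loop has finished.
    return [meta["item"]["fenlei_name"] for meta in queued]


def collect_copied(names):
    """Queue an independent snapshot of the dict for every name."""
    item = {}
    queued = []
    for name in names:
        item["fenlei_name"] = name
        queued.append({"item": deepcopy(item)})  # snapshot taken per request
    return [meta["item"]["fenlei_name"] for meta in queued]


names = ["novels", "comics", "history"]
shared_result = collect_shared(names)
copied_result = collect_copied(names)
print(shared_result)  # ['history', 'history', 'history']
print(copied_result)  # ['novels', 'comics', 'history']
```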

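5. Saving the data (synchronous writes to the database)

Saving the data is the job of pipelines.py.

Here the data goes into a MySQL database. If pymysql is not installed yet, run pip install pymysql in cmd to install it.
The code in pipelines.py is as follows: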

import pymysql

conn=pymysql.connect(host='localhost',user='yourroot',password='yourpassword',database='yourdatabase',charset='utf8')
cursor=conn.cursor()

class HehePipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'jd':
            type_name=item['type_name']
            fenlei_name=item['fenlei_name']
            book_name=item['book_name']
            book_price=item['book_price']
            sql='insert into jd3(type_name,fenlei_name,book_name,book_price) values(%s,%s,%s,%s)'
            cursor.execute(sql,[type_name,fenlei_name,book_name,book_price])
            conn.commit()
        return item  # process_item should return the item so any later pipeline still receives it
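
The pipeline above opens its connection at import time and never closes it. One common alternative is to use the open_spider and close_spider hooks that Scrapy calls on a pipeline. The sketch below shows that shape under clear assumptions: the class name SqliteBookPipeline is made up, and sqlite3 stands in for pymysql purely so the example runs without a MySQL server (note that sqlite3 uses ? placeholders where pymysql uses %s).

```python
import sqlite3


class SqliteBookPipeline:
    """Hypothetical pipeline; sqlite3 stands in for pymysql in this sketch."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "create table if not exists jd3("
            "type_name text, fenlei_name text, book_name text, book_price text)"
        )

    def process_item(self, item, spider):
        # sqlite3 uses '?' placeholders; with pymysql these would be '%s'.
        self.conn.execute(
            "insert into jd3(type_name, fenlei_name, book_name, book_price) "
            "values(?, ?, ?, ?)",
            (item["type_name"], item["fenlei_name"],
             item["book_name"], item["book_price"]),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.conn.close()


# Drive the hooks by hand, in the order Scrapy would call them.
pipeline = SqliteBookPipeline()
pipeline.open_spider(spider=None)
returned = pipeline.process_item(
    {"type_name": "books", "fenlei_name": "novels",
     "book_name": "Example", "book_price": "45.00"},
    spider=None,
)
row_count = pipeline.conn.execute("select count(*) from jd3").fetchone()[0]
pipeline.close_spider(spider=None)
print(row_count)
```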

The crawler is now mostly written, but a few settings in settings.py still need changing. Add the following:

LOG_LEVEL='WARNING' # log output level
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'  # User-Agent request header
ROBOTSTXT_OBEY = False  # when True (the value in the generated project template), Scrapy downloads the site's robots.txt first and skips every URL it disallows, so pages can go uncrawled; False turns that check off
ITEM_PIPELINES = {
   'hehe.pipelines.HehePipeline': 300,
}
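
To make the ROBOTSTXT_OBEY setting concrete, here is a small sketch of the kind of check a robots.txt-respecting crawler performs before fetching a URL. It uses urllib.robotparser from the standard library rather than Scrapy's own middleware, and the rules and example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, not the ones any real site serves.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)  # parse the rule lines directly instead of downloading them

# A crawler that obeys robots.txt asks this question before each request.
private_ok = parser.can_fetch("*", "https://example.com/private/page.html")
list_ok = parser.can_fetch("*", "https://example.com/list.html")
print(private_ok, list_ok)
```

With ROBOTSTXT_OBEY = False, Scrapy skips this kind of check entirely and requests every URL the spider yields.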

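Finally, in cmd, go to the project root (the directory containing scrapy.cfg), run scrapy crawl jd, and wait for it to finish.

Once it finishes, the scraped rows can be seen in the database.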

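6. Saving the data (asynchronous writes to the database)

Writing to the database asynchronously is considerably faster than writing synchronously (my timings are in the closing notes). It relies on adbapi from the twisted.enterprise package. Twisted is installed together with Scrapy, so it can be imported directly. Only pipelines.py needs to change: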

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql
from pymysql import cursors  # pymysql's cursor classes
from twisted.enterprise import adbapi

class XixiPipeline(object):
    def __init__(self):
        dbparams={
            'host' : 'localhost',
            'user' : 'yourroot',
            'password' : 'yourpassword',
            'database' : 'yourdatabase',
            'charset' : 'utf8',
            'cursorclass' : cursors.DictCursor # one extra argument: the cursor class to use
        }
        self.adpool=adbapi.ConnectionPool('pymysql',**dbparams)
        self._sql=None

    @property  # lets self.sql be read like an attribute instead of being called as a method
    def sql(self):
        if not self._sql:
            self._sql='insert into jd2(type_name,fenlei_name,book_name,book_price) values(%s,%s,%s,%s)'
        return self._sql

    def process_item(self,item,spider):
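        defer=self.adpool.runInteraction(self.insert_item,item)  # pass the function that does the actual insert; it runs in a pool thread, otherwise this would be no different from the synchronous version
        defer.addErrback(self.handle_error,item,spider)  # register a function that receives any error
        return item  # return the item so any later pipeline still receives it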

    def insert_item(self,cursor,item):
        cursor.execute(self.sql,[item['type_name'],item['fenlei_name'],item['book_name'],item['book_price']])

    def handle_error(self,error,item,spider):
        print('-' * 30)
        print('Error:',error)
        print('-' * 30)
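
The idea behind runInteraction plus addErrback is to hand each blocking insert to a worker thread and be notified if it fails, so that process_item itself returns at once. The sketch below is an analogy built only from the standard library, not adbapi itself: ThreadPoolExecutor plays the role of the connection pool, add_done_callback plays the role of the errback, and sqlite3 with a temporary file stands in for MySQL. All names in it are illustrative.

```python
import os
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor

INSERT_SQL = ("insert into jd2(type_name, fenlei_name, book_name, book_price) "
              "values(?, ?, ?, ?)")


def insert_item(db_path, item):
    """Blocking insert; runs on a worker thread with its own connection."""
    conn = sqlite3.connect(db_path, timeout=30)
    try:
        conn.execute(INSERT_SQL, (item["type_name"], item["fenlei_name"],
                                  item["book_name"], item["book_price"]))
        conn.commit()
    finally:
        conn.close()


def run_demo(items):
    """Submit every insert to a thread pool; collect failures via a callback."""
    errors = []

    def on_done(future):
        # Plays the role of the errback: record the exception, do not raise.
        if future.exception() is not None:
            errors.append(future.exception())

    with tempfile.TemporaryDirectory() as db_dir:
        db_path = os.path.join(db_dir, "demo.db")
        setup = sqlite3.connect(db_path)
        setup.execute("create table jd2(type_name text, fenlei_name text, "
                      "book_name text, book_price text)")
        setup.commit()
        setup.close()

        with ThreadPoolExecutor(max_workers=3) as pool:
            for item in items:
                # submit() returns immediately; the insert happens in the pool.
                pool.submit(insert_item, db_path, item).add_done_callback(on_done)

        check = sqlite3.connect(db_path)
        saved = check.execute("select count(*) from jd2").fetchone()[0]
        check.close()
    return saved, errors


good = [{"type_name": "books", "fenlei_name": "novels",
         "book_name": "Example {}".format(i), "book_price": "45.00"}
        for i in range(5)]
bad = [{"type_name": "books"}]  # missing keys, so insert_item raises KeyError
saved, errors = run_demo(good + bad)
print(saved, len(errors))
```

The failing item does not stop the others from being saved, which is the same behaviour the errback gives the Scrapy pipeline.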

As before, settings.py must be updated for the new pipeline to take effect. Add one line and comment out the previous one, as follows:

ITEM_PIPELINES = {
   # 'hehe.pipelines.HehePipeline': 300,
   'hehe.pipelines.XixiPipeline': 300,
}

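That is the end of this crawler.

Closing notes

Looking back, I feel this crawler was a step up for me. I had never looked into or implemented asynchronous database writes before, and now I have seen how effective they are. For the record, the synchronous version took me almost an hour, while the asynchronous version took only about 20 minutes, for a little over 47,000 rows in total. Scrapy itself already handles requests asynchronously (it is built on Twisted), so it is faster than an ordinary crawler to begin with. Add asynchronous writes on top and, as you can imagine, it feels great!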
