Scrapy framework: persistent storage

Disk files

Based on terminal commands

import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    #allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # xpath is the recommended way to parse (the framework ships with an xpath interface)
        div_list = response.xpath("//div[@id='content-left']/div")
        # holds the parsed page data
        data_list = []
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            content = div.xpath(".//div[@class='content']/span/text()").extract_first()
            #print(author+'\n'+content)
            item_dict = {'author': author, 'content': content}
            data_list.append(item_dict)
        return data_list
scrapy crawl qiubai -o qiubai.csv --nolog

Fixing garbled output and extra blank lines:
https://blog.csdn.net/Light__1024/article/details/88655333
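
For the garbled characters in the exported file, one common fix (a minimal sketch, assuming Excel-friendly CSV output is the goal) is to set the feed export encoding in settings.py:

# settings.py: utf-8-sig writes a BOM so Excel opens the CSV with the right encoding
FEED_EXPORT_ENCODING = 'utf-8-sig'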

The difference between / and // in Scrapy's xpath():
https://blog.csdn.net/changer_WE/article/details/84553986

Process summary

  • parse() returns an iterable object holding the parsed page data
  • Run the terminal command: scrapy crawl <spider name> -o <output file>.<extension>
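
The export format follows the file extension; besides CSV, Scrapy's feed exports also handle JSON and XML, for example:

scrapy crawl qiubai -o qiubai.json --nolog
scrapy crawl qiubai -o qiubai.xml --nolog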

Based on pipelines

The idea is to use an Item object to pass the data to the pipeline.
test.py

import scrapy
from first.items import FirstItem


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.qiushibaike.com/text']

    def parse(self, response):
        div_list = response.xpath("//div[@id='content-left']/div")

        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            content = div.xpath(".//div[@class='content']/span/text()").extract_first()
            # 1. Store the parsed values (author and content) in an Item object;
            #    FirstItem is imported above and its fields are defined in items.py
            items = FirstItem()
            items['author'] = author
            items['content'] = content
            # submit the item to the pipeline
            yield items

items.py

import scrapy


class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # define the storage fields
    author = scrapy.Field()
    content = scrapy.Field()

pipelines.py



class FirstPipeline(object):
    # 2. Improvement: this method runs only once, when the spider is opened
    def open_spider(self, spider):
        self.f = open('./qiubai_pipe.txt', 'w', encoding='utf-8')

    # 1. This method receives each item object submitted by the spider;
    #    it is called once per item, so any I/O here runs on every call
    def process_item(self, item, spider):
        author = item['author']
        content = item['content']

        # # If the file were opened in 'w' mode here instead, every call would
        # # overwrite it and only the last item would survive:
        # with open('./qiubai_pipe.txt', 'w', encoding='utf-8') as f:
        #     f.write(author + ':' + content + '\n\n\n')
        # return item

        self.f.write(author + ':' + content + '\n\n\n')
        return item

    def close_spider(self, spider):
        # 3. Runs once when the spider closes
        self.f.close()

In settings.py, uncomment ITEM_PIPELINES (around line 67):

ITEM_PIPELINES = {
   'first.pipelines.FirstPipeline': 300,
}

Process summary:

  • Define the storage fields on the Item class
  • In the spider file, import the Item class, create an instance, fill it with the parsed data, and hand it to the pipeline with yield items
  • In pipelines.py, define the three methods: open the file, write each item, close the file
  • Uncomment ITEM_PIPELINES in the settings file

Databases

MySQL:

pipelines.py

import pymysql


class FirstPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        # connect to the database
        self.conn = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            user='root',
            password='123',
            db='db2',
            charset='utf8')

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        # parameterized query: pymysql handles quoting/escaping of the values
        sql = 'insert into test values (%s, %s)'
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, (author, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()

test.py

import scrapy
from first.items import FirstItem


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.qiushibaike.com/text']

    def parse(self, response):
        div_list = response.xpath("//div[@id='content-left']/div")

        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first()
            content = div.xpath(".//div[@class='content']/span/text()").extract_first()

            items = FirstItem()
            items['author'] = author
            items['content'] = content

            yield items

items.py


import scrapy


class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # define the storage fields
    author = scrapy.Field()
    content = scrapy.Field()

Process summary:

  • The pymysql syntax for driving MySQL!!
  • The rest of the flow matches the file-based pipeline; a table-creation sketch follows below
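
The pipeline above assumes the db2 database already contains a two-column test table. A minimal one-off sketch for creating it with pymysql (the column names and types are assumptions chosen to match the insert above):

import pymysql

# one-off setup: create the table the pipeline inserts into
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123', db='db2', charset='utf8')
cursor = conn.cursor()
cursor.execute('create table if not exists test (author varchar(100), content text)')
conn.commit()
cursor.close()
conn.close()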

Redis:

pipelines.py

import redis
import json


class FirstPipeline(object):
    conn = None

    def open_spider(self, spider):
        # connect to Redis
        self.conn = redis.Redis(
            host='127.0.0.1',
            port=6379
        )

    def process_item(self, item, spider):
        data_dict = {
            'author':item['author'],
            'content':item['content']
        }
        try:
            data_dict = json.dumps(data_dict)
            self.conn.lpush('data', data_dict)
            return item
        except Exception as e:
            print(e)

    def close_spider(self,spider):
        print('ok')

Process summary:

  1. Redis basics:
    http://www.runoob.com/redis/redis-install.html
    redis-server.exe redis.windows.conf
    redis-cli.exe -h 127.0.0.1 -p 6379

  2. redis.Redis() to connect, then lpush('data', data_dict) to store

  3. Values written to Redis must be bytes, strings, or numbers, which is why the dict is serialized with json.dumps first

  4. Check the stored list with lrange data 0 -1 (a Python read-back sketch follows below)
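
Besides redis-cli, the stored items can be read back with redis-py; a minimal sketch, assuming the data list written by the pipeline above:

import json
import redis

conn = redis.Redis(host='127.0.0.1', port=6379)
# lrange returns the raw bytes pushed by lpush; parse each entry back into a dict
for raw in conn.lrange('data', 0, -1):
    item = json.loads(raw)
    print(item['author'], ':', item['content'])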

Storing to a file, MySQL, and Redis at the same time:

  • Define three pipeline classes in the same pipelines.py, merging the three versions shown above (see the sketch below)
  • Assign each a priority in settings; the lower the number, the earlier the pipeline runs (valid range 0-1000)
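
A minimal sketch of what the merged pipelines.py and the matching ITEM_PIPELINES setting could look like (the class names here are illustrative; the method bodies are the ones shown in the sections above):

# pipelines.py: three pipeline classes side by side
class FilePipeline(object):
    def process_item(self, item, spider):
        # ... write to ./qiubai_pipe.txt as in the file-based pipeline above ...
        return item  # must return item so the next pipeline receives it

class MysqlPipeline(object):
    def process_item(self, item, spider):
        # ... insert into MySQL as in the pymysql pipeline above ...
        return item

class RedisPipeline(object):
    def process_item(self, item, spider):
        # ... lpush onto Redis as in the redis pipeline above ...
        return item

# settings.py: the lower the number, the earlier the pipeline runs
ITEM_PIPELINES = {
    'first.pipelines.FilePipeline': 300,
    'first.pipelines.MysqlPipeline': 400,
    'first.pipelines.RedisPipeline': 500,
}

Each process_item must return the item, otherwise the pipelines registered after it never receive the data.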

Further reading:
https://www.cnblogs.com/foremostxl/p/10085232.html#_label9
