Scrapy feature notes

Spider features

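# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from Demo.items import DemoItem


# `rules` is only honoured by CrawlSpider, so the spider derives from it
# rather than from scrapy.Spider.
class DemoSpider(CrawlSpider):
    # Spider name
    name = 'Demo'
    # Domains the spider is allowed to crawl
    allowed_domains = ['XXX.com']
    # Placeholders used to build the first URL; replace with real values
    baseURL = 'XXX'
    offset = 0
    # List of start URLs
    start_urls = [baseURL + str(offset)]
    # Link-matching rules
    rules = (
        # Follow links that match 'ABC.php' (but not 'DEF.php')
        Rule(LinkExtractor(allow=(r'ABC\.php', ), deny=(r'DEF\.php', ))),
        # Extract links that match 'item.php' and parse them with parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    # Do not override the default parse() method here:
    # CrawlSpider relies on it to apply the rules above.

    # Process the response; yield the extracted items and follow-up requests
    def parse_item(self, response):
        # Select the nodes that hold the data
        node_list = response.xpath("XXX")
        # Extract the data and store it in an item
        for node in node_list:
            # Create a fresh item per node so yielded items stay independent
            item = DemoItem()
            item["XXX"] = node.xpath("./XXX/text()").extract()[0]
            item["XXX"] = node.xpath("./XXX/a/@href").extract()[0]
            # This field may be missing; fall back to an empty string
            if len(node.xpath("./XXX/text()")):
                item["XXX"] = node.xpath("./XXX/text()").extract()[0]
            else:
                item["XXX"] = ""
            # Hand the item over to the item pipelines
            yield item
        # Yield a request for the URL to follow next
        if len(response.xpath("XXX")) == 0:
            url = response.xpath("XXX/@href").extract()[0]
            yield scrapy.Request("XXX" + url, callback=self.parse_item)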

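Selectors

Scrapy selectors are built on the lxml library, so their speed and parsing accuracy are very similar to lxml's.

Usage:
scrapy shell http://www.baidu.com
response.selector.xpath('//span/text()').extract()

Because XPath and CSS queries on a response are so common, Scrapy provides two convenient shortcuts, response.xpath() and response.css():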

>>> response.xpath("//title/text()")
[<Selector xpath='//title/text()' data='百度一下，你就知道'>]

>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='百度一下，你就知道'>]

>>> response.xpath('//title/text()').extract()
['百度一下，你就知道']

>>> response.xpath('//title/text()').extract()[0]
'百度一下，你就知道'

>>> response.xpath('//title/text()').extract()[0] is None
False

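Using xpath and css (from the Scrapy documentation):

HTML structure:
<html>
 <head>
  <base href='http://example.com/' />
  <title>Example website</title>
 </head>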
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

>>> response.xpath('//base/@href').extract()
[u'http://example.com/']

>>> response.css('base::attr(href)').extract()
[u'http://example.com/']

>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.css('a[href*=image]::attr(href)').extract()
[u'image1.html',
 u'image2.html',
 u'image3.html',
 u'image4.html',
 u'image5.html']

>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

>>> response.css('a[href*=image] img::attr(src)').extract()
[u'image1_thumb.jpg',
 u'image2_thumb.jpg',
 u'image3_thumb.jpg',
 u'image4_thumb.jpg',
 u'image5_thumb.jpg']

Using selectors with regular expressions:
The .re() method returns a list of unicode strings rather than selectors, so .re() calls cannot be nested.

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

[u'My image 1',
 u'My image 2',
 u'My image 3',
 u'My image 4',
 u'My image 5']

# re_first returns only the first match
>>> response.xpath('//a[contains(@href, "image")]/text()').re_first(r'Name:\s*(.*)')

u'My image 1'
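
Why .re() cannot be chained is easier to see with plain strings. The sketch below imitates the behaviour with the standard re module only; re_all and re_first are hypothetical helper names, not Scrapy functions, and the sample strings are assumed to be the text nodes a selector would extract.

```python
import re

# Text nodes as a selector might extract them (assumed sample data)
texts = ["Name: My image 1", "Name: My image 2"]
pattern = re.compile(r"Name:\s*(.*)")


def re_all(strings, regex):
    """Imitate .re(): apply the regex to every string, return a flat list of str."""
    matches = []
    for s in strings:
        matches.extend(regex.findall(s))
    return matches


def re_first(strings, regex, default=None):
    """Imitate .re_first(): return only the first match, or a default value."""
    found = re_all(strings, regex)
    return found[0] if found else default


names = re_all(texts, pattern)
first = re_first(texts, pattern)
print(names, first)
```

The results are ordinary str objects with no xpath(), css() or re() methods, which is why any further narrowing has to happen before the regular expression is applied.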

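Note the difference between absolute and relative XPaths:

>>> divs = response.xpath('//div')

# Every <p> in the whole document, not only those inside divs
>>> for p in divs.xpath('//p'):
...     print(p.extract())

# Every <p> anywhere inside divs
>>> for p in divs.xpath('.//p'):
...     print(p.extract())

# Only the <p> elements that are direct children of divs
>>> for p in divs.xpath('p'):
...     print(p.extract())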
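
The same three scopes can be tried without Scrapy installed. The sketch below uses the standard library's xml.etree.ElementTree purely as a stand-in, on a made-up document. ElementTree does not accept an absolute path such as '//p' on an element, so the whole-document case is written as a search from the root instead.

```python
import xml.etree.ElementTree as ET

# Made-up document: one <p> outside the div, one direct child, one nested deeper
markup = """
<html><body>
  <p>outside</p>
  <div>
    <p>direct child</p>
    <section><p>nested</p></section>
  </div>
</body></html>
""".strip()

root = ET.fromstring(markup)
div = root.find(".//div")

# Whole document (what '//p' selects in Scrapy, even when called on divs)
whole_document = [p.text for p in root.iter("p")]
# Anywhere inside the div (the './/p' case)
inside_div = [p.text for p in div.findall(".//p")]
# Direct children of the div only (the 'p' case)
direct_children = [p.text for p in div.findall("p")]

print(whole_document, inside_div, direct_children)
```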

Items

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

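An Item behaves like a dictionary, so fields are read and written by key. Only declared fields are accepted (assigning to an undeclared key raises KeyError), so misspelled field names are caught early, which makes the code more reliable.

item.keys()   # get all keys
item.items()  # get the (key, value) pairs
dict(item)    # convert to a plain dict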

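Item pipelines

After a spider collects an Item, the Item is passed to the Item Pipeline, where a series of components process it in a defined order.

Typical uses of a pipeline:
1. Cleaning HTML data
2. Validating scraped data (checking that the item contains certain fields)
3. Checking for duplicates (and dropping them)
4. Storing the scraped results in a database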

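import json


class MyspiderPipeline(object):
    def __init__(self):
        # Opened in binary mode, so process_item writes UTF-8 encoded bytes
        self.f = open("demo.json", "wb+")

    # Required: Scrapy calls this method for every item
    def process_item(self, item, spider):
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content.encode("utf-8"))
        return item

    # Called by Scrapy when the spider closes
    def close_spider(self, spider):
        self.f.close()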
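
The first role in the list of pipeline uses, cleaning HTML data, might look roughly like the sketch below. CleanTextPipeline is a hypothetical name, the regex-based tag stripping is deliberately crude, and the item is a plain dict so that the example runs without a Scrapy project. A pipeline component is an ordinary Python class, so the same method would work on a scrapy.Item.

```python
import html
import re


class CleanTextPipeline:
    """Hypothetical pipeline: strip tags and tidy whitespace in string fields."""

    TAG_RE = re.compile(r"<[^>]+>")

    def process_item(self, item, spider):
        for key, value in list(item.items()):
            if isinstance(value, str):
                text = self.TAG_RE.sub("", value)    # drop HTML tags
                text = html.unescape(text)           # decode entities such as &amp;
                item[key] = " ".join(text.split())   # collapse runs of whitespace
        return item


# Made-up item; spider is unused in this sketch, so None is passed
cleaned = CleanTextPipeline().process_item(
    {"name": "  <b>Tea &amp; cups</b>\n", "price": 12}, spider=None
)
print(cleaned)
```

Returning the item is what passes it on to the next pipeline in the chain.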

Writing data to MongoDB (from the Scrapy documentation)

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # insert() was removed from recent PyMongo releases; insert_one() replaces it
        self.db[self.collection_name].insert_one(dict(item))
        return item

Dropping duplicates:

from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item

The ITEM_PIPELINES setting in settings.py determines the order in which the pipelines run
(the lower the number, the higher the priority):

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}
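
To make the ordering concrete, the sketch below sorts such a dictionary by its values, which gives the order in which an item would pass through the pipelines. The entries are deliberately listed out of order to show that only the numbers matter; by convention they are chosen in the 0-1000 range.

```python
# Same pipelines as above, listed out of order on purpose
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}

# Items go through the pipelines from the lowest number to the highest
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```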

Simulating a user login

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # Check that the login succeeded before going on.
        # response.body is bytes on Python 3, so compare against response.text.
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return

Sending email

Send an email when the spider finishes (this requires an authorization code from the mail provider).
Reference: http://blog.csdn.net/you_are_my_dream/article/details/60868329

def closed(self, reason):
    # Called when the spider closes; send a notification email
    from scrapy.mail import MailSender

    mailer = MailSender(
        smtphost="smtp.163.com",          # SMTP server
        mailfrom="***********@163.com",   # Sender address
        smtpuser="***********@163.com",   # User name
        smtppass="***********",           # The authorization code, NOT the account's login password
        smtpport=25,                      # Port
    )
    body = """
    Body of the email
    """
    subject = 'Subject of the email'
    # Very short or trivial content is likely to be rejected as spam.
    # Pass str values and let MailSender encode them through charset.
    mailer.send(to=["****@qq.com", "****@qq.com"], subject=subject, body=body, charset="utf-8")

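Downloading and processing files and images

Scrapy provides reusable item pipelines for downloading files attached to an item.
These pipelines share some methods and structure (together they are called the media pipeline). In general you will use either the Files Pipeline or the Images Pipeline.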

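General workflow

1. In a spider, you scrape an item and put the URLs of the files to download into its file_urls field.
2. The item is returned from the spider and enters the item pipeline.
3. When the item reaches the FilesPipeline, the URLs in file_urls are scheduled for download by Scrapy's scheduler and downloader with a higher priority, so they are processed before other pages are scraped. The item stays "locked" at this pipeline stage until the files have finished downloading (or the download has failed for some reason).
4. Once the files are downloaded, another field (files) is populated with the results. It holds a list of dicts with information about each downloaded file: the path it was stored at, the original URL (taken from file_urls), and the file checksum.
5. The files in the files list keep the same order as in file_urls. If a file fails to download, an error is logged and the file does not appear in files.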
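
As a rough illustration of the two fields after the pipeline has run, the sketch below builds an item by hand and works out which URLs failed. All URLs, paths and checksums are made-up placeholders, and failed_urls is a hypothetical helper, not part of Scrapy.

```python
# Hand-built item as it might look after FilesPipeline (placeholder values)
item = {
    "file_urls": [
        "http://example.com/a.pdf",
        "http://example.com/missing.pdf",
        "http://example.com/b.pdf",
    ],
    "files": [
        {"url": "http://example.com/a.pdf", "path": "full/<hash-a>.pdf", "checksum": "<checksum-a>"},
        {"url": "http://example.com/b.pdf", "path": "full/<hash-b>.pdf", "checksum": "<checksum-b>"},
    ],
}


def failed_urls(item):
    """Return the requested URLs that have no entry in the files field."""
    downloaded = {entry["url"] for entry in item.get("files", [])}
    return [url for url in item.get("file_urls", []) if url not in downloaded]


missing = failed_urls(item)
stored_paths = [entry["path"] for entry in item["files"]]
print(missing, stored_paths)
```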

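Note: the Images Pipeline requires the Pillow library to be installed first.

Enabling the media pipeline

First add the following to settings.py: ITEM_PIPELINES (to enable the pipeline and set its priority), FILES_STORE (where files are stored), and IMAGES_STORE (where images are stored).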
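
A minimal settings.py fragment along these lines is sketched below. The two pipeline class paths are the ones shipped with Scrapy; the storage directories and the priority value are placeholders to adapt to the project.

```python
# settings.py (sketch; directories and priorities are placeholders)
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
    "scrapy.pipelines.images.ImagesPipeline": 1,
}

# Where downloaded files and images are written
FILES_STORE = "/path/to/files"
IMAGES_STORE = "/path/to/images"
```

When a custom subclass is used instead of the stock pipeline, its own dotted path goes into ITEM_PIPELINES in place of the built-in one.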

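import os

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline

from Douyu.settings import IMAGES_STORE as img_store


# Subclass ImagesPipeline to customise how images are requested and stored
class DouyuPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule the download of the image URL carried by the item
        imglink = item["imglink"]
        yield scrapy.Request(imglink)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; keep the stored paths
        imagepath = [x["path"] for ok, x in results if ok]
        if not imagepath:
            raise DropItem("Image download failed")
        # Rename the hash-named file after the item's nickname
        os.rename(os.path.join(img_store, imagepath[0]),
                  os.path.join(img_store, item["nickname"] + ".jpg"))
        return item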

Each image is named after the SHA-1 hash of its URL and stored in a folder called full under IMAGES_STORE.

To generate thumbnails as well, add this setting:

IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}
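
The resulting file names can be reproduced with hashlib. The sketch below is meant to mirror the default layout described in the Scrapy documentation (full/<hash>.jpg for the image, thumbs/<size_name>/<hash>.jpg for each thumbnail); image_path and thumb_path are hypothetical helper names and the URL is made up.

```python
import hashlib


def image_id(url):
    """SHA-1 hex digest of the image URL, used as the file name."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def image_path(url):
    """Relative path of the full-size image under IMAGES_STORE."""
    return f"full/{image_id(url)}.jpg"


def thumb_path(url, size_name):
    """Relative path of one thumbnail; size_name is a key of IMAGES_THUMBS."""
    return f"thumbs/{size_name}/{image_id(url)}.jpg"


url = "http://example.com/image1_thumb.jpg"  # made-up URL
print(image_path(url))
print(thumb_path(url, "small"))
```

Because the name depends only on the URL, requesting the same URL twice maps to the same file on disk.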

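Scrapy architecture

Components

Scrapy Engine
The engine controls the flow of data between all components of the system and triggers events when certain actions occur.

Scheduler
The scheduler receives requests from the engine and enqueues them, so that it can hand them back when the engine asks for them later.

Downloader
The downloader fetches page data and passes it to the engine, which in turn passes it to the spider.

Spiders
Spiders are classes written by Scrapy users to parse responses and extract items (i.e. scraped items) or additional URLs to follow. Each spider handles one specific website (or a few). See the Spiders documentation for details.

Item Pipeline
The Item Pipeline processes the items extracted by the spiders. Typical tasks are cleaning, validation, and persistence (for example, storing the item in a database).

Downloader middlewares
Downloader middlewares are specific hooks that sit between the engine and the downloader. They process requests on their way from the engine to the downloader and responses on their way back from the downloader to the engine. They offer a convenient mechanism for extending Scrapy by plugging in custom code.

Spider middlewares
Spider middlewares are specific hooks that sit between the engine and the spiders. They process spider input (responses) and output (items and requests). They offer a convenient mechanism for extending Scrapy by plugging in custom code.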

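Data flow

The data flow in Scrapy is controlled by the execution engine and goes like this:

1. The engine opens a website (opens a domain), locates the Spider that handles it, and asks that Spider for the first URL(s) to crawl.
2. The engine gets the first URLs to crawl from the Spider and schedules them in the Scheduler as Requests.
3. The engine asks the Scheduler for the next URL to crawl.
4. The Scheduler returns the next URL to crawl to the engine, and the engine sends it to the Downloader through the downloader middlewares (request direction).
5. Once the page has finished downloading, the Downloader generates a Response for it and sends it to the engine through the downloader middlewares (response direction).
6. The engine receives the Response from the Downloader and sends it to the Spider for processing through the spider middlewares (input direction).
7. The Spider processes the Response and returns the scraped Items and new Requests (to follow) to the engine.
8. The engine sends the scraped Items (returned by the Spider) to the Item Pipeline and the Requests (returned by the Spider) to the Scheduler.
9. The process repeats (from step 2) until there are no more requests in the Scheduler, and the engine closes the website.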
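
The loop described above can be imitated with a toy model. The sketch below is not Scrapy's implementation: the "web" is a hard-coded dictionary, middlewares are left out, and every name (PAGES, download, parse, run) is made up for illustration. It only shows how requests cycle through a scheduler while items are handed to a pipeline.

```python
from collections import deque

# Fake "web": URL -> (page text, links found on the page)
PAGES = {
    "page1": ("first", ["page2", "page3"]),
    "page2": ("second", ["page3"]),
    "page3": ("third", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    text, links = PAGES[url]
    return {"url": url, "body": text, "links": links}


def parse(response):
    """Spider: yield one item per page, then the URLs to follow."""
    yield {"url": response["url"], "text": response["body"]}
    for link in response["links"]:
        yield link


def run(start_urls):
    """Engine: move requests and items between the other components."""
    scheduler = deque(start_urls)   # Scheduler queue
    seen = set(start_urls)          # simplified duplicate filter
    stored = []                     # stands in for the Item Pipeline
    while scheduler:                # repeat until no requests are left
        response = download(scheduler.popleft())
        for result in parse(response):
            if isinstance(result, dict):
                stored.append(result)        # item -> pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)     # request -> scheduler
    return stored


items = run(["page1"])
print([item["url"] for item in items])
```

Each page is fetched once even though page3 is linked twice, because the toy scheduler skips URLs it has already seen.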

References

Scrapy 1.0 documentation

Sending email with Scrapy's mail module
