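Python Web Scraping 03: The Scrapy Library

This article is based on the official Scrapy documentation; see the documentation for details.

Scrapy example programs

Scrapy example 1: sending requests with a spider

Creating and running a Scrapy project

Creating a Scrapy project: run scrapy startproject tutorial on the command line to create a Scrapy project named tutorial. The generated project has the following directory structure: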

tutorial/
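    scrapy.cfg            # Scrapy project configuration file
    tutorial/             # the project's Python package
        __init__.py
        items.py          # item (data entity) definitions
        middlewares.py    # middleware definitions
        pipelines.py      # item pipeline definitions
        settings.py       # project settings
        spiders/          # directory that holds the spiders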
            __init__.py

Writing the spider: run scrapy genspider quotes quotes.toscrape.com on the command line. A new spider file, quotes.py, appears in the tutorial/spiders directory. Change its contents to the following:

import scrapy

class QuotesSpider(scrapy.Spider):
    # Name of the spider
    name = "quotes"

    # Generates the first requests the spider sends
    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # Handles each response
    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

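Our spider class QuotesSpider inherits from the base class Spider. Its attributes and methods have the following meaning:

name attribute: the spider's name. It must be unique within a project.
start_requests() method: returns an iterable of Request objects (a list or a generator) from which the spider starts crawling.
parse(self, response) method: defines how the data returned by each request is handled. The response argument is a TextResponse object holding the response to the request. The method can return the scraped data as dicts or Item objects, or create new Request objects to send further requests.

Running the spider: run scrapy crawl quotes on the command line to start the spider we just wrote. The console shows the log output, and the downloaded pages are saved to files, exactly as defined in parse().

How the spider above executes: Scrapy schedules the Request objects returned by start_requests(). When a response arrives, it instantiates a Response object and calls the callback given in the callback argument (here, parse()), passing the Response object as the argument.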
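
To see where the output filenames come from, here is a minimal standalone sketch of the string handling inside parse(). It uses plain Python and one of the tutorial URLs; no request is sent.

```python
# Sketch: how parse() derives the output filename from the response URL.
url = 'http://quotes.toscrape.com/page/2/'

# Splitting on "/" leaves an empty string after the trailing slash,
# so the page number is the second-to-last element.
parts = url.split('/')
page = parts[-2]
filename = 'quotes-%s.html' % page

print(parts)
print(filename)
```

The same expression would pick a different path segment for URLs that lack the trailing slash, so it only suits URLs shaped like the ones in this tutorial.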

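Using the `start_urls` attribute instead of the `start_requests()` method to specify the initial requests

Defining a list of URLs in the start_urls attribute can replace start_requests() for issuing the spider's initial requests. Scrapy iterates over start_urls, sends a request for each URL, and uses parse() as the default callback.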

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    # Define start_urls instead of the start_requests() method
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

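Scrapy example 2: parsing responses with a spider

Parsing the response and extracting data

Parsing response data: the response argument of parse(self, response) is a Response object. Call its css() and xpath() methods to query the response with CSS and XPath expressions; regular-expression matching is available through the re() method of the selectors they return.

Calling css() or xpath() on the response returns a list of Selector objects (a SelectorList). Its getall() method returns the content of every match as a list of strings, and get(self, default=None) returns the content of the first match as a string.

The extract() and extract_first() methods of Selector are the older names for getall() and get(). They still work, but get() and getall() are the preferred spelling.
```
response.css('title')
# returns [<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

response.css('title').getall()
# returns ['<title>Quotes to Scrape</title>']

response.css('title::text')
# returns [<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]

response.css('title::text').get()
# returns 'Quotes to Scrape'
```
Calling re() on the selectors also returns the content of all matches as a list of strings.
```
response.css('title::text').re(r'Quotes.*')
# returns ['Quotes to Scrape']

response.css('title::text').re(r'Q\w+')
# returns ['Quotes']

response.css('title::text').re(r'(\w+) to (\w+)')
# returns ['Quotes', 'Scrape']
```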

Yielding a dict from parse() returns the scraped data in dict form.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    # Parse the page and return each quote as a dict
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

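Run scrapy crawl quotes on the command line to start the spider; the scraped data is printed to the console.

Storing the parsed data in a file

Append the -o [output file] option to scrapy crawl to store the data produced by the spider in a file.

scrapy crawl quotes -o quotes.json

After running this command, a quotes.json file appears in the current directory, containing the scraped data in JSON format.

The data can also be stored in other formats. Older Scrapy versions took the format from a -t option; that option has since been deprecated, and recent versions infer the format from the output file's extension, for example:

scrapy crawl quotes -o quotes.csv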

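Scrapy example 3: automatic pagination with a spider

To follow pagination automatically, the URL of the next page has to be extracted. For a page like the following: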

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

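Both response.css('li.next a::attr(href)').get() and response.css('li.next a').attrib['href'] return the relative URL of the next page.

Making the `parse()` method return a `Request` to jump to the next page

parse() can return not only data but also Request objects, each of which issues another request. The callback that parses its response has to be specified explicitly.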

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # If there is a next page, request it and parse the response with parse()
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(url=next_page, callback=self.parse)

Here urljoin() converts the relative path into an absolute one, so the Request constructor receives an absolute URL.
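
The resolution performed by response.urljoin() can be tried in isolation with the standard library. This is a sketch under the assumption that the response URL is the base; Scrapy's own method builds on the same urllib.parse.urljoin function, and text responses may additionally honour a base tag in the page.

```python
from urllib.parse import urljoin

# URL of the page that was just parsed (the base for relative links)
base = 'http://quotes.toscrape.com/page/1/'

# A root-relative href, as found in the pager markup
from_root = urljoin(base, '/page/2/')

# A path-relative href resolves against the directory of the base URL
from_sibling = urljoin(base, '../2/')

# An absolute URL is returned unchanged
already_absolute = urljoin(base, 'http://example.com/other/')

print(from_root)
print(from_sibling)
print(already_absolute)
```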

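Using the `follow()` method to jump to the next page

follow() simplifies the Request creation above. Pass the address of the next page as its url argument and it returns a Request for that URL with the given callback; yielding that Request makes Scrapy follow the link. Note that follow() accepts both absolute and relative URLs.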

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # Use follow() to follow the next page; it accepts relative paths
            yield response.follow(url=next_page, callback=self.parse)

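Scrapy example 4: a complete spider

The spider below is more complete: it follows pagination automatically and scrapes the information of every author on the site.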

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

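Scrapy commands

Commands for creating and managing Scrapy projects

Creating a project:
```
scrapy startproject <project name>
```

Creating a spider:

scrapy genspider <spider name> <domain of the site to crawl>

Starting a spider:
```
scrapy crawl <spider name>
```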

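Scrapy shell

Because a spider is started through the scrapy command rather than by running a Python script directly, debugging is harder. The Scrapy shell lets us inspect Scrapy's behaviour interactively.

Command for entering the Scrapy shell

The syntax for entering the Scrapy shell is:

scrapy shell [URL]

The URL argument is the address to request; it can also be a relative or absolute path to a local file.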

# Web resources
scrapy shell 'http://quotes.toscrape.com/page/1/'

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

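Built-in functions and objects of the Scrapy shell

Inside the Scrapy shell, call shelp() to list all the built-in functions and objects.

The most commonly used ones are:

| Function or object | Purpose |
| --- | --- |
| `request` | The Request object of the most recent request |
| `response` | The Response object of the most recent request |
| `settings` | The Scrapy settings object, i.e. the contents of `settings.py` |
| `spider` | The `Spider` object that can handle the current URL. If the project has no `Spider` for the current URL, it is set to a `DefaultSpider` object |
| `fetch(url[, redirect=True])` or `fetch(req)` | Sends a request for the URL or `Request` object and updates the objects in the current scope (including `request` and `response`) |
| `view(response)` | Opens the `response` in a browser |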

fetch("https://reddit.com")
response.xpath('//title/text()').get() 	# returns 'reddit: the front page of the internet'

request = request.replace(method="POST")
fetch(request)
response.status		# returns 404
response.headers	# returns {'Accept-Ranges': ['bytes'], 'Cache-Control': ['max-age=0, must-revalidate'], 'Content-Type': ['text/html; charset=UTF-8'], ...}

Entering the Scrapy shell from a spider

Call scrapy.shell.inspect_response() in the spider code; when execution reaches that line, the Scrapy shell opens.

For example:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [ "http://example.com", "http://example.org", "http://example.net"]

    def parse(self, response):
        # Enter the Scrapy shell after requesting "http://example.org" to inspect the response
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of the parsing code

Running the spider then drops into the Scrapy shell:

>>> response.url
'http://example.org'

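Processing the scraped data

Items (data entity classes)

Pipelines usually work with item classes: define an Item class and return the data from parse() as instances of that class.

Defining an item class

Item classes are usually defined in the project's items.py file. Every item class must inherit from Item, and its fields must be declared as Field objects.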

import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    tags = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)

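Note that these Field objects are not assigned as class attributes; they are accessible through the Item.fields attribute instead.

Using an item class

Item supports the dict API, so an Item object can be used like a dictionary.

Creating an item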

product = Product(name='Desktop PC', price=1000)
print(product)
# shows the populated fields: name='Desktop PC', price=1000

Reading field values

product['name']			# returns 'Desktop PC'
product.get('name')		# returns 'Desktop PC'
product['price']		# returns 1000

# Reading a declared field that has no value yet
product['last_updated']					# raises KeyError: 'last_updated'
product.get('last_updated', 'not set')	# returns 'not set'

# Reading an undeclared field
product['lala'] 						# raises KeyError: 'lala'
product.get('lala', 'unknown field')	# returns 'unknown field'

'name' in product  					# True
'last_updated' in product  			# False
'last_updated' in product.fields  	# True
'lala' in product.fields  			# False

Setting field values

# Assigning to a declared field
product['last_updated'] = 'today'

# Assigning to an undeclared field
product['lala'] = 'test' # raises KeyError: 'Product does not support field: lala'

Iterating over the fields

product.keys()	# the populated field names, e.g. 'price' and 'name'
product.items()	# the populated (name, value) pairs, e.g. ('price', 1000) and ('name', 'Desktop PC')

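Item pipelines

An item pipeline processes the scraped data. Typical uses are:

Cleaning HTML data
Validating data (checking that the scraped data contains certain fields)
Removing duplicates (and dropping them)
Storing the scraped data in a database

Creating a pipeline

Pipelines are usually defined in the project's pipelines.py file.

A pipeline is a Python class that must define the process_item() method:

process_item(self, item, spider): defines how the pipeline handles incoming data; it is called once for every item passed to this pipeline stage. Its parameters are:
- item: the data passed on from the previous pipeline stage; must be an Item object or a dict.
- spider: the spider that scraped the item; must be a Spider.
If the pipeline keeps the item, it must return an Item object or a dict, which is passed on to the next pipeline stage. To drop the item, raise a DropItem exception; dropped items do not reach the following stages.

A pipeline can additionally define the following three methods:

open_spider(self, spider): called when the spider is opened; spider is the spider being opened.
close_spider(self, spider): called when the spider is closed; spider is the spider being closed.
from_crawler(cls, crawler): if present, it must be a class method. The Crawler calls it to create the pipeline, and it must return a pipeline instance.

The following three examples show how to create pipelines.

The first pipeline, PricePipeline, demonstrates process_item():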

from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        """
        How the pipeline handles an item:
        items that have a price are processed, items without one are dropped
        """
        if item.get('price'):
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)

The second pipeline, JsonWriterPipeline, demonstrates open_spider() and close_spider():

import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

The third pipeline, MongoPipeline, demonstrates from_crawler():

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection_name].insert_one(dict(item))
        return item

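Registering pipelines in the settings file

Once defined, a pipeline takes effect only after it is registered in the project's settings.py file through the ITEM_PIPELINES setting. This is a dict whose keys are the fully qualified class names of the pipelines and whose values are their priorities (lower values run first).

# Validate the scraped data first, then write it to the JSON file and the database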
ITEM_PIPELINES = {
    'tutorial.pipelines.PricePipeline': 300,
    'tutorial.pipelines.JsonWriterPipeline': 400,
    'tutorial.pipelines.MongoPipeline': 500,
}
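
The effect of the priority values can be checked with plain Python. The sketch below only sorts the dict the way the rule above describes (lower value first); it does not call into Scrapy, and the entries are deliberately listed out of order.

```python
# Same registrations as above, written in a different order on purpose
ITEM_PIPELINES = {
    'tutorial.pipelines.MongoPipeline': 500,
    'tutorial.pipelines.PricePipeline': 300,
    'tutorial.pipelines.JsonWriterPipeline': 400,
}

# Items pass through the pipelines in ascending order of the values
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for path in execution_order:
    print(ITEM_PIPELINES[path], path)
```

The order of the keys in the dict does not matter; only the numbers do.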

`Request` and `Response` classes

Request and Response objects wrap HTTP requests and responses, so they are worth a closer look.

`Request` class

`Request` base class

The Request base class wraps a general HTTP request. Its constructor is:

scrapy.http.Request(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, flags=None, cb_kwargs=None). The parameters are:

url (string): the URL of the request.
callback (callable): the function called with the response; it receives the corresponding Response object as its first argument.
method (string): the HTTP method, "GET" by default.
meta (dict): metadata of the request, which Scrapy components (middlewares and extensions) can read and modify. Some keys are reserved by Scrapy and should not be overwritten. The reserved keys are: dont_redirect, dont_retry, handle_httpstatus_list, handle_httpstatus_all, dont_merge_cookies, cookiejar, dont_cache, redirect_reasons, redirect_urls, bindaddress, dont_obey_robotstxt, download_timeout, download_maxsize, download_latency, download_fail_on_dataloss, proxy, ftp_user, ftp_password, referrer_policy, max_retry_times.
body (bytes or str): the request body.
headers (dict): the request headers.

cookies (dict or list): the cookies, given either as a dict or as a list of dicts; the latter form allows setting the domain and path of each cookie.

from scrapy import Request

# cookies given as a dict
request_with_cookies = Request(url="http://www.example.com",
                               cookies={'currency': 'USD', 'country': 'UY'})

# cookies given as a list of dicts
request_with_cookies = Request(url="http://www.example.com",
                               cookies=[{'name': 'currency',
                                        'value': 'USD',
                                        'domain': 'example.com',
                                        'path': '/currency'}])

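encoding (string): the encoding of the request ("utf-8" by default).
priority (int): the priority of the request (0 by default). Higher values are scheduled earlier; negative values are allowed.
dont_filter (boolean): when True, the scheduler does not filter this request out as a duplicate (False by default, meaning duplicate requests are filtered).
errback (callable): called when an exception is raised while processing the request; it receives a Failure object wrapping the exception as its first argument.
flags (list): flags attached to the request, usable for logging.
cb_kwargs (dict): keyword arguments passed to the callback.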
`Request` subclasses

`FormRequest`: form requests

The scrapy.http.FormRequest class wraps a form request. Its formdata argument is a dict holding all the form parameters.

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
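
Form parameters travel URL-encoded. The sketch below uses only the standard library to show roughly what such an encoded form body looks like; it illustrates the encoding and is not Scrapy's own code.

```python
from urllib.parse import parse_qs, urlencode

formdata = {'name': 'John Doe', 'age': '27'}

# URL-encode the form fields (spaces become "+")
body = urlencode(formdata)

# Decoding recovers the original values, each wrapped in a list
decoded = parse_qs(body)

print(body)
print(decoded)
```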

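`JsonRequest`: JSON requests

The scrapy.http.JsonRequest class wraps a JSON request. Its constructor takes two additional parameters:

data (JSON-serializable object): the payload of the JSON request. If the body argument is given, it takes precedence and data is ignored.
dumps_kwargs (dict): keyword arguments passed to json.dumps when serializing the data.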
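
What dumps_kwargs changes is easiest to see by calling json.dumps directly. This standard-library sketch only mimics the serialization step; the payload and the chosen options are made-up examples.

```python
import json

data = {'name': 'café', 'page': 1}

# Options that would be forwarded to json.dumps
dumps_kwargs = {'ensure_ascii': False, 'sort_keys': True}

# Serialize the payload into the bytes of a request body
body = json.dumps(data, **dumps_kwargs).encode('utf-8')

# For comparison: the default escapes non-ASCII characters
escaped = json.dumps(data, sort_keys=True).encode('utf-8')

print(body)
print(escaped)
```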

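`Response` class

`Response` base class

The Response base class wraps a general HTTP response. Its constructor is:

class scrapy.http.Response(url, status=200, headers=None, body=b'', flags=None, request=None). The parameters are:

url (string): the URL of the response.
status (integer): the HTTP status code, 200 by default.
headers (dict): the response headers.
body (bytes): the response body.
flags (list): flags attached to the response (such as 'cached', 'redirected'), used for logging.
request (Request object): the request that produced the response.

`Response` subclasses

`TextResponse`: text responses

The scrapy.http.TextResponse class wraps a text response and has the following attributes and methods:

encoding: the encoding of the response. It is resolved using the following four mechanisms, in decreasing order of priority:
- the encoding argument passed to the constructor
- the encoding declared in the Content-Type HTTP header
- the encoding declared in the response body
- the encoding inferred from the content of the response body
text: the response body as a unicode string, equivalent to response.body.decode(response.encoding)
selector: a Selector object for the current response body.
xpath(query): equivalent to TextResponse.selector.xpath(query)
css(query): equivalent to TextResponse.selector.css(query)
follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None): returns a new Request object for the given link. The url argument can be:
- an absolute URL
- a relative URL
- a scrapy.link.Link object
- a Selector object, e.g. response.css('a::attr(href)')[0] or response.css('a')[0]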
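
Why the encoding matters for the text attribute can be shown with plain bytes. This is a standard-library sketch of the body.decode(encoding) step only; the sample string and the Latin-1 encoding are arbitrary choices for illustration.

```python
# Bytes as a server might send them, encoded as Latin-1
body = 'café'.encode('latin-1')

# Decoding with the right encoding recovers the text
text = body.decode('latin-1')

# Decoding with the wrong encoding fails for these bytes
try:
    body.decode('utf-8')
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

print(body, text, wrong_guess_failed)
```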

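`HtmlResponse`: HTML responses

HtmlResponse is a subclass of TextResponse that can detect the encoding from the http-equiv attribute of a <meta> tag.

`XmlResponse`: XML responses

XmlResponse is a subclass of TextResponse that can detect the encoding from the XML declaration on the first line.

Scrapy middlewares

Architecture and data flow of Scrapy

The relationships between Scrapy's components and the data flow are shown in the figure below:

The data flow in Scrapy is controlled by the execution engine. A complete cycle looks like this:

1. The engine gets the initial Request objects from the spider.
2. The engine hands the Request objects to the scheduler and asks the scheduler for the next Request to process.
3. The scheduler returns the next Request to the engine.
4. The engine passes the Request to the downloader, through the downloader middlewares.
5. Once the page is downloaded, the downloader creates a Response object and returns it to the engine, again through the downloader middlewares.
6. The engine receives the Response from the downloader and passes it to the spider for processing, through the spider middlewares.
7. The spider processes the Response and returns the scraped data or new Request objects to the engine, through the spider middlewares.
8. Scraped data is sent to the item pipelines; new Request objects are handed to the scheduler, and the engine asks for the next Request to process.
9. The process repeats from step 3 until the scheduler has no more Request objects.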
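
The loop below is a deliberately simplified, synchronous sketch of this data flow in plain Python. It is not how Scrapy is implemented (the real engine is asynchronous and built on Twisted); FAKE_SITE, download(), parse() and run() are hypothetical stand-ins for a website, the downloader, a spider callback and the engine.

```python
from collections import deque

# Hypothetical in-memory "website": URL -> (items on the page, next URL)
FAKE_SITE = {
    '/page/1/': (['quote A', 'quote B'], '/page/2/'),
    '/page/2/': (['quote C'], None),
}


def download(url):
    """Stand-in for the downloader: builds a response for a URL."""
    items, next_url = FAKE_SITE[url]
    return {'url': url, 'items': items, 'next': next_url}


def parse(response):
    """Stand-in for a spider callback: yields items and new requests."""
    for text in response['items']:
        yield {'text': text}
    if response['next'] is not None:
        yield ('request', response['next'])


def run(start_urls):
    scheduler = deque(start_urls)       # initial requests are queued
    seen = set(start_urls)              # crude duplicate filter
    pipeline_output = []
    while scheduler:                    # repeat until the queue is empty
        url = scheduler.popleft()       # next request from the scheduler
        response = download(url)        # request goes out, response comes back
        for result in parse(response):  # spider handles the response
            if isinstance(result, tuple):
                # A new request goes back to the scheduler
                next_url = result[1]
                if next_url not in seen:
                    seen.add(next_url)
                    scheduler.append(next_url)
            else:
                # Scraped data goes on to the item pipelines
                pipeline_output.append(result)
    return pipeline_output


collected = run(['/page/1/'])
print(collected)
```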

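Downloader middlewares

Downloader middlewares form a chain. When a Request passes through, Scrapy calls each middleware's process_request() method in forward order; when a Response passes through, Scrapy calls each middleware's process_response() method in reverse order.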
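
The call order, with requests going down the chain and responses coming back up it, can be illustrated with a toy chain in plain Python. TraceMiddleware and send() are hypothetical helpers written for this sketch, not Scrapy classes, and the "download" is faked with a string.

```python
class TraceMiddleware:
    """Records which hook ran, in which order."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def process_request(self, request):
        self.log.append('%s.process_request' % self.name)

    def process_response(self, request, response):
        self.log.append('%s.process_response' % self.name)
        return response


def send(request, middlewares):
    # Requests travel through the chain front to back
    for middleware in middlewares:
        middleware.process_request(request)
    response = 'response for %s' % request   # pretend download
    # Responses travel through the chain back to front
    for middleware in reversed(middlewares):
        response = middleware.process_response(request, response)
    return response


log = []
chain = [TraceMiddleware('A', log), TraceMiddleware('B', log)]
result = send('GET /', chain)
print(log)
```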

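Creating a downloader middleware

A downloader middleware is a Python class that defines one or more of the following four methods:

process_request(request, spider):

Called for every Request that passes through the downloader middleware. It can return None, a Response object or a Request object, or raise an IgnoreRequest exception:
- None: the Request is passed on normally. Scrapy calls process_request() of the next middleware or hands the Request to the downloader.
- A Response object: the Request stops here and that Response is returned directly. Scrapy does not call process_request() or process_exception() of the remaining middlewares, and the Request never reaches the downloader. The process_response() methods of the middlewares are still called on the returned Response.
- A Request object: the original Request is abandoned and the new Request is sent to the scheduler. Scrapy stops calling process_request() of the remaining middlewares for the original Request.
- Raising IgnoreRequest: Scrapy ignores the Request and calls the process_exception() methods of the installed middlewares. If none of them handles the exception, the errback of the original Request is called. If no errback is set either, the exception is ignored (and not even logged).

process_response(request, response, spider):

Called for every Response that passes through the downloader middleware. It can return a Response object or a Request object, or raise an IgnoreRequest exception:
- A Response object (either the one passed in or a brand-new one): it is passed on to process_response() of the next middleware.
- A Request object: the original Request is abandoned and the new Request is sent to the scheduler. Scrapy behaves exactly as when process_request() returns a Request.
- Raising IgnoreRequest: Scrapy calls the errback of the original Request. If no errback is set, the exception is ignored (and not even logged).

process_exception(request, exception, spider):

Called when the downloader or a middleware's process_request() raises an exception. It can return None, a Response object or a Request object:
- None: the exception continues to propagate. Scrapy calls process_exception() of the next middleware.
- A Response object: the exception is swallowed and that Response is returned directly. Scrapy does not call process_exception() of any other middleware.
- A Request object: the exception is swallowed and a new Request is sent to the scheduler. Again, Scrapy does not call process_exception() of any other middleware.

from_crawler(cls, crawler):

If present, it must be a class method. The Crawler calls it to create the middleware, and it must return a middleware instance.

The middleware below adds a proxy to requests.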

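import random

from scrapy.exceptions import NotConfigured
# Scrapy's downloader reports network failures as Twisted exceptions,
# not as the Python built-ins of the same name
from twisted.internet.error import ConnectionRefusedError, TimeoutError

class CustomProxyMiddleware(object):

    def __init__(self, settings):
        self.proxies = settings.getlist('PROXIES')

    @classmethod
    def from_crawler(cls, crawler):
        '''
        Creation logic: the HTTPPROXY_ENABLED setting decides whether the middleware is created
        '''
        if not crawler.settings.getbool('HTTPPROXY_ENABLED'):
            raise NotConfigured
        return cls(crawler.settings)

    def process_request(self, request, spider):
        '''
        Request logic: add a proxy if the request does not specify one
        '''
        if not request.meta.get('proxy') and self.proxies:
            request.meta['proxy'] = random.choice(self.proxies)

    def _retry_without_proxy(self, request):
        # Remove the failing proxy and re-send the request; dont_filter=True
        # keeps the duplicate filter from discarding the repeated URL
        proxy = request.meta.pop('proxy', None)
        if proxy in self.proxies:
            self.proxies.remove(proxy)
        return request.replace(dont_filter=True)

    def process_response(self, request, response, spider):
        '''
        Response logic: judge from the response whether the proxy works
            if not, remove the proxy and re-send the request
            if so, return the response unchanged
        '''
        if response.status in (401, 403) and request.meta.get('proxy'):
            return self._retry_without_proxy(request)
        return response

    def process_exception(self, request, exception, spider):
        '''
        Exception logic:
            for proxy network errors, remove the proxy and re-send the request
            other exceptions are left to other middlewares or to the request itself
        '''
        if request.meta.get('proxy') and isinstance(exception, (ConnectionRefusedError, TimeoutError)):
            return self._retry_without_proxy(request)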

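The code above only demonstrates how a middleware is defined. To route requests through a proxy in practice, rely on the built-in HttpProxyMiddleware, which applies the proxy set in request.meta['proxy'].

Registering downloader middlewares in the settings file

Downloader middlewares are registered in the DOWNLOADER_MIDDLEWARES setting of the project's settings.py file. DOWNLOADER_MIDDLEWARES is merged with DOWNLOADER_MIDDLEWARES_BASE, and the middlewares are called in the order of their assigned weights.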

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.CustomProxyMiddleware': 543,
}
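
The merge can be pictured with two plain dicts. In this sketch BASE is only an illustrative excerpt standing in for DOWNLOADER_MIDDLEWARES_BASE; the three weights are believed to match Scrapy's defaults but should be checked against the installed version. Setting a value to None is the documented way to disable a built-in middleware.

```python
# Illustrative excerpt of the built-in defaults (check your Scrapy version)
BASE = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

# Project setting: add one middleware and disable a built-in one
CUSTOM = {
    'myproject.middlewares.CustomProxyMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}

# Project values override the defaults; None entries are dropped
merged = {**BASE, **CUSTOM}
enabled = sorted(
    (path for path, weight in merged.items() if weight is not None),
    key=merged.get,
)

for path in enabled:
    print(merged[path], path)
```

With weight 543 the custom middleware sees requests before the retry and proxy middlewares, and sees responses after them.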

Spider middlewares

Creating and using spider middlewares is much the same as for downloader middlewares; see the official documentation.
