Scrapy: a code snippet for stopping a spider proactively (Python 3)
Question: while a Scrapy crawl is running, how do you make the spider quit on its own?
Background: say I only want to crawl today's news. While iterating over the entries, once more than 1 article turns out not to be from today, I want to stop crawling and shut the spider down. How do I do that?
IDE: PyCharm
Version: Python 3
Framework: Scrapy
OS: Windows 10
The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from torrentSpider.items.NavigationItem import NavigationItem
from torrentSpider.items.TorrentItem import TorrentItem
import time
import random
import logging


class XxxSpider(scrapy.Spider):
    name = "xxx_spider"
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/1.html']
    # site URL prefix
    web_pre_url = 'http://xxx.com'
    # counter of items that are not from the current day
    count = 0

    def parse(self, response):
        # add a random delay (0-5 s) before each request
        time.sleep(random.randint(0, 5))
        # number of entries in the navigation bar
        navigation_type_number = response.xpath('//*[@id="hypoNav"]/div/ul/li/em/a/text()').extract()
        for n_k in range(1, len(navigation_type_number)):
            navigation_item = NavigationItem()
            # site title
            navigation_item['navigation_title'] = response.xpath('//*[@id="logoSea"]/div[1]/a/img/@alt').extract()[0]
            # navigation category name
            navigation_item['navigation_type'] = response.xpath('//*[@id="hypoNav"]/div/ul/li[' + str(n_k + 1) + ']/em/a/text()').extract()[0]
            # navigation link
            navigation_item['navigation_url'] = response.xpath('//*[@id="hypoNav"]/div/ul/li[' + str(n_k + 1) + ']/em/a/@href').extract()[0]

        # number of entries in the sub-navigation bar
        sub_navigation_type_number = response.xpath('//*[@id="nodeNav"]/div/ul/li/em/a/span/text()').extract()
        for sub_k in range(1, len(sub_navigation_type_number)):
            sub_navigation_item = NavigationItem()
            # site title
            sub_navigation_item['navigation_title'] = response.xpath('//*[@id="logoSea"]/div[1]/a/img/@alt').extract()[0]
            # sub-navigation category name
            sub_navigation_item['sub_navigation_type'] = response.xpath('//*[@id="nodeNav"]/div/ul/li[' + str(sub_k) + ']/em/a/span/text()').extract()[0]
            # sub-navigation link
            sub_navigation_item['sub_navigation_url'] = response.xpath('//*[@id="nodeNav"]/div/ul/li[' + str(sub_k) + ']/em/a/@href').extract()[0]

        # number of movie rows on the current page
        movie_name_tr_array = response.xpath('/html/body/div[2]/table[1]/tr/td[1]/table[2]/tbody/tr').extract()
        for i_k in range(1, len(movie_name_tr_array)):
            # detail-page link
            str_sub_url = '/html/body/div[2]/table[1]/tr/td[1]/table[2]/tbody/tr[' + str(i_k) + ']/td[1]/a/@href'
            m_link = self.web_pre_url + response.xpath(str_sub_url).extract()[0]
            yield scrapy.Request(url=m_link, callback=self.parse_links, dont_filter=True)

        # follow the next page
        next_link = response.xpath('//*[@class="pagegbk"]/@href').extract()
        if next_link:
            if len(next_link) == 1:
                next_link = next_link[0]
            else:
                next_link = next_link[1]
            yield scrapy.Request(self.web_pre_url + next_link, callback=self.parse)

    # parse each detail page
    def parse_links(self, response):
        torrent_item = TorrentItem()
        # title
        torrent_item['torrent_title'] = self.check_xpath_value(response, '/html/body/div[2]/table[1]/tbody/tr/td/font/text()')
        # movie name
        torrent_item['torrent_name'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[1]/text()')
        # director
        torrent_item['torrent_director'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[2]/text()')
        # cast
        torrent_item['torrent_actor'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/span/font[2]/text()')
        # language
        torrent_item['torrent_language'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[3]/text()')
        # genre
        torrent_item['torrent_type'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[4]/text()')
        # region
        torrent_item['torrent_region'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[5]/text()')
        # update time
        torrent_item['torrent_update_time'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[6]/text()')
        # status
        torrent_item['torrent_status'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[7]/text()')
        # release date
        torrent_item['torrent_show_time'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[8]/text()')
        # synopsis
        torrent_item['torrent_introduction'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[2]/text()')
        # download address
        torrent_item['torrent_url'] = self.check_xpath_value(response, '//*[@id="plist"]/table[2]/tbody/tr[2]/td/ul/li/input/@value')

        # get the current date, formatted
        current_date = time.strftime('%Y-%m-%d', time.localtime())
        print('current_date = %s' % str(current_date))
        print('torrent_update_time = %s' % torrent_item['torrent_update_time'])
        # only yield items updated today; count everything else
        if torrent_item['torrent_update_time'] == str(current_date):
            yield torrent_item
        else:
            self.count = self.count + 1
        # once the count exceeds 1, stop crawling
        if self.count > 1:
            # logging.info('count exceeded 1, stopping the spider')
            self.crawler.engine.close_spider(self, 'count exceeded 1, stopping the spider!')

    # return the first non-empty match for the xpath, or "null"
    @staticmethod
    def check_xpath_value(response, xpath_url):
        xpath_value = response.xpath(xpath_url).extract()
        if xpath_value and xpath_value[0].strip() != '':
            return xpath_value[0]
        return "null"
Note the key line in the code above:
self.crawler.engine.close_spider(self, 'count exceeded 1, stopping the spider!')
1. This line is written inside the spider file itself.
2. Although this line does stop the spider, it does not stop it immediately.
The reason is that, if we leave the project's settings.py file unchanged, the default configuration is:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
This means: the maximum number of concurrent requests the Scrapy downloader will perform; the default is 16.
And that is exactly the problem here: written as above, a dozen or so requests are already sitting in the queue, and after you stop the spider those requests will still run to completion, so the stop is not immediate. If you want to change that, you must change this setting to:
CONCURRENT_REQUESTS = 1
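
Note that setting CONCURRENT_REQUESTS = 1 in settings.py throttles the whole project. If you only want this behaviour for one spider, Scrapy also supports per-spider overrides via the custom_settings class attribute; a minimal sketch, reusing the example spider's name from above:

import scrapy


class XxxSpider(scrapy.Spider):
    name = 'xxx_spider'
    # per-spider override: applies only while this spider runs,
    # so other spiders in the project keep the default concurrency
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }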
For the details of how the Scrapy engine works internally, please look it up (Baidu it) and experiment yourself, thanks~
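
For completeness: besides calling self.crawler.engine.close_spider(...) directly, Scrapy also provides a dedicated exception, scrapy.exceptions.CloseSpider, that can be raised from any callback to stop the spider. Below is a minimal sketch of the same "more than 1 non-current-day item" check using it; the spider name, URL, and XPath here are placeholder assumptions, not taken from the site above:

import time

import scrapy
from scrapy.exceptions import CloseSpider


class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    start_urls = ['http://www.xxx.com/1.html']  # placeholder URL
    count = 0  # non-current-day items seen so far

    def parse(self, response):
        today = time.strftime('%Y-%m-%d', time.localtime())
        # hypothetical XPath; adapt it to the actual page structure
        for date_text in response.xpath('//span[@class="date"]/text()').extract():
            if date_text.strip() == today:
                yield {'date': date_text.strip()}
            else:
                self.count += 1
                if self.count > 1:
                    # like engine.close_spider(), this shuts down gracefully:
                    # requests already in flight will still finish
                    raise CloseSpider('more than 1 non-current-day item')

Raising CloseSpider asks the engine for a graceful shutdown, so the same caveat applies: requests already scheduled will still run to completion unless you also lower CONCURRENT_REQUESTS.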