Scrapy: a code snippet for stopping a spider proactively (Python 3)
Question: while a Scrapy crawl is running, how do you make the spider quit on its own?
Background: say I only want to crawl today's news. While iterating over the entries, once more than 1 article turns out not to be from today, I want to stop crawling and shut the spider down. How do I do that?
IDE: PyCharm
Version: Python 3
Framework: Scrapy
OS: Windows 10
The code is as follows:
# -*- coding: utf-8 -*-
import scrapy
from torrentSpider.items.NavigationItem import NavigationItem
from torrentSpider.items.TorrentItem import TorrentItem
import time
import random
import logging


class XxxSpider(scrapy.Spider):
    name = "xxx_spider"
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/1.html']
    # site URL prefix
    web_pre_url = 'http://xxx.com'
    # counter of items that are not from the current day
    count = 0

    def parse(self, response):
        # add a random delay (0-5 s) before each request
        time.sleep(random.randint(0, 5))
        # number of entries in the navigation bar
        navigation_type_number = response.xpath('//*[@id="hypoNav"]/div/ul/li/em/a/text()').extract()
        for n_k in range(1, len(navigation_type_number)):
            navigation_item = NavigationItem()
            # site title
            navigation_item['navigation_title'] = response.xpath('//*[@id="logoSea"]/div[1]/a/img/@alt').extract()[0]
            # navigation category name
            navigation_item['navigation_type'] = response.xpath('//*[@id="hypoNav"]/div/ul/li[' + str(n_k + 1) + ']/em/a/text()').extract()[0]
            # navigation link
            navigation_item['navigation_url'] = response.xpath('//*[@id="hypoNav"]/div/ul/li[' + str(n_k + 1) + ']/em/a/@href').extract()[0]

        # number of entries in the sub-navigation bar
        sub_navigation_type_number = response.xpath('//*[@id="nodeNav"]/div/ul/li/em/a/span/text()').extract()
        for sub_k in range(1, len(sub_navigation_type_number)):
            sub_navigation_item = NavigationItem()
            # site title
            sub_navigation_item['navigation_title'] = response.xpath('//*[@id="logoSea"]/div[1]/a/img/@alt').extract()[0]
            # sub-navigation category name
            sub_navigation_item['sub_navigation_type'] = response.xpath('//*[@id="nodeNav"]/div/ul/li[' + str(sub_k) + ']/em/a/span/text()').extract()[0]
            # sub-navigation link
            sub_navigation_item['sub_navigation_url'] = response.xpath('//*[@id="nodeNav"]/div/ul/li[' + str(sub_k) + ']/em/a/@href').extract()[0]

        # number of movie rows on the current page
        movie_name_tr_array = response.xpath('/html/body/div[2]/table[1]/tr/td[1]/table[2]/tbody/tr').extract()
        for i_k in range(1, len(movie_name_tr_array)):
            # detail-page link
            str_sub_url = '/html/body/div[2]/table[1]/tr/td[1]/table[2]/tbody/tr[' + str(i_k) + ']/td[1]/a/@href'
            m_link = self.web_pre_url + response.xpath(str_sub_url).extract()[0]
            yield scrapy.Request(url=m_link, callback=self.parse_links, dont_filter=True)

        # follow the next page
        next_link = response.xpath('//*[@class="pagegbk"]/@href').extract()
        if next_link:
            if len(next_link) == 1:
                next_link = next_link[0]
            else:
                next_link = next_link[1]
            yield scrapy.Request(self.web_pre_url + next_link, callback=self.parse)

    # parse each detail page
    def parse_links(self, response):
        torrent_item = TorrentItem()
        # title
        torrent_item['torrent_title'] = self.check_xpath_value(response, '/html/body/div[2]/table[1]/tbody/tr/td/font/text()')
        # movie name
        torrent_item['torrent_name'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[1]/text()')
        # director
        torrent_item['torrent_director'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[2]/text()')
        # cast
        torrent_item['torrent_actor'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/span/font[2]/text()')
        # language
        torrent_item['torrent_language'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[3]/text()')
        # genre
        torrent_item['torrent_type'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[4]/text()')
        # region
        torrent_item['torrent_region'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[5]/text()')
        # update time
        torrent_item['torrent_update_time'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[6]/text()')
        # status
        torrent_item['torrent_status'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[7]/text()')
        # release date
        torrent_item['torrent_show_time'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[1]/font[8]/text()')
        # synopsis
        torrent_item['torrent_introduction'] = self.check_xpath_value(response, '/html/body/div[2]/table[2]/tbody/tr/td/div[2]/text()')
        # download address
        torrent_item['torrent_url'] = self.check_xpath_value(response, '//*[@id="plist"]/table[2]/tbody/tr[2]/td/ul/li/input/@value')

        # get the current date, formatted
        current_date = time.strftime('%Y-%m-%d', time.localtime())
        print('current_date = %s' % str(current_date))
        print('torrent_update_time = %s' % torrent_item['torrent_update_time'])
        # only yield items updated today; count everything else
        if torrent_item['torrent_update_time'] == str(current_date):
            yield torrent_item
        else:
            self.count = self.count + 1
        # once the count exceeds 1, stop crawling
        if self.count > 1:
            # logging.info('count exceeded 1, stopping the spider')
            self.crawler.engine.close_spider(self, 'count exceeded 1, stopping the spider!')

    # return the first non-empty match for the xpath, or "null"
    @staticmethod
    def check_xpath_value(response, xpath_url):
        xpath_value = response.xpath(xpath_url).extract()
        if xpath_value and xpath_value[0].strip() != '':
            return xpath_value[0]
        return "null"
Note the key line in the code above:
self.crawler.engine.close_spider(self, 'count exceeded 1, stopping the spider!')
1. This line is written inside the spider file itself.
2. Although this line does stop the spider, it does not stop it immediately.
The reason is that, if we leave the project's settings.py file unchanged, the default configuration is:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
This means: the maximum number of concurrent requests the Scrapy downloader will perform; the default is 16.
And that is exactly the problem here: written as above, a dozen or so requests are already sitting in the queue, and after you stop the spider those requests will still run to completion, so the stop is not immediate. If you want to change that, you must change this setting to:
CONCURRENT_REQUESTS = 1
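
Note that setting CONCURRENT_REQUESTS = 1 in settings.py throttles the whole project. If you only want this behaviour for one spider, Scrapy also supports per-spider overrides via the custom_settings class attribute; a minimal sketch, reusing the example spider's name from above:

import scrapy


class XxxSpider(scrapy.Spider):
    name = 'xxx_spider'
    # per-spider override: applies only while this spider runs,
    # so other spiders in the project keep the default concurrency
    custom_settings = {
        'CONCURRENT_REQUESTS': 1,
    }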
For the details of how the Scrapy engine works internally, please look it up (Baidu it) and experiment yourself, thanks~
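
For completeness: besides calling self.crawler.engine.close_spider(...) directly, Scrapy also provides a dedicated exception, scrapy.exceptions.CloseSpider, that can be raised from any callback to stop the spider. Below is a minimal sketch of the same "more than 1 non-current-day item" check using it; the spider name, URL, and XPath here are placeholder assumptions, not taken from the site above:

import time

import scrapy
from scrapy.exceptions import CloseSpider


class NewsSpider(scrapy.Spider):
    name = 'news_spider'
    start_urls = ['http://www.xxx.com/1.html']  # placeholder URL
    count = 0  # non-current-day items seen so far

    def parse(self, response):
        today = time.strftime('%Y-%m-%d', time.localtime())
        # hypothetical XPath; adapt it to the actual page structure
        for date_text in response.xpath('//span[@class="date"]/text()').extract():
            if date_text.strip() == today:
                yield {'date': date_text.strip()}
            else:
                self.count += 1
                if self.count > 1:
                    # like engine.close_spider(), this shuts down gracefully:
                    # requests already in flight will still finish
                    raise CloseSpider('more than 1 non-current-day item')

Raising CloseSpider asks the engine for a graceful shutdown, so the same caveat applies: requests already scheduled will still run to completion unless you also lower CONCURRENT_REQUESTS.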