Downloader Middlewares
Downloader middlewares sit between the engine and the downloader. In a downloader middleware we can set proxies, swap request headers, and so on to get around anti-crawler measures. To write one, implement one or both of these methods on the middleware class: process_request(self,request,spider), which runs before the request is sent, and process_response(self,request,response,spider), which runs before the downloaded data reaches the engine.
process_request(self,request,spider):
This method runs before the downloader sends the request. It is the usual place to set a random proxy IP and the like.
- Parameters:
request: the Request object being sent. spider: the Spider that issued the request.
- Return value:
Return None: Scrapy continues processing this request, running the corresponding methods of the other middlewares until the appropriate downloader handler is called.
Return a Response object: Scrapy will not call any other process_request method (or the downloader); it uses this response directly. The process_response() methods of the activated middlewares are still called for it on the way back.
Return a Request object: the original request is no longer used for the download; the newly returned request is scheduled instead.
If this method raises an exception, the process_exception methods are called.
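As a sketch of the process_request contract: the middleware below (a hypothetical example, not part of Scrapy; the proxy addresses are placeholders) mutates the request and returns None, which tells Scrapy to keep processing the request through the remaining middlewares and the downloader.

```python
import random

class RandomProxyMiddleware(object):
    # Hypothetical middleware illustrating the return-None path of
    # process_request: mutate the request, then let Scrapy continue.
    PROXIES = ["127.0.0.1:8888", "127.0.0.1:9999"]  # placeholder addresses

    def process_request(self, request, spider):
        request.meta['proxy'] = "http://" + random.choice(self.PROXIES)
        return None  # None -> Scrapy keeps processing this request
```

Because the method only touches request.meta, any object with a meta dict can be used to exercise it outside Scrapy.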
process_response(self,request,response,spider):
This method runs on the data the downloader has fetched, before it reaches the engine.
- Parameters:
request: the Request object. response: the Response being processed. spider: the Spider object.
- Return value:
Return a Response object: the new response is passed on to the remaining middlewares and, finally, to the spider.
Return a Request object: the downloader chain is cut short, and the returned request is rescheduled for download.
If an exception is raised, the request's errback is called; if no errback is set, the exception propagates.
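The process_response contract can be sketched the same way (again a hypothetical middleware, not Scrapy's API): returning the request re-schedules the download, returning the response lets it continue toward the spider.

```python
class RetryOnBlockMiddleware(object):
    # Hypothetical middleware illustrating process_response return values.
    def process_response(self, request, response, spider):
        if response.status != 200:
            return request   # cut the chain; re-schedule the download
        return response      # pass the response on toward the spider
```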
Random request header middleware:
Randomly switching request headers can be implemented in a downloader middleware: pick a User-Agent at random before each request is sent to the server, so the crawler is not always using the same one.
User-Agent list: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser
import random

class UserAgentDownloadMiddleware(object):
USER_AGENTS=[
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
"Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
"Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
"Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
"Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
"Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
"Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
]
def process_request(self,request,spider):
user_agent = random.choice(self.USER_AGENTS)
request.headers['User-Agent']=user_agent
IP proxy pool middleware
Some proxy providers:
Zhima proxy: http://http.zhimaruanjian.com/
Taiyang proxy: http://http.taiyangruanjian.com/
Kuaidaili: http://www.kuaidaili.com/
Xdaili: http://www.xdaili.cn/
Mayidaili: http://www.mayidaili.com/
- Open proxy pool:
import random

class IPProxyDownloadMiddleware(object):
PROXIES = [
"ip:端口",
"ip:端口",
"ip:端口",
]
def process_request(self,request,spider):
proxy = random.choice(self.PROXIES)
        print('Selected proxy: %s' % proxy)
request.meta['proxy'] = "http://" + proxy
- Dedicated proxy pool:
import base64

class IPProxyDownloadMiddleware(object):
def process_request(self,request,spider):
        proxy = 'ip:port'
user_password = "xxxx:xxxx"
        request.meta['proxy'] = "http://" + proxy  # Scrapy expects a URL with a scheme
# bytes
b64_user_password = base64.b64encode(user_password.encode('utf-8'))
request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_password.decode('utf-8')
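The Proxy-Authorization value built above is standard HTTP Basic authentication. The helper below (the function name is mine, not Scrapy's) shows the encoding step in isolation:

```python
import base64

def basic_auth_header(user_password):
    # "user:password" -> "Basic dXNlcjpwYXNzd29yZA=="
    b64 = base64.b64encode(user_password.encode('utf-8'))
    return 'Basic ' + b64.decode('utf-8')
```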
Settings configuration
- BOT_NAME: the project name.
- ROBOTSTXT_OBEY: whether to obey robots.txt. The library default is False, but projects generated by scrapy startproject set it to True.
- CONCURRENT_ITEMS: the maximum number of items processed concurrently in the pipelines. Default is 100.
- CONCURRENT_REQUESTS: the maximum number of concurrent requests performed by the downloader. Default is 16.
- DEFAULT_REQUEST_HEADERS: default request headers. Put headers that rarely change here.
- DEPTH_LIMIT: the maximum crawl depth. Default is 0, which means no limit.
- DOWNLOAD_DELAY: how long the downloader waits before fetching the next page. Use it to throttle the crawl and reduce load on the target server; fractional values are supported.
- DOWNLOAD_TIMEOUT: the downloader timeout.
- ITEM_PIPELINES: the pipelines that process items, as a dict mapping each pipeline class's dotted import path to an integer priority; the lower the value, the higher the priority.
- LOG_ENABLED: whether logging is enabled. Default is True.
- LOG_ENCODING: the log encoding.
- LOG_LEVEL: the log level. Default is DEBUG; the available levels are CRITICAL, ERROR, WARNING, INFO, DEBUG.
- USER_AGENT: the default User-Agent header. Defaults to Scrapy/VERSION (+http://scrapy.org).
- PROXIES: proxy settings (read by custom middlewares such as the ones above; not a Scrapy built-in).
- COOKIES_ENABLED: whether cookies are enabled. Usually left disabled so the crawler is harder to track, but enable it when the site requires cookies.
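A minimal settings.py fragment combining a few of the options above (the project and pipeline names are placeholders):

```python
# Sketch of a settings.py fragment; 'myproject' and MyPipeline are placeholders.
BOT_NAME = 'myproject'
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 16
DOWNLOAD_DELAY = 1.5                    # fractional seconds are allowed
DOWNLOAD_TIMEOUT = 30
LOG_LEVEL = 'INFO'
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # lower value = higher priority
}
```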
Crawling BOSS job listings
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from BOS.items import BosItem
class BosspiderSpider(CrawlSpider):
name = 'Bosspider'
allowed_domains = ['zhipin.com']
start_urls = ['https://www.zhipin.com/job_detail/?query=python&city=100010000&industry=&position=']
rules = (
Rule(LinkExtractor(allow=r'.+\?query=python&page=\d+.+'), follow=True),
Rule(LinkExtractor(allow=r'.+job_detail.+html.+'), callback='parse_job', follow=True),
)
def parse_job(self, response):
info_primary = response.xpath("//div[@class='info-primary']")
name = info_primary.xpath(".//div[@class='name']/h1/text()").get()
salary = info_primary.xpath(".//span[@class='salary']/text()").get()
job_info = info_primary.xpath(".//p/text()").getall()
city = job_info[0].strip()
work_years = job_info[1].strip()
education = job_info[2].strip()
company = response.xpath("//div[@class='company-info']/div[@class='info']/text()").get().strip()
yield BosItem(name=name,salary=salary,city=city,
work_years=work_years,education=education,company=company)
import scrapy
class BosItem(scrapy.Item):
name = scrapy.Field()
salary = scrapy.Field()
city = scrapy.Field()
work_years = scrapy.Field()
education = scrapy.Field()
company = scrapy.Field()
import json
import random
import requests
from scrapy import signals
from twisted.internet.defer import DeferredLock
from BOS.models import ProxyModel
class UserAgentDownloadMiddleware(object):
USER_AGENTS=[
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
"Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
"Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
"Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
"Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
"Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
"Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
"Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10",
"Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko ) Version/5.1 Mobile/9B176 Safari/7534.48.3",
]
def process_request(self,request,spider):
user_agent = random.choice(self.USER_AGENTS)
request.headers['User-Agent']=user_agent
class IPProxyDownloadMiddleware(object):
PROXY_URL = ""
def __init__(self):
super(IPProxyDownloadMiddleware,self).__init__()
self.current_proxy = None
self.lock = DeferredLock()
def process_request(self,request,spider):
        if not self.current_proxy or self.current_proxy.is_expiring:
            self.update_proxy()
        request.meta['proxy'] = self.current_proxy.proxy
def process_response(self,request,response,spider):
if response.status != 200 or "captcha" in response.url:
if not self.current_proxy.blacked:
self.current_proxy.blacked = True
print("%s這個代理被加入黑名單了"%self.current_proxy.ip)
self.update_proxy()
return request
return response
    def update_proxy(self):
        # the DeferredLock prevents concurrent requests from all fetching
        # a new proxy at once; release it even if the fetch fails
        self.lock.acquire()
        try:
            if not self.current_proxy or self.current_proxy.is_expiring or self.current_proxy.blacked:
                response = requests.get(self.PROXY_URL)
                text = response.text
                print('Fetched a new proxy:', text)
                result = json.loads(text)
                if len(result['data']) > 0:
                    data = result['data'][0]
                    proxy_model = ProxyModel(data)
                    self.current_proxy = proxy_model
        finally:
            self.lock.release()
from datetime import datetime,timedelta
class ProxyModel(object):
def __init__(self,data):
self.ip = data['ip']
self.port = data['port']
self.expire_str = data['expire_time']
self.blacked = False
        date_str, time_str = self.expire_str.split(" ")
year,month,day =date_str.split("-")
hour,minute,second = time_str.split(":")
self.expire_time = datetime(year=int(year),month=int(month),day=int(day),
hour=int(hour),minute=int(minute),second=int(second))
self.proxy = "https://{}:{}".format(self.ip,self.port)
@property
def is_expiring(self):
now = datetime.now()
if (self.expire_time-now) < timedelta(seconds=5):
return True
else:
return False
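The is_expiring property above treats a proxy as expiring once fewer than five seconds remain before its expiry time. The standalone equivalent (the function name and explicit now parameter are mine, added so the check is testable):

```python
from datetime import datetime, timedelta

def is_expiring(expire_time, now, margin_seconds=5):
    # True once fewer than `margin_seconds` remain before expiry
    return (expire_time - now) < timedelta(seconds=margin_seconds)
```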
from scrapy.exporters import JsonLinesItemExporter
class BosPipeline(object):
def __init__(self):
self.fp = open('jobs.json','wb')
self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False)
def process_item(self, item, spider):
self.exporter.export_item(item)
return item
def close_spider(self,spider):
self.fp.close()
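JsonLinesItemExporter writes one JSON object per line, which is why the pipeline can open the file in 'wb' mode and simply append. A hand-rolled equivalent (the function name is mine) makes the format explicit:

```python
import json

def export_jsonlines(items, fp):
    # one JSON object per line, UTF-8 encoded, like JsonLinesItemExporter
    for item in items:
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        fp.write(line.encode("utf-8"))
```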
# -*- coding: utf-8 -*-
# Scrapy settings for BOS project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'BOS'
SPIDER_MODULES = ['BOS.spiders']
NEWSPIDER_MODULE = 'BOS.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'BOS (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
'(KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36',
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'BOS.middlewares.BosSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'BOS.middlewares.BosDownloaderMiddleware': 543,
'BOS.middlewares.UserAgentDownloadMiddleware':100,
'BOS.middlewares.IPProxyDownloadMiddleware':200,
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'BOS.pipelines.BosPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
from scrapy import cmdline
cmdline.execute("scrapy crawl Bosspider".split())
Selenium + Scrapy: crawling Jianshu's Ajax-loaded data
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from jianshu.items import JianshuItem
class JianshuspiderSpider(CrawlSpider):
name = 'jianshuspider'
allowed_domains = ['jianshu.com']
start_urls = ['http://jianshu.com/']
rules = (
Rule(LinkExtractor(allow=r'.*/p/[a-z0-9]{12}.*'), callback='parse_detail', follow=True),
)
def parse_detail(self, response):
title = response.xpath("//h1[@class='title']/text()").get()
avatar = response.xpath("//a[@class='avatar']/img/@src").get()
author = response.xpath("//span[@class='name']/a/text()").get()
pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*", "")
url = response.url
url1 = url.split("?")[0]
article_id = url1.split('/')[-1]
content = response.xpath("//div[@class='show-content']").get()
word_count = response.xpath("//span[@class='wordage']/text()").get()
comment_count = response.xpath("//span[@class='comments-count']/text()").get()
read_count = response.xpath("//span[@class='views-count']/text()").get()
like_count = response.xpath("//span[@class='likes-count']/text()").get()
subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())
item = JianshuItem(
title=title,
avatar=avatar,
author=author,
pub_time=pub_time,
origin_url=response.url,
article_id=article_id,
content=content,
subjects=subjects,
word_count=word_count,
comment_count=comment_count,
read_count=read_count,
like_count=like_count
)
yield item
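parse_detail derives article_id by stripping the query string from the URL and taking the last path segment. That logic in isolation (the function name is mine; the example URL is made up):

```python
def extract_article_id(url):
    # drop the query string, then take the last path segment
    return url.split("?")[0].split("/")[-1]
```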
import scrapy
class JianshuItem(scrapy.Item):
title = scrapy.Field()
content = scrapy.Field()
article_id = scrapy.Field()
origin_url = scrapy.Field()
author = scrapy.Field()
avatar = scrapy.Field()
pub_time = scrapy.Field()
read_count = scrapy.Field()
like_count = scrapy.Field()
word_count = scrapy.Field()
comment_count = scrapy.Field()
subjects = scrapy.Field()
middlewares.py
If you also want random proxy IPs here, add them through Selenium's own proxy options.
import time
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'/home/kiosk/Desktop/chromedriver')

    def process_request(self, request, spider):
        self.driver.get(request.url)
        time.sleep(1)
        # keep clicking the "show more" button (its class name is assumed
        # to be 'show-more') until it no longer exists on the page
        while True:
            try:
                show_more = self.driver.find_element_by_class_name('show-more')
                show_more.click()
                time.sleep(0.3)
            except NoSuchElementException:
                break
source = self.driver.page_source
response = HtmlResponse(url=self.driver.current_url,body=source,request=request,encoding='utf-8')
return response
import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi
class JianshuPipeline(object):
def __init__(self):
dbparms = {
'host': '172.25.254.46',
'port': 3306,
'user': 'cooffee',
'password': 'cooffee',
'database': 'jianshu',
'charset': 'utf8'
}
self.conn = pymysql.connect(**dbparms)
self.cursor = self.conn.cursor()
self._sql = None
def process_item(self, item, spider):
self.cursor.execute(self.sql,(item['title'],item['content'],item['author'],item['avatar'],item['pub_time'],item['origin_url'],item['article_id']))
self.conn.commit()
return item
    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
                insert into article(title,content,author,avatar,pub_time,origin_url,article_id) values(%s,%s,%s,%s,%s,%s,%s)
            '''
        return self._sql
## Asynchronous storage with ConnectionPool
class JianshuTwistedPipeline(object):
def __init__(self):
dbparms = {
'host': '172.25.254.46',
'port': 3306,
'user': 'cooffee',
'password': 'cooffee',
'database': 'jianshu',
'charset': 'utf8',
'cursorclass': cursors.DictCursor
}
self.dbpool = adbapi.ConnectionPool('pymysql',**dbparms)
self._sql = None
    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
                insert into article(title,content,author,avatar,pub_time,origin_url,article_id) values(%s,%s,%s,%s,%s,%s,%s)
            '''
        return self._sql
    def process_item(self, item, spider):
        defer = self.dbpool.runInteraction(self.insert_item, item)
        defer.addErrback(self.handle_error, item, spider)
        return item
def insert_item(self,cursor,item):
cursor.execute(self.sql, (item['title'], item['content'], item['author'], item['avatar'], item['pub_time'], item['origin_url'],item['article_id']))
def handle_error(self,error,item,spider):
print('='*10+"error"+'='*10)
print(error)
print('='*10+'error'+'='*10)
# -*- coding: utf-8 -*-
# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'jianshu'
SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en',
'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Mobile Safari/537.36',
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'jianshu.middlewares.JianshuSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
# 'jianshu.middlewares.JianshuDownloaderMiddleware': 543,
'jianshu.middlewares.SeleniumDownloadMiddleware':200,
}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
# 'jianshu.pipelines.JianshuPipeline': 300,
'jianshu.pipelines.JianshuTwistedPipeline':300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
from scrapy import cmdline
cmdline.execute("scrapy crawl jianshuspider".split())