Scrapy Downloader Middlewares and Settings

Downloader Middlewares

Downloader middlewares sit between the engine and the downloader. In them we can set proxies, swap request headers, and so on to get around anti-scraping measures. To write a downloader middleware, implement up to two methods in the middleware class: process_request(self, request, spider), which runs before a request is sent, and process_response(self, request, response, spider), which runs before the downloaded data reaches the engine.

process_request(self,request,spider):

This method runs before the downloader sends the request. It is the usual place to set things like a random proxy IP.

  1. Parameters:
    request: the Request object being sent. spider: the spider that issued the request.
  2. Return values:
    Return None: Scrapy keeps processing this request, running the corresponding methods of the other middlewares until a suitable downloader handler is called.
    Return a Response object: Scrapy will not call any other middleware's process_request and returns this response directly; the process_response() methods of the activated middlewares are still called for every response.
    Return a Request object: the original request is dropped and the data is fetched with the newly returned request instead.
    If this method raises an exception, process_exception is called.

process_response(self,request,response,spider):

This method runs on the way back from the downloader to the engine, after the data has been downloaded. A minimal skeleton showing both hooks is sketched after the list below.

  1. Parameters:
    request: the Request object. response: the Response object being processed. spider: the spider object.
  2. Return values:
    Return a Response object: the new response is passed on to the other middlewares and finally to the spider.
    Return a Request object: the downloader chain is cut short and the returned request is rescheduled for download.
    If an exception is raised, the request's errback is called; if no errback is defined, the exception propagates.
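
Putting the two hooks together, a minimal downloader middleware skeleton looks like the sketch below (the class name and method bodies are illustrative only):

class ExampleDownloaderMiddleware(object):
    # illustrative skeleton; the hook names and signatures are the Scrapy API described above

    def process_request(self, request, spider):
        # runs before the request is sent: return None to let processing continue,
        # a Response to short-circuit the download, or a Request to replace this one
        return None

    def process_response(self, request, response, spider):
        # runs before the downloaded response reaches the engine: return the response
        # to pass it on, or a Request to have it downloaded again
        return response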

Random request header middleware:

Randomly swapping the request headers can be done in a downloader middleware: before each request goes out to the server, pick a User-Agent at random, so the crawler does not always use the same one. (How to register the middleware in settings is shown after the class.)
User-Agent list: http://www.useragentstring.com/pages/useragentstring.php?typ=Browser

import random

class UserAgentDownloadMiddleware(object):
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
        "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
        "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
        "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
        "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
        "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
    ]

    def process_request(self,request,spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent']=user_agent
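
For the middleware to take effect it has to be registered in DOWNLOADER_MIDDLEWARES in settings.py. A minimal sketch, assuming the class lives in your project's middlewares.py (replace yourproject with the real package name):

DOWNLOADER_MIDDLEWARES = {
    'yourproject.middlewares.UserAgentDownloadMiddleware': 100,  # lower value = runs earlier in process_request
}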

IP proxy pool middleware

Zhima proxy: http://http.zhimaruanjian.com/
Taiyang proxy: http://http.taiyangruanjian.com/
Kuaidaili: http://www.kuaidaili.com/
Xdaili: http://www.xdaili.cn/
Mayidaili: http://www.mayidaili.com/

  1. Open (shared) proxy pool:
import random

class IPProxyDownloadMiddleware(object):
    PROXIES = [
        "ip:port",
        "ip:port",
        "ip:port",
    ]

    def process_request(self, request, spider):
        proxy = random.choice(self.PROXIES)
        print('Selected proxy: %s' % proxy)
        request.meta['proxy'] = "http://" + proxy
  2. Dedicated (authenticated) proxy:
import base64

class IPProxyDownloadMiddleware(object):
    def process_request(self, request, spider):
        proxy = 'ip:port'
        user_password = "xxxx:xxxx"
        request.meta['proxy'] = proxy
        # base64.b64encode works on bytes, so encode first, then decode back to str
        b64_user_password = base64.b64encode(user_password.encode('utf-8'))
        request.headers['Proxy-Authorization'] = 'Basic ' + b64_user_password.decode('utf-8')
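
With either variant in place, a quick way to verify that requests really go through the proxy is a throwaway spider that hits an IP-echo endpoint. This is only a sketch; httpbin.org/ip is an assumed test URL, not part of the project:

import scrapy

class ProxyCheckSpider(scrapy.Spider):
    # hypothetical spider used only to check the proxy middleware
    name = 'proxy_check'
    start_urls = ['http://httpbin.org/ip']  # echoes the caller's IP as JSON

    def parse(self, response):
        # if the middleware works, the origin shown here is the proxy's IP, not yours
        self.logger.info('Response body: %s', response.text)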

Settings reference

  1. BOT_NAME: the project name.
  2. ROBOTSTXT_OBEY: whether to obey the site's robots.txt. Scrapy's built-in default is False, but newly generated projects set it to True.
  3. CONCURRENT_ITEMS: maximum number of items processed concurrently in the pipelines. Default 100.
  4. CONCURRENT_REQUESTS: maximum number of concurrent requests performed by the downloader. Default 16.
  5. DEFAULT_REQUEST_HEADERS: default request headers. Put headers that rarely change here.
  6. DEPTH_LIMIT: maximum crawl depth allowed. Default 0, which means no limit.
  7. DOWNLOAD_DELAY: how long the downloader waits before downloading the next page from the same site. Used to throttle the crawl and reduce load on the server; decimal values are supported.
  8. DOWNLOAD_TIMEOUT: downloader timeout.
  9. ITEM_PIPELINES: the pipelines used to process items, as a dict whose keys are the import paths of the pipeline classes and whose values are integers giving the order; the lower the value, the earlier the pipeline runs.
  10. LOG_ENABLED: whether logging is enabled. Default True.
  11. LOG_ENCODING: encoding used for the log.
  12. LOG_LEVEL: log level. Default DEBUG. Available levels: CRITICAL, ERROR, WARNING, INFO, DEBUG.
  13. USER_AGENT: the default User-Agent used for requests. Default: Scrapy/VERSION (+http://scrapy.org).
  14. PROXIES: proxy list (a custom setting read by your own proxy middleware rather than a built-in Scrapy setting).
  15. COOKIES_ENABLED: whether cookies are enabled. Usually leave it off so the crawler is harder to track; enable it only when needed.

Several of these options are combined in the short excerpt sketched below.
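A minimal sketch (values are illustrative, not recommendations; the BOS paths refer to the example project later in this post):

# settings.py (excerpt)
BOT_NAME = 'BOS'

ROBOTSTXT_OBEY = False          # do not fetch or obey robots.txt
CONCURRENT_REQUESTS = 16        # downloader concurrency
DOWNLOAD_DELAY = 2              # seconds to wait between requests to the same site
DOWNLOAD_TIMEOUT = 30           # give up on a request after 30 seconds
DEPTH_LIMIT = 0                 # 0 = no depth limit

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

ITEM_PIPELINES = {
    'BOS.pipelines.BosPipeline': 300,   # lower value = runs earlier
}

LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False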

Crawling job listings from BOSS Zhipin

Bosspider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from BOS.items import BosItem
class BosspiderSpider(CrawlSpider):
    name = 'Bosspider'
    allowed_domains = ['zhipin.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python&city=100010000&industry=&position=']

    rules = (
        Rule(LinkExtractor(allow=r'.+\?query=python&page=\d+.+'), follow=True),
        Rule(LinkExtractor(allow=r'.+job_detail.+html.+'), callback='parse_job', follow=True),
    )

    def parse_job(self, response):
        info_primary = response.xpath("//div[@class='info-primary']")
        name = info_primary.xpath(".//div[@class='name']/h1/text()").get()
        salary = info_primary.xpath(".//span[@class='salary']/text()").get()
        job_info = info_primary.xpath(".//p/text()").getall()
        city = job_info[0].strip()
        work_years = job_info[1].strip()
        education = job_info[2].strip()
        company = response.xpath("//div[@class='company-info']/div[@class='info']/text()").get().strip()
        yield BosItem(name=name,salary=salary,city=city,
                      work_years=work_years,education=education,company=company)

items.py

import scrapy
class BosItem(scrapy.Item):
    name = scrapy.Field()
    salary = scrapy.Field()
    city = scrapy.Field()
    work_years = scrapy.Field()
    education = scrapy.Field()
    company = scrapy.Field()

middlewares.py

import json
import random

import requests
from scrapy import signals
from twisted.internet.defer import DeferredLock

from BOS.models import ProxyModel
class UserAgentDownloadMiddleware(object):
    USER_AGENTS=[
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
        "Opera/9.80 (X11; Linux i686; Ubuntu/14.10) Presto/2.12.388 Version/12.16",
        "Opera/9.80 (Windows NT 6.0) Presto/2.12.388 Version/12.14",
        "Mozilla/5.0 (Windows NT 6.0; rv:2.0) Gecko/20100101 Firefox/4.0 Opera 12.14",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0) Opera 12.14",
        "Opera/12.80 (Windows NT 5.1; U; en) Presto/2.10.289 Version/12.02",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1",
        "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0",
        "Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/7046A194A",
        "Mozilla/5.0 (iPad; CPU OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5355d Safari/8536.25",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/537.13+ (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10",
        "Mozilla/5.0 (iPad; CPU OS 5_1 like Mac OS X) AppleWebKit/534.46 (KHTML, like Gecko ) Version/5.1 Mobile/9B176 Safari/7534.48.3",
    ]

    def process_request(self,request,spider):
        user_agent = random.choice(self.USER_AGENTS)
        request.headers['User-Agent']=user_agent

class IPProxyDownloadMiddleware(object):
    PROXY_URL = ""

    def __init__(self):
        super(IPProxyDownloadMiddleware,self).__init__()
        self.current_proxy = None
        self.lock = DeferredLock()

    def process_request(self,request,spider):
        if 'proxy' not in request.meta or self.current_proxy.is_expiring:
            self.update_proxy()
        # meta['proxy'] expects the proxy URL string, not the ProxyModel object
        request.meta['proxy'] = self.current_proxy.proxy

    def process_response(self,request,response,spider):
        if response.status != 200 or "captcha" in response.url:
            if not self.current_proxy.blacked:
                self.current_proxy.blacked = True
            print("Proxy %s has been blacklisted" % self.current_proxy.ip)
            self.update_proxy()
            # re-schedule the request so it is downloaded again with the new proxy
            return request
        return response

    def update_proxy(self):
        self.lock.acquire()
        try:
            if not self.current_proxy or self.current_proxy.is_expiring or self.current_proxy.blacked:
                response = requests.get(self.PROXY_URL)
                text = response.text
                print('Fetched a new proxy:', text)
                result = json.loads(text)
                if len(result['data']) > 0:
                    data = result['data'][0]
                    proxy_model = ProxyModel(data)
                    self.current_proxy = proxy_model
        finally:
            # release the lock so later acquire() calls can fire
            self.lock.release()

models.py

from datetime import datetime,timedelta

class ProxyModel(object):

    def __init__(self,data):
        self.ip = data['ip']
        self.port = data['port']
        self.expire_str = data['expire_time']
        self.blacked = False

        # expire_time arrives as a string like "YYYY-MM-DD HH:MM:SS"
        date_str, time_str = self.expire_str.split(" ")
        year, month, day = date_str.split("-")
        hour, minute, second = time_str.split(":")
        self.expire_time = datetime(year=int(year),month=int(month),day=int(day),
                                    hour=int(hour),minute=int(minute),second=int(second))
        self.proxy = "https://{}:{}".format(self.ip,self.port)

    @property
    def is_expiring(self):
        now = datetime.now()
        if (self.expire_time-now) < timedelta(seconds=5):
            return True
        else:
            return False
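
A rough usage sketch (the keys below mirror what ProxyModel reads; the values are made up, not real API output):

# hypothetical sample record in the shape ProxyModel expects
sample = {
    'ip': '1.2.3.4',
    'port': '8888',
    'expire_time': '2019-01-01 12:00:00',
}
proxy = ProxyModel(sample)
print(proxy.proxy)        # -> https://1.2.3.4:8888
print(proxy.is_expiring)  # True once fewer than 5 seconds of validity remain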

pipelines.py

from scrapy.exporters import JsonLinesItemExporter


class BosPipeline(object):
    def __init__(self):
        self.fp = open('jobs.json','wb')
        self.exporter = JsonLinesItemExporter(self.fp,ensure_ascii=False)

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self,spider):
        self.fp.close()

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for BOS project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BOS'

SPIDER_MODULES = ['BOS.spiders']
NEWSPIDER_MODULE = 'BOS.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'BOS (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                 '(KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'BOS.middlewares.BosSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'BOS.middlewares.BosDownloaderMiddleware': 543,
    'BOS.middlewares.UserAgentDownloadMiddleware':100,
    'BOS.middlewares.IPProxyDownloadMiddleware':200,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'BOS.pipelines.BosPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py

from scrapy import cmdline
cmdline.execute("scrapy crawl Bosspider".split())

Selenium + Scrapy: crawling Ajax-loaded data from Jianshu

jianshuspider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from jianshu.items import JianshuItem


class JianshuspiderSpider(CrawlSpider):
    name = 'jianshuspider'
    allowed_domains = ['jianshu.com']
    start_urls = ['http://jianshu.com/']

    rules = (
        Rule(LinkExtractor(allow=r'.*/p/[a-z0-9]{12}.*'), callback='parse_detail', follow=True),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='title']/text()").get()
        avatar = response.xpath("//a[@class='avatar']/img/@src").get()
        author = response.xpath("//span[@class='name']/a/text()").get()
        pub_time = response.xpath("//span[@class='publish-time']/text()").get().replace("*", "")
        url = response.url
        url1 = url.split("?")[0]
        article_id = url1.split('/')[-1]
        content = response.xpath("//div[@class='show-content']").get()

        word_count = response.xpath("//span[@class='wordage']/text()").get()
        comment_count = response.xpath("//span[@class='comments-count']/text()").get()
        read_count = response.xpath("//span[@class='views-count']/text()").get()
        like_count = response.xpath("//span[@class='likes-count']/text()").get()
        subjects = ",".join(response.xpath("//div[@class='include-collection']/a/div/text()").getall())

        item = JianshuItem(
            title=title,
            avatar=avatar,
            author=author,
            pub_time=pub_time,
            origin_url=response.url,
            article_id=article_id,
            content=content,
            subjects=subjects,
            word_count=word_count,
            comment_count=comment_count,
            read_count=read_count,
            like_count=like_count
        )
        yield item

items.py

import scrapy


class JianshuItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    article_id = scrapy.Field()
    origin_url = scrapy.Field()
    author = scrapy.Field()
    avatar = scrapy.Field()
    pub_time = scrapy.Field()
    read_count = scrapy.Field()
    like_count = scrapy.Field()
    word_count = scrapy.Field()
    comment_count = scrapy.Field()
    subjects = scrapy.Field()

middlewares.py
If you also want random proxy IPs here, refer to Selenium's own way of adding a proxy to the driver.

import time
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
class SeleniumDownloadMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=r'/home/kiosk/Desktop/chromedriver')

    def process_request(self,request,spider):
        self.driver.get(request.url)
        time.sleep(1)
        try:
            while True:
                # keep clicking the "show more" button so the Ajax-loaded
                # content ends up in the page source
                showMore = self.driver.find_element_by_class_name('show-more')
                showMore.click()
                time.sleep(0.3)
                if not showMore:
                    break
        except:
            # the button is gone (NoSuchElementException): everything has been loaded
            pass
        source = self.driver.page_source
        # hand the rendered page back to Scrapy instead of letting the downloader fetch it again
        response = HtmlResponse(url=self.driver.current_url,body=source,request=request,encoding='utf-8')
        return response
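
One thing the middleware above leaves out is shutting the browser down. A minimal sketch of how that could be added, using Scrapy's from_crawler hook and the spider_closed signal (the rest of the class stays as above):

from scrapy import signals

class SeleniumDownloadMiddleware(object):
    # ... __init__ and process_request as above ...

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # quit the Chrome instance once the spider finishes
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def spider_closed(self, spider):
        self.driver.quit()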

pipelines.py

import pymysql
from pymysql import cursors
from twisted.enterprise import adbapi
class JianshuPipeline(object):
    def __init__(self):
        dbparms = {
            'host': '172.25.254.46',
            'port': 3306,
            'user': 'cooffee',
            'password': 'cooffee',
            'database': 'jianshu',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparms)
        self.cursor = self.conn.cursor()
        self._sql = None

    def process_item(self, item, spider):
        self.cursor.execute(self.sql,(item['title'],item['content'],item['author'],item['avatar'],item['pub_time'],item['origin_url'],item['article_id']))
        self.conn.commit()
        return item

    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
            insert into article(title,content,author,avatar,pub_time,origin_url,article_id) values(%s,%s,%s,%s,%s,%s,%s)
            '''
        return self._sql

# Store items asynchronously using a Twisted adbapi ConnectionPool.
class JianshuTwistedPipeline(object):
    def __init__(self):
        dbparms = {
            'host': '172.25.254.46',
            'port': 3306,
            'user': 'cooffee',
            'password': 'cooffee',
            'database': 'jianshu',
            'charset': 'utf8',
            'cursorclass': cursors.DictCursor
        }
        self.dbpool = adbapi.ConnectionPool('pymysql',**dbparms)
        self._sql = None

    @property
    def sql(self):
        if not self._sql:
            self._sql = '''
                insert into article(title,content,author,avatar,pub_time,origin_url,article_id) values(%s,%s,%s,%s,%s,%s,%s)
                '''
        return self._sql

    def process_item(self,item,spider):
        defer = self.dbpool.runInteraction(self.insert_item,item)
        defer.addErrback(self.handle_error,item,spider)
        # return the item so any later pipelines still receive it
        return item

    def insert_item(self,cursor,item):
        cursor.execute(self.sql, (item['title'], item['content'], item['author'], item['avatar'], item['pub_time'], item['origin_url'],item['article_id']))

    def handle_error(self,error,item,spider):
        print('='*10+"error"+'='*10)
        print(error)
        print('='*10+'error'+'='*10)
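
For reference, the article table both pipelines insert into needs at least the seven columns named in the SQL above. A minimal sketch that creates it with pymysql, with assumed (illustrative) column types:

import pymysql

# assumed schema; adjust column types and lengths to your needs
CREATE_SQL = '''
CREATE TABLE IF NOT EXISTS article (
    id INT PRIMARY KEY AUTO_INCREMENT,
    title VARCHAR(255),
    content LONGTEXT,
    author VARCHAR(64),
    avatar VARCHAR(255),
    pub_time VARCHAR(64),
    origin_url VARCHAR(255),
    article_id VARCHAR(32)
) DEFAULT CHARSET=utf8
'''

conn = pymysql.connect(host='172.25.254.46', port=3306, user='cooffee',
                       password='cooffee', database='jianshu', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute(CREATE_SQL)
conn.commit()
conn.close()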

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) '
               'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Mobile Safari/537.36',
}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'jianshu.middlewares.JianshuSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   # 'jianshu.middlewares.JianshuDownloaderMiddleware': 543,
    'jianshu.middlewares.SeleniumDownloadMiddleware':200,
}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   # 'jianshu.pipelines.JianshuPipeline': 300,
    'jianshu.pipelines.JianshuTwistedPipeline':300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py

from scrapy import cmdline
cmdline.execute("scrapy crawl jianshuspider".split())