Learning scrapy-splash

What you'll need

docker

scrapy


Pages loaded by JavaScript come up all the time, and scraping them with Scrapy alone is fairly painful. Splash renders such pages and returns the result, and it integrates with Scrapy. Both Splash and Scrapy support asynchronous processing. With Selenium, by contrast, each page is rendered and downloaded inside a Downloader Middleware, so the whole step is blocking: Scrapy has to wait for it to finish before it can schedule other requests, which hurts crawl throughput. Crawling through Splash is therefore much faster than crawling through Selenium.
First install Docker, then pull the image:

docker pull scrapinghub/splash

Start Splash:

docker run -p 8050:8050 scrapinghub/splash

Then check that the service is reachable:

curl http://localhost:8050


If the firewall and similar obstacles have already been dealt with, the Splash instance can also be reached from remote machines.
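You can also exercise the render.html endpoint of the Splash HTTP API directly. A minimal sanity check with requests, assuming Splash is listening on localhost:8050 (substitute your Docker host's address when testing remotely):

import requests

# ask Splash to load the page, run its JavaScript, and return the HTML
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'https://example.com', 'wait': 0.5})
print(resp.status_code)   # 200 on a successful render
print(resp.text[:200])    # start of the rendered HTML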


Next, configure Scrapy. Install the plugin with pip install scrapy-splash, then add the following to settings.py:

# Splash server address and the Splash-aware dedupe filter
SPLASH_URL = 'http://192.168.99.100:8050'  # point this at your Splash instance
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# enable the downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,  # 725, not 723: it must run after SplashCookiesMiddleware
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# enable the spider middleware
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
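If the project also uses Scrapy's HTTP cache, the scrapy-splash README recommends its Splash-aware cache storage as well:

# only needed when Scrapy's HTTP cache is enabled
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'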

Then yield Splash requests in your spider:

from scrapy_splash import SplashRequest
...
...
yield SplashRequest(url, callback=self.parse_result,
    args={
        # optional; parameters passed to Splash HTTP API
        'wait': 0.5,
        # 'url' is prefilled from request url
        # 'http_method' is set to 'POST' for POST requests
        # 'body' is set to request body for POST requests
    },
    endpoint='render.json', # optional; default is render.html
    splash_url='<url>',     # optional; overrides SPLASH_URL
)
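For context, here is a minimal sketch of a complete spider built around such a request; the spider name and start URL are placeholders. With the default render.html endpoint, the callback receives the rendered page as an ordinary response:

import scrapy
from scrapy_splash import SplashRequest

class JsPageSpider(scrapy.Spider):
    # hypothetical name and URL, purely for illustration
    name = 'jspage'

    def start_requests(self):
        yield SplashRequest('https://example.com', callback=self.parse_result,
                            args={'wait': 0.5})

    def parse_result(self, response):
        # response.text is the Splash-rendered HTML, so nodes created by
        # JavaScript are visible to ordinary Scrapy selectors
        yield {'title': response.css('title::text').get()}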

Alternatively, we can yield a plain Request object and configure Splash through its meta attribute; SplashRequest is just a convenience wrapper that fills in this meta for us. Note that the snippet below needs import scrapy and import scrapy_splash (for SlotPolicy):
yield scrapy.Request(url, self.parse_result, meta={
    'splash': {
        'args': {
            # set rendering arguments here
            'html': 1,
            'png': 1,
            # 'url' is prefilled from request url
            # 'http_method' is set to 'POST' for POST requests
            # 'body' is set to request body for POST requests
        },
        # optional parameters
        'endpoint': 'render.json',  # optional; default is render.html
        'splash_url': '<url>',      # optional; overrides SPLASH_URL
        'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
        'splash_headers': {},       # optional; a dict with headers sent to Splash
        'dont_process_response': True, # optional, default is False
        'dont_send_headers': True,  # optional, default is False
        'magic_response': False,    # optional, default is True
    }
})
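When dont_process_response is left at its default (False), scrapy-splash wraps the render.json result in a SplashJsonResponse whose data attribute holds the decoded JSON. A sketch of a callback consuming the 'html': 1, 'png': 1 output above, with a hypothetical spider name:

import base64
import scrapy

class RenderJsonSpider(scrapy.Spider):
    # hypothetical spider, for illustration only
    name = 'render_json_demo'

    def parse_result(self, response):
        data = response.data                        # decoded render.json body
        screenshot = base64.b64decode(data['png'])  # the PNG comes base64-encoded
        with open('page.png', 'wb') as f:
            f.write(screenshot)
        yield {'html': data['html']}                # final HTML after JS ran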

When we need to perform actions on the page, a Lua script is required. With Lua, Splash can load pages and simulate clicks for pagination, much as Selenium does:

script = """
function main(splash, args)
  -- args.url, args.wait and args.page come from the SplashRequest below;
  -- 'url' is prefilled from the request URL
  splash.images_enabled = false
  assert(splash:go(args.url))
  assert(splash:wait(args.wait))
  -- type the target page number into the pager input, then click submit
  local js = string.format("document.querySelector('#mainsrp-pager div.form > input').value=%d;document.querySelector('#mainsrp-pager div.form > span.btn.J_Submit').click()", args.page)
  splash:evaljs(js)
  assert(splash:wait(args.wait))
  return splash:png()
end
"""
from urllib.parse import quote

from scrapy import Spider
from scrapy_splash import SplashRequest


class TaobaoSpider(Spider):
    name = 'taobao'
    allowed_domains = ['taobao.com']  # must also cover s.taobao.com
    base_url = 'https://s.taobao.com/search?q='

    def start_requests(self):
        for keyword in self.settings.get('KEYWORDS'):
            for page in range(1, self.settings.get('PAGE_NUM') + 1):
                url = self.base_url + quote(keyword)
                yield SplashRequest(url, callback=self.parse, endpoint='execute',
                                    args={'lua_source': script, 'page': page, 'wait': 3})

    def parse(self, response):
        # splash:png() returns a screenshot, so response.body holds raw PNG bytes
        pass
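The spider pulls KEYWORDS and PAGE_NUM from the project settings, so they must be defined in settings.py; the values below are placeholders:

# settings.py — placeholder values matching the spider above
KEYWORDS = ['羽毛球']
PAGE_NUM = 5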

For good measure, here is a Lua script that issues a POST request:

script = """
function main(splash, args)
  local treat = require("treat")
  local json = require("json")
  -- the content-type header tells the server the body is JSON
  local response = splash:http_post{args.url,
                                    body=json.encode({keywords="園林"}),
                                    headers={["content-type"]="application/json"}}
  splash:wait(10)
  return {
    html = treat.as_string(response.body),
    url = response.url,
    status = response.status
  }
end
"""