A first taste of Python web crawling, with a few lessons learned. Corrections are welcome.
Target URL: https://wenshu.court.gov.cn/
Content crawled: court judgment documents (裁判文書)
Crawling stack: the Scrapy framework + Selenium for simulated browser access
At first I tried to brute-force the page structure to extract the data directly. Ha, how naive; apparently I didn't know my own level.
I then settled on the pyspider framework and spent four or five days on it. With pyspider, following chains of links from a page works when you test each link manually in the debugger, but running the project via the external "run" button returns no data. I eventually found many blog posts noting that pyspider's official docs have not been updated in a long time, and that companies and real projects generally use Scrapy instead. The Scrapy architecture is shown in the figure below:
In the middle sits the Scrapy Engine. On the left, Item is the entity being scraped; the associated pipelines process each item after the `yield item` statement returns, via classes in pipelines.py (for example, storing it in MongoDB). At the bottom, Spiders issue the initial page requests, parse the responses, and feed follow-up URLs into the Scheduler's queue at the top; each cycle the Scheduler hands one URL to the Downloader on the right, which fetches it from the Internet.
There are also two middleware layers: Spider Middlewares and Downloader Middlewares. The former sees less use. A downloader middleware sits between the Scheduler and the Internet: before a request goes out it can attach extra parameters to the URL, and before the response is handed back it can render dynamic JS with Selenium, or simulate browser actions such as clicking through pages.
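To make the downloader-middleware hooks concrete, here is a tiny stand-in in plain Python (no Scrapy dependency; all class names and header values are illustrative, not the project's code) showing how `process_request` runs before the download and `process_response` on the way back to the spider:

```python
class FakeRequest:
    """Stand-in for scrapy.Request: just a URL plus mutable headers."""
    def __init__(self, url):
        self.url = url
        self.headers = {}

class HeaderMiddleware:
    def process_request(self, request):
        # runs before the download: attach extra parameters/headers
        request.headers.setdefault('User-Agent', 'Mozilla/5.0')

    def process_response(self, request, response):
        # runs after the download, before the spider sees the response
        return response

def download(request, middleware):
    middleware.process_request(request)
    response = {'url': request.url, 'status': 200}  # pretend the Downloader ran
    return middleware.process_response(request, response)

resp = download(FakeRequest('https://wenshu.court.gov.cn/'), HeaderMiddleware())
```

In real Scrapy the engine drives this chain for you; the point is only the ordering of the two hooks around the actual download.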
On the JS-rendering point: I used Selenium here, but Splash is another option. Splash requires writing Lua scripts in the spider code, which is less convenient than Selenium for page access, pagination, and clicks. On the other hand, Selenium blocks while a page renders, which adds real time cost to a large crawl, whereas Splash supports distributed rendering.
I went with Selenium first. Because the site's servers respond slowly, I had to tune the explicit waits for individual page elements to appear, as well as the overall page-load timeout for simulated visits. This took a lot of parameter tweaking and unit testing, but in the end the page-access success rate was essentially 100%.
Out of curiosity I also tried Splash on this project, but the pages never rendered. From what I could find, the site may restrict certain JS functions that Splash relies on. (Discussion of the scrapy-splash rendering failure: https://mp.csdn.net/console/editor/html/104331571)
The code crawls the first two pages. The default page size is 5 documents; via Selenium I set it to the maximum of 15, which avoids a lot of list-page navigation, loading, and parsing. With 15 detail links per page plus the 2 list pages, that is 32 responses in total, all with status code "200" (success). The execution results are shown below:
The database then shows 30 inserted document entities; below is the updated collection in MongoDB:
Pitfalls (bugs):
1. When testing the detail-page parse function: xpath() and css() return a list-like object. Calling extract() on it also returns a "list", but on that result you can no longer chain further xpath() or css() calls;
2. With Selenium, you must set explicit waits for specific elements to appear plus an overall page-load timeout, and test which durations and which elements to wait on;
3. Splitting the document body into its parts: parties, the court's reasoning, and the verdict. By counting `div` elements and special-casing a few layouts, the cleaning handles the different `div` counts, as well as documents that are not authorized for display.
4. Initially, one downloader middleware had to serve both the document-list pages and the detail pages. Solution: set a sentinel in the request's meta field to distinguish the two kinds of URL.
5. The URL is identical no matter which page is requested: in practice, clicking through pages only changes the content, never the address. So the spider must set dont_filter=True on these requests so the duplicate filter does not drop them.
6. A consequence of 5: if you simply simulate clicking "next page", page two still has the original URL, and clicking "next" again takes you back to page one. So you have to click the specific page-number button instead. But at most 6 page buttons are shown at first, and later buttons only appear after clicking page 6, so I wrote a function that derives the sequence of buttons to click from this pattern.
7. Since 6 means reaching a high page number takes many simulated clicks, I first set the per-page document count to the maximum of 15 (three times the default), which greatly reduces the number of page turns and speeds up the crawl.
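The button-sequence logic from pitfalls 6 and 7 can be sketched as a pure function. This is my reconstruction, not the project's actual helper: it assumes 6 page buttons are visible at first and that clicking the highest visible number slides the button window forward; the exact window arithmetic on the real site may differ.

```python
def click_sequence(target, visible=6):
    """Page numbers to click, in order, to reach `target` when only
    `visible` buttons show at first and clicking the highest visible
    button reveals the next window (hypothetical site behavior)."""
    seq = []
    highest = visible              # pages 1..visible are clickable at the start
    while highest < target:
        seq.append(highest)        # click the last visible page number
        highest += visible - 1     # assume the window advances this far
    seq.append(target)             # the target button is now visible
    return seq
```

Under these assumptions, page 2 needs a single click `[2]`, while page 8 needs `[6, 8]`.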
Enough talk. On to the code:
spiders:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request
from wenshu.items import WenshuItem
from scrapy_splash import SplashRequest  # used by the commented Splash variant below
script_one = """
function main(splash, args)
    splash.html5_media_enabled = true
    splash.plugins_enabled = true
    splash.response_body_enabled = true
    splash.request_body_enabled = true
    splash.js_enabled = true
    splash.resource_timeout = 30
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {
        html = splash:html(),
        har = splash:har(),
        png = splash:png()
    }
end
"""
script = """
function main(splash, args)
    splash.resource_timeout = 40
    splash.images_enabled = true
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    -- click through to the requested page (no-op on page 1)
    local js = string.format(
        "(function(page) { if (page > 1) { document.querySelector('a.pageButton:nth-child(8)').click(); } })(%s);",
        args.page)
    splash:evaljs(js)
    assert(splash:wait(args.wait))
    return splash:html()
end
"""
detail = '''
function main(splash, args)
    splash.resource_timeout = 20
    splash.images_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return splash:html()
end
'''
class WsSpider(scrapy.Spider):
    name = 'ws'
    allowed_domains = ['wenshu.court.gov.cn']
    base_url = 'https://wenshu.court.gov.cn/website/wenshu'
    start_urls = 'https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?s38=300&fymc=%E6%B2%B3%E5%8C%97%E7%9C%81%E9%AB%98%E7%BA%A7%E4%BA%BA%E6%B0%91%E6%B3%95%E9%99%A2'

    def start_requests(self):
        # The list URL is identical for every page (pitfall 5), so
        # dont_filter=True keeps the dupefilter from dropping repeats;
        # meta carries the page number and a tag sentinel (pitfall 4).
        for page in range(1, self.settings.get('MAX_PAGE') + 1):
            self.logger.debug(str(page))
            yield Request(url=self.start_urls, callback=self.parse_origin,
                          meta={'page': page, 'tag': 0}, dont_filter=True)
            # Splash variant (did not render on this site):
            # yield SplashRequest(url=self.start_urls, callback=self.parse_origin,
            #                     endpoint='execute',
            #                     args={'lua_source': script_one, 'wait': 5, 'page': page})

    def parse_origin(self, response):
        self.logger.debug(str(response.status))
        urls = response.xpath('//a[@class="caseName"]/@href').extract()
        for url in urls:
            # hrefs are relative ("./..."), so graft them onto base_url
            target_url = self.base_url + url[2:]
            self.logger.debug('target_url: ' + target_url)
            yield Request(url=target_url, callback=self.parse_detail,
                          meta={'tag': 1}, dont_filter=False)

    # def parse_detail(self, response): ...
items:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
from scrapy import Item, Field

class WenshuItem(Item):
    collection = 'wenshu'     # MongoDB collection name
    title = Field()           # title
    release = Field()         # publication time
    views = Field()           # view count
    court = Field()           # trial court
    type = Field()            # document type
    prelude = Field()         # opening section
    parties = Field()         # parties
    justification = Field()   # reasoning
    end = Field()             # verdict
    chief = Field()           # presiding judge
    judge = Field()           # judge
    time = Field()            # judgment date
    assistant = Field()       # judge's assistant
    clerk = Field()           # court clerk
pipelines:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from wenshu.items import WenshuItem

class WsMongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, WenshuItem):
            # insert_one is the current (non-deprecated) pymongo call
            self.db[item.collection].insert_one(dict(item))
        return item

class WenshuPipeline(object):
    def process_item(self, item, spider):
        return item
middlewares:
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
import time
from scrapy import signals
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
from scrapy.http import HtmlResponse
from logging import getLogger

class WenshuSeleniumMiddleware(object):
    def __init__(self, timeout=None, service_args=[]):
        self.logger = getLogger(__name__)
        self.timeout = timeout
        self.browser = webdriver.Chrome()
        self.browser.set_window_size(1200, 600)
        self.browser.set_page_load_timeout(self.timeout)
        self.wait = WebDriverWait(self.browser, self.timeout)

    @classmethod
    def from_crawler(cls, crawler):
        # read the timeout from settings so __init__ never receives None
        return cls(timeout=crawler.settings.get('SELENIUM_TIMEOUT'))

    def __del__(self):
        self.browser.quit()

    # def process_request(self, request, spider): ...
class WenshuSpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
        # Should return either None or an iterable of Request, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn't have a response associated.
        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class WenshuDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
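The Selenium `process_request` itself is left out of the listing above; below is a hedged sketch of its core rendering step, factored as a plain function so it can be exercised with any driver-like object. The `pageButton` XPath and the list-vs-detail convention are assumptions based on the post, not the project's actual code.

```python
def render_page(browser, url, page=1, is_detail=False):
    """Load `url` in a Selenium-style browser; on list pages beyond the
    first, click the numbered page button before grabbing the HTML.
    The XPath below is a hypothetical selector, not taken from the site."""
    browser.get(url)
    if not is_detail and page > 1:
        button = browser.find_element_by_xpath(
            '//a[@class="pageButton" and text()="%d"]' % page)
        button.click()
    return browser.page_source
```

In the real middleware this would run inside `process_request`, with the element waits from pitfall 2, and the returned HTML wrapped in an `HtmlResponse(url=request.url, body=..., encoding='utf-8', request=request, status=200)` so Scrapy skips its own download.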
settings:
BOT_NAME = 'wenshu'
SPIDER_MODULES = ['wenshu.spiders']
NEWSPIDER_MODULE = 'wenshu.spiders'
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'wenshu.pipelines.WsMongoPipeline': 300,
    # 'wenshu.pipelines.MongoPipeline': 302,
}

DOWNLOADER_MIDDLEWARES = {
    'wenshu.middlewares.WenshuSeleniumMiddleware': 543,
    # 'scrapy_splash.SplashCookiesMiddleware': 723,
    # 'scrapy_splash.SplashMiddleware': 725,
    # 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

MAX_PAGE = 2
# SPLASH_URL = 'http://127.0.0.1:8050'
SELENIUM_TIMEOUT = 60
PHANTOMJS_SERVICE_ARGS = ['--load-images=false', '--disk-cache=true']
# DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
MONGO_URI = 'localhost'
MONGO_DB = 'test'