### CrawlSpider

Creating a CrawlSpider:

Previously we created spiders with `scrapy genspider [spider_name] [domain]`. To create a CrawlSpider, use the `crawl` template instead:

`scrapy genspider -t crawl [spider_name] [domain]`
LinkExtractors (link extractors):

With a LinkExtractor you don't have to extract the URLs you want and send the requests yourself. The LinkExtractor finds every URL on the crawled pages that satisfies your rules, so they are followed automatically. A brief introduction to the LinkExtractor class:
```python
class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a', 'area'),
    attrs = ('href',),   # note the trailing comma: a one-element tuple, not the string 'href'
    canonicalize = True,
    unique = True,
    process_value = None
)
```
Key parameters:

- allow: allowed URLs. Every URL matching this regular expression is extracted.
- deny: denied URLs. Any URL matching this regular expression is skipped.
- allow_domains: allowed domains. Only URLs under the domains listed here are extracted.
- deny_domains: denied domains. URLs under the domains listed here are never extracted.
- restrict_xpaths: restricting XPaths. Works together with allow to filter links; only links inside the matching regions are considered.
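Before wiring a pattern into a LinkExtractor, it helps to check what the `allow` regex actually matches. The sketch below uses only the standard library `re` module with the two patterns from the spider later in this post (the sample URLs are made up for illustration):

```python
import re

# The two allow patterns used by the spider below.
page_pattern = re.compile(r'.+mod=list&catid=2&page=\d')
article_pattern = re.compile(r'.+article-.+\.html')

# Hypothetical sample URLs, for illustration only.
urls = [
    "http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=2",
    "http://www.wxapp-union.com/article-4029-1.html",
    "http://www.wxapp-union.com/forum.php",
]

# A LinkExtractor keeps only the URLs its allow regex matches.
pages = [u for u in urls if page_pattern.match(u)]
articles = [u for u in urls if article_pattern.match(u)]
print(pages)     # only the pagination URL
print(articles)  # only the article URL
```

The third URL matches neither pattern, so a LinkExtractor with either `allow` would ignore it entirely.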
The Rule class:

Rules define how the spider crawls. A brief introduction to the class:

```python
class scrapy.spiders.Rule(
    link_extractor,
    callback = None,
    cb_kwargs = None,
    follow = None,
    process_links = None,
    process_request = None
)
```
Key parameters:

- link_extractor: a LinkExtractor object that defines which links to extract.
- callback: the callback to run for URLs matching this rule. Because CrawlSpider uses parse internally as its own callback, never override parse or use it as your callback.
- follow: whether links extracted from the response by this rule should be followed further.
- process_links: called with the links obtained from link_extractor; use it to filter out links you don't want to crawl.
1. How to set the allow rule: make the regex tight enough to match only the URLs you want, without also matching unrelated URLs.
2. When to use follow: set follow=True when the matched pages contain more URLs of the same kind that should keep being extracted (e.g. pagination); otherwise set it to False.
3. When to use callback: specify a callback when you need to parse detailed data out of the matched page; otherwise leave it unset.
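The follow/callback guidelines above can be sketched as plain logic. This is a hypothetical model of the decision the rules encode, not Scrapy's actual implementation; the patterns mirror the spider's rules below:

```python
import re

# Hypothetical rules modeled as (pattern, callback_name, follow) tuples,
# mirroring the two rules in the spider below.
RULES = [
    # Pagination pages: keep following, nothing to parse.
    (re.compile(r'.+mod=list&catid=2&page=\d'), None, True),
    # Article detail pages: parse them, no need to follow further.
    (re.compile(r'.+article-.+\.html'), 'parse_detail', False),
]

def classify(url):
    """Return (callback_name, follow) for the first rule matching url."""
    for pattern, callback, follow in RULES:
        if pattern.match(url):
            return callback, follow
    return None, False

print(classify("http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=3"))
print(classify("http://www.wxapp-union.com/article-4029-1.html"))
```

A list page gets `(None, True)` (follow, don't parse); an article page gets `('parse_detail', False)` (parse, don't follow).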
Now the code.

wxapp_spider.py:

```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem


class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # Pagination links: follow them, nothing to parse.
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # Article detail pages: parse them, don't follow further.
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback='parse_detail', follow=False),
    )

    def parse_detail(self, response):
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        article_content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(article_content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
```
items.py:

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()
```
pipelines.py:

```python
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # JsonLinesItemExporter needs a file opened in binary mode.
        self.fp = open("wxjc.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
```
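JsonLinesItemExporter writes one JSON object per line, which is why the file is opened in binary mode and grows safely item by item. A minimal sketch of the same output format using only the standard library (the sample items are hypothetical):

```python
import json

# Hypothetical scraped items, stand-ins for WxappItem instances.
items = [
    {"title": "Demo article", "author": "someone", "pub_time": "2019-1-1", "content": "..."},
    {"title": "Another one", "author": "somebody", "pub_time": "2019-1-2", "content": "..."},
]

# One JSON object per line, like JsonLinesItemExporter produces.
# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes.
lines = [json.dumps(item, ensure_ascii=False) for item in items]
output = "\n".join(lines)
print(output)
```

Because each line is an independent JSON object, the file can be appended to incrementally and parsed line by line, unlike a single large JSON array.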
settings.py:

```python
# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wxapp (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1

# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'wxapp.pipelines.WxappPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
```
start.py:

```python
from scrapy import cmdline

cmdline.execute("scrapy crawl wxapp_spider".split())
```
Below is a screenshot of the saved data.

I have to say, WeChat mini programs look pretty friendly. I'd like to try building one next.