[Python crawler journey, day 20]: Crawling WeChat mini-program community tech posts with CrawlSpider

### CrawlSpider:

Creating a CrawlSpider spider:

So far we have created spiders with scrapy genspider [spider name] [domain]. To create a CrawlSpider instead, use the following command:

scrapy genspider -t crawl [spider name] [domain]
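For the project in this post, the concrete command would look something like this (assuming the Scrapy project is named wxapp, as in the settings shown below):

scrapy genspider -t crawl wxapp_spider wxapp-union.com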

The LinkExtractors link extractor:

With LinkExtractors you no longer have to pick out the URLs you want and send the requests yourself. That work is handed to the LinkExtractor, which finds every URL on the crawled pages that satisfies your rules, so the crawl follows them automatically. A short introduction to the LinkExtractor class:

class scrapy.linkextractors.LinkExtractor(
    allow = (),
    deny = (),
    allow_domains = (),
    deny_domains = (),
    deny_extensions = None,
    restrict_xpaths = (),
    tags = ('a','area'),
    attrs = ('href',),
    canonicalize = True,
    unique = True,
    process_value = None
)

Main parameters (a short usage sketch follows the list):

  • allow: allowed URLs. Every URL matching this regular expression will be extracted.
  • deny: denied URLs. Any URL matching this regular expression will not be extracted.
  • allow_domains: allowed domains. Only URLs on the domains listed here will be extracted.
  • deny_domains: denied domains. URLs on the domains listed here will never be extracted.
  • restrict_xpaths: restricting XPaths. Works together with allow to filter links (only links inside the matched page regions are considered).
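As a quick sketch (reusing the article regex and domain that appear later in this post), a LinkExtractor that keeps only article links could be built and used like this:

from scrapy.linkextractors import LinkExtractor

# keep only URLs matching the article pattern, restricted to wxapp-union.com
extractor = LinkExtractor(
    allow=r'.+article-.+\.html',
    allow_domains=('wxapp-union.com',),
)

# inside a spider callback you could then call:
# links = extractor.extract_links(response)   # returns a list of Link objects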

The Rule class:

Rule defines a crawling rule for the spider. A short introduction to the class:

class scrapy.spiders.Rule(
    link_extractor, 
    callback = None, 
    cb_kwargs = None, 
    follow = None, 
    process_links = None, 
    process_request = None
)

Main parameters (a minimal Rule sketch follows the list):

  • link_extractor: a LinkExtractor object that defines which links to extract.
  • callback: the callback to run for URLs that match this rule. Because CrawlSpider uses parse internally to implement its own logic, do not override parse or use it as your callback; define your own method and pass its name as a string.
  • follow: whether the links extracted from the response by this rule should themselves be followed.
  • process_links: a function that receives the links extracted by link_extractor, used to filter out links you don't want to crawl.
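A minimal Rule sketch (parse_item here is just a placeholder method name; as noted above, the callback must not be called parse):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rule = Rule(
    LinkExtractor(allow=r'.+article-.+\.html'),
    callback='parse_item',   # name of a spider method, passed as a string; never 'parse'
    follow=False,            # don't extract further links from the matched pages
)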

1. How to write the allow rule:
Constrain the regular expression so it matches only the URLs you want and does not overlap with URLs meant for other rules.
2. When to set follow:
If, while crawling a page, you need to keep following URLs that match the current rule (pagination, for example), set it to True; otherwise set it to False.
3. When to use callback:
If you need to extract detailed data from the matched page, specify a callback; otherwise there is no need to set one.

Now let's look at the code:
wxapp_spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from wxapp.items import WxappItem

class WxappSpiderSpider(CrawlSpider):
    name = 'wxapp_spider'
    allowed_domains = ['wxapp-union.com']
    start_urls = ['http://www.wxapp-union.com/portal.php?mod=list&catid=2&page=1']

    rules = (
        # list/pagination pages: keep following them, no parsing needed
        Rule(LinkExtractor(allow=r'.+mod=list&catid=2&page=\d'), follow=True),
        # article detail pages: parse them with parse_detail, don't follow further
        Rule(LinkExtractor(allow=r'.+article-.+\.html'), callback="parse_detail", follow=False),
    )

    def parse_detail(self, response):
        # called for every article detail page matched by the second rule
        title = response.xpath("//h1[@class='ph']/text()").get()
        author_p = response.xpath("//p[@class='authors']")
        author = author_p.xpath(".//a/text()").get()
        pub_time = author_p.xpath(".//span/text()").get()
        # join all text nodes of the article body into a single string
        article_content = response.xpath("//td[@id='article_content']//text()").getall()
        content = "".join(article_content).strip()
        item = WxappItem(title=title, author=author, pub_time=pub_time, content=content)
        yield item
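Before running the full crawl, the XPath expressions can be sanity-checked in scrapy shell against a single article page (the URL below is only a hypothetical example of something the article regex would match):

scrapy shell "http://www.wxapp-union.com/article-1234-1.html"
>>> response.xpath("//h1[@class='ph']/text()").get()
>>> response.xpath("//p[@class='authors']//a/text()").get()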

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class WxappItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    pub_time = scrapy.Field()
    content = scrapy.Field()

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.exporters import JsonLinesItemExporter


class WxappPipeline(object):
    def __init__(self):
        # open the output file in binary mode; the exporter handles the encoding
        self.fp = open("wxjc.json", "wb")
        self.exporter = JsonLinesItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')

    def process_item(self, item, spider):
        # write each item as one JSON object per line
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.fp.close()
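As an aside, JsonLinesItemExporter writes one JSON object per line. If you preferred a single JSON array, a sketch using JsonItemExporter would look like this (WxappJsonArrayPipeline and wxjc_array.json are names made up for the example; note the extra start_exporting/finish_exporting calls this exporter needs):

from scrapy.exporters import JsonItemExporter


class WxappJsonArrayPipeline(object):
    def __init__(self):
        # hypothetical output file name for this variant
        self.fp = open("wxjc_array.json", "wb")
        self.exporter = JsonItemExporter(self.fp, ensure_ascii=False, encoding='utf-8')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        # finish_exporting closes the JSON array before the file is closed
        self.exporter.finish_exporting()
        self.fp.close()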

settings.py

# -*- coding: utf-8 -*-

# Scrapy settings for wxapp project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wxapp'

SPIDER_MODULES = ['wxapp.spiders']
NEWSPIDER_MODULE = 'wxapp.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wxapp (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
  'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  'Accept-Language': 'en',
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'wxapp.middlewares.WxappDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'wxapp.pipelines.WxappPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py

from scrapy import cmdline
cmdline.execute("scrapy crawl wxapp_spider".split())
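This is just a convenience script so the crawl can be started from an IDE; it is equivalent to running the following in a terminal at the project root:

scrapy crawl wxapp_spider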

Here is a screenshot of the saved data:
[screenshot of the exported wxjc.json file]
I have to say, WeChat mini programs look pretty friendly.
Next I'd like to have a go at building a WeChat mini program myself.
