Scrapy Data Scraping Example

Original

G_Q_L

2020-02-24 01:56

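Scraping Data with Scrapy

Yesterday I practiced with a simple example. Today I take it a step further: crawling multiple pages under the same domain.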

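The target is Tencent's job listings:
Tencent Careers (騰訊招聘)

Inspecting the page shows that the information we need is in the <tr> tags with class even or odd under <tbody>.

The data needed from each <tr> is the position name, position link, number of openings, position category, work location, and publish date.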

With XPath it is easy to work out how to select each of these:
Row set: xpath("//tr[@class='even'] | //tr[@class='odd']")
Position name: xpath("./td[1]/a/text()")
Position link: xpath("./td[1]/a/@href")
Number of openings: xpath("./td[3]/text()")
Position category: xpath("./td[2]/text()")
Work location: xpath("./td[4]/text()")
Publish date: xpath("./td[5]/text()")
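
To make the row and column layout concrete, here is a minimal sketch that applies the same selection logic to a small, made-up table. It uses the standard library's ElementTree instead of Scrapy so it runs without a network connection; ElementTree supports only a subset of XPath (no "|" union, no text() or @href steps), so the class filter and the text/attribute reads are done in Python. The SAMPLE markup, its values, and the extract_rows helper are assumptions for illustration, not content from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup imitating the table structure described above.
SAMPLE = """
<table>
  <tbody>
    <tr class="even">
      <td><a href="position_detail.php?id=1">Backend Engineer</a></td>
      <td>Technology</td>
      <td>2</td>
      <td>Shenzhen</td>
      <td>2018-01-01</td>
    </tr>
    <tr class="odd">
      <td><a href="position_detail.php?id=2">Product Manager</a></td>
      <td></td>
      <td>1</td>
      <td>Beijing</td>
      <td>2018-01-02</td>
    </tr>
    <tr class="h"><td>header row, should be skipped</td></tr>
  </tbody>
</table>
"""


def extract_rows(markup):
    """Return one dict per <tr> whose class is even or odd."""
    root = ET.fromstring(markup)
    rows = []
    # ElementTree has no "|" union, so filter on the class attribute instead.
    for tr in root.iter("tr"):
        if tr.get("class") not in ("even", "odd"):
            continue
        link = tr.find("./td[1]/a")
        rows.append({
            "name": link.text,
            "link": link.get("href"),
            # An empty <td> has text None, so fall back to an empty string.
            "type": tr.find("./td[2]").text or "",
            "num": tr.find("./td[3]").text,
            "location": tr.find("./td[4]").text,
            "date": tr.find("./td[5]").text,
        })
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

The second sample row has an empty category cell, which is the case the spider later has to handle explicitly.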

1. Approach One

Create a new project named Tencent

scrapy startproject Tencent
cd Tencent

Create the spider

scrapy genspider Tenspider "tencent.com"

Edit settings.py

    # Obey robots.txt rules
    ROBOTSTXT_OBEY = False

    ITEM_PIPELINES = {
    'Tencent.pipelines.TencentPipeline': 300,
    }

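Write the pipeline

import json


class TencentPipeline(object):
    def __init__(self):
        # Open the output file in binary mode; encoded bytes are written below
        self.f = open("Tencent.json", "wb+")

    def process_item(self, item, spider):
        # Write each item as one JSON object per line, keeping non-ASCII text readable
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content.encode("utf-8"))
        return item

    def close_spider(self, spider):
        # Scrapy calls close_spider when the spider finishes
        self.f.close()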
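
Because the pipeline appends a comma and a newline after every object, the resulting file is a sequence of lines rather than one valid JSON document, so json.load on the whole file would fail. A minimal sketch of reading such output back, line by line, might look like the following. The parse_pipeline_output helper and the sample string are hypothetical and only imitate the shape of the pipeline's output.

```python
import json


def parse_pipeline_output(text):
    """Parse text where each line is a JSON object followed by a comma."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop the trailing comma that the pipeline appends to every object.
        if line.endswith(","):
            line = line[:-1]
        items.append(json.loads(line))
    return items


# Hypothetical sample in the same shape as the pipeline's output.
sample = (
    '{"people_num": ["2"], "work_location": ["深圳"]},\n'
    '{"people_num": ["1"], "work_location": ["北京"]},\n'
)
print(parse_pipeline_output(sample))
```

Each value is a list because the spider stores the result of extract(), which returns a list of matched strings.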

Write the item

import scrapy


class TencentItem(scrapy.Item):
    # Position name
    positio_name = scrapy.Field()
    # Position link
    position_link = scrapy.Field()
    # Position category
    poistion_type = scrapy.Field()
    # Number of openings
    people_num = scrapy.Field()
    # Work location
    work_location = scrapy.Field()
    # Publish date
    publish_time = scrapy.Field()

Write the spider

import scrapy
from Tencent.items import TencentItem


class TenspiderSpider(scrapy.Spider):
    # Spider name
    name = 'TenSpider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['tencent.com']
    # Base URL used to build the paginated requests
    baseURL = "http://hr.tencent.com/position.php?&start="
    # Page offset, i.e. the value of start in baseURL
    offset = 0
    # Initial URL list
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        # Row set
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        # Extract the fields from each row and store them in an item
        for node in node_list:
            # Create a fresh item per row so yielded items do not share state
            item = TencentItem()
            item["positio_name"] = node.xpath("./td[1]/a/text()").extract()
            item["position_link"] = node.xpath("./td[1]/a/@href").extract()
            item["people_num"] = node.xpath("./td[3]/text()").extract()
            item["work_location"] = node.xpath("./td[4]/text()").extract()
            item["publish_time"] = node.xpath("./td[5]/text()").extract()
            # The category cell is sometimes empty
            if len(node.xpath("./td[2]/text()")):
                item["poistion_type"] = node.xpath("./td[2]/text()").extract()
            else:
                item["poistion_type"] = ""

            yield item

        # Request the next page until the offset reaches 2100
        if self.offset < 2100:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
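
The pagination logic in the spider can be checked in isolation. The sketch below reproduces the offset arithmetic as a plain generator, without Scrapy, to show which URLs end up being requested. BASE_URL copies the spider's baseURL; the page_urls function and its parameter names are hypothetical.

```python
BASE_URL = "http://hr.tencent.com/position.php?&start="


def page_urls(last_offset=2100, step=10):
    """Yield the page URLs the spider visits, from offset 0 up to last_offset."""
    offset = 0
    yield BASE_URL + str(offset)
    # Mirrors the spider: keep advancing while the current offset is below the limit.
    while offset < last_offset:
        offset += step
        yield BASE_URL + str(offset)


urls = list(page_urls())
print(len(urls))
print(urls[0])
print(urls[-1])
```

With the limit of 2100 and a step of 10, this produces 211 URLs, from start=0 through start=2100. The hard-coded limit is the weak point: it has to match the site's actual number of pages.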

Run the spider
scrapy crawl TenSpider

The crawl took about a minute and a half.

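2. Approach Two

The approach above requires knowing the number of pages in advance. Whenever the page count changes, the code has to change too, which is inconvenient. It is mainly useful when the next-page link cannot be extracted. The second approach extracts the next-page link automatically, so the page count does not need to be known beforehand.

The page elements show how the next-page navigation behaves:
Element on the first page:

Element on the last page:

So the stop condition is reaching the last page, that is, finding an <a> tag whose id is next and whose class is noactive.

In other words, as soon as xpath("//a[@id='next' and @class='noactive']") matches anything, the stop condition is met.

If the stop condition has not been reached, extract the next-page link:
xpath("//a[@id='next']/@href")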

Scraping with approach two

In the spider file from approach one, find the following code:

        if self.offset < 2100:
            self.offset += 10
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)

Replace it with:

        if len(response.xpath("//a[@id='next' and @class='noactive']")) == 0:
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            yield scrapy.Request("http://hr.tencent.com/" + url, callback=self.parse)

Run the spider
The crawl completes.
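
The stop condition and link extraction can be sketched outside Scrapy as follows. This is a minimal illustration using the standard library, assuming a pager shaped like the one described above; the MIDDLE and LAST markup, the CURRENT URL, and the next_page_url helper are hypothetical. It also uses urljoin to build the absolute URL instead of concatenating strings, which handles both relative and absolute href values.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def next_page_url(markup, page_url):
    """Return the absolute next-page URL, or None on the last page."""
    root = ET.fromstring(markup)
    link = root.find(".//a[@id='next']")
    # Stop when there is no next link or it is marked as inactive.
    if link is None or link.get("class") == "noactive":
        return None
    return urljoin(page_url, link.get("href"))


# Hypothetical pager markup for a middle page and for the last page.
MIDDLE = '<div><a id="next" href="position.php?&amp;start=10">Next</a></div>'
LAST = '<div><a id="next" class="noactive" href="javascript:;">Next</a></div>'
CURRENT = "http://hr.tencent.com/position.php?&start=0"

print(next_page_url(MIDDLE, CURRENT))
print(next_page_url(LAST, CURRENT))
```

Inside a spider, Scrapy's response.urljoin(url) serves the same purpose as the urljoin call here.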
