Web Crawler | Collecting Tencent Recruitment Listings with Python's Scrapy Framework

Contents

1. Project tools:

2. Important scrapy command-line parameters:

3. Steps to build the crawler:

4. Scrapy spider class source analysis:

5. Crawler workflow:

6. Output formats for scraped data:

7. Python 3 type-conversion functions:

8. Scraping and saving page data

9. Sample results


1. Project tools:

  • PyCharm IDE
  • Google Chrome
  • Chrome extensions: XPath Helper, JSON Viewer

2. Important scrapy command-line parameters:

  • check : check the project's spider contracts
  • crawl : run a spider and scrape data
  • edit : open a spider file in an editor
  • fetch : fetch a URL using the Scrapy downloader
  • genspider : generate a new spider file
  • list : list the spiders available in this project
  • shell : launch an interactive shell for a given URL
  • startproject : create a new Scrapy project

3. Steps to build the crawler:

  • Create a new project (scrapy startproject Tencent)
  • Generate a spider file (scrapy genspider tencent "https://hr.tencent.com/")
  • Define what to scrape (edit items.py)
  • Write the spider (edit spiders/tencent.py)
  • Write the pipeline (edit pipelines.py to store the scraped items)
  • Configure the project (edit settings.py to enable the pipeline and other settings)
  • Run the spider (command line: scrapy crawl tencent)
  • (running inside PyCharm requires a start.py helper, described below)

4. Scrapy spider class source analysis:

Covered in other posts on this blog.

5. Crawler workflow:
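The workflow can be sketched as a simple loop (a stdlib-only illustration; the function and variable names below are invented for the sketch and are not Scrapy internals): the engine takes requests from the scheduler, the downloader fetches them, the spider's parse() yields items (routed to the pipelines) and follow-up requests (routed back to the scheduler), until the queue is empty.

```python
from collections import deque


def downloader(url):
    """Stands in for the real downloader: fetch a URL, return a response."""
    return f"<response for {url}>"


def parse(response):
    """Stands in for the spider's parse(): yield extracted items."""
    yield {"item_from": response}
    # a real spider may also yield follow-up requests here


def run(start_urls):
    scheduler = deque(start_urls)  # the scheduler's request queue
    items = []
    while scheduler:
        url = scheduler.popleft()            # engine pulls a request
        response = downloader(url)           # downloader fetches it
        for result in parse(response):       # spider parses the response
            if isinstance(result, dict):
                items.append(result)         # item -> pipeline
            else:
                scheduler.append(result)     # request -> back to scheduler
    return items


results = run(["https://hr.tencent.com/position.php"])
```

In real Scrapy the same loop runs asynchronously on top of Twisted, with downloader and spider middlewares between each hop.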

6. Output formats for scraped data:

  • JSON, Unicode-escaped by default (scrapy crawl tencent -o tencent.json)
  • JSON Lines, Unicode-escaped by default (scrapy crawl tencent -o tencent.jsonl)
  • CSV, comma-separated, can be opened in Excel (scrapy crawl tencent -o tencent.csv)
  • XML (scrapy crawl tencent -o tencent.xml)

7. Python 3 type-conversion functions:

  • int(x, base=10): converts x to an integer; base sets the radix when x is a string
  • float(x): converts x to a floating-point number (unlike int, float takes no base argument)
  • complex(real, imag=0): creates a complex number
  • str(object, encoding=..., errors="strict"): creates a string; encoding and errors apply only when decoding a bytes-like object
  • repr(object): built-in; returns the canonical string representation of an object
  • eval(source, globals=None, locals=None): built-in; source may be a Python string or a code object returned by compile(); globals must be a dict, locals may be any mapping
  • tuple(seq=()): built-in; converts a sequence to a tuple (a tuple argument is returned unchanged)
  • list(seq=()): converts the sequence seq to a list
  • chr(x): for 0 <= x <= 0x10FFFF, returns the Unicode character for code point x
  • ord(c): built-in; returns the Unicode code point of the single character c
  • hex(x): built-in; converts an integer to a hexadecimal string
  • oct(x): built-in; converts an integer to an octal string
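A few quick examples exercising the conversion functions listed above (all standard built-ins):

```python
# int() with an explicit base parses a string in that radix
assert int("ff", base=16) == 255

# float() takes a single value; there is no base parameter
assert float("3.14") == 3.14

# complex() builds a complex number from real and imaginary parts
assert complex(1, 2) == 1 + 2j

# chr() and ord() are inverses over Unicode code points
assert chr(0x4E2D) == "中" and ord("中") == 0x4E2D

# repr() gives a canonical string form; eval() can round-trip simple literals
assert eval(repr([1, 2, 3])) == [1, 2, 3]

# tuple() and list() convert between sequence types
assert tuple("ab") == ("a", "b")
assert list((1, 2)) == [1, 2]

# hex() and oct() render integers as prefixed strings
assert hex(255) == "0xff"
assert oct(8) == "0o10"
```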

8. Scraping and saving page data

    items.py   

  • Scraping aims to extract structured data from unstructured sources, and Scrapy provides the Item class for this need. An Item object is a simple container for scraped data, offering a dictionary-like API and a concise syntax for declaring its fields.
  • ① A Field object holds the metadata for its field (any metadata); the values a Field accepts are unrestricted.
  • ② The main purpose of Field objects is to define all field metadata in one place.
  • ③ Note that the Field objects declared on an item are not kept as class attributes; they can be accessed through item.fields.
  • ④ The Field class is merely an alias of the built-in dict and adds no extra methods or attributes; it exists so the item declaration syntax can be based on class attributes.
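To make points ③ and ④ concrete, here is a simplified, stdlib-only model of the Item/Field mechanism (illustrative only; the real classes are scrapy.Item and scrapy.Field, and JobItem and the metaclass below are invented for this sketch):

```python
class Field(dict):
    """Like scrapy.Field: just a dict alias used to hold arbitrary metadata."""


class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        # Collect declared Field objects and remove them from the class body,
        # so they never become class attributes (point ③)
        fields = {}
        for key, value in list(attrs.items()):
            if isinstance(value, Field):
                fields[key] = attrs.pop(key)
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = fields
        return cls


class Item(dict, metaclass=ItemMeta):
    def __setitem__(self, key, value):
        # only declared fields may be assigned
        if key not in self.fields:
            raise KeyError(f"{key} is not a declared field")
        super().__setitem__(key, value)


class JobItem(Item):
    title = Field(serializer=str)  # metadata is unconstrained (point ①)
    link = Field()


item = JobItem()
item["title"] = "engineer"
assert item["title"] == "engineer"
assert set(JobItem.fields) == {"title", "link"}
assert not hasattr(JobItem, "title")  # Field is not a class attribute
assert JobItem.fields["title"]["serializer"] is str
```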

[Code example]

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


# Define the fields to scrape
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # 1. Job title
    positionName = scrapy.Field()

    # 2. Link to the job detail page
    positionLink = scrapy.Field()

    # 3. Job category
    positionType = scrapy.Field()

    # 4. Number of openings
    positionNameNumber = scrapy.Field()

    # 5. Work location
    workLocation = scrapy.Field()

    # 6. Publish date
    publishTime = scrapy.Field()

   tencent.py   

  • The main spider file: it defines the crawler and starts scraping pages.

[Code example]

# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem


class TencentSpider(scrapy.Spider):
    # 1. Required. The string that names the spider; Scrapy uses it to locate
    #    and instantiate the spider. Conventionally the site's name without suffix.
    name = 'tencent'
    # 2. Optional. The list of domains the spider is allowed to crawl.
    allowed_domains = ['www.tencent.com']


    # 1. Alternative: build each page's address from a base URL plus an offset
    # baseURL = "https://hr.tencent.com/position.php?&start="
    # offset into the URL's start parameter
    # offset = 0
    # start_urls = [baseURL + str(offset)]


    baseURL = "https://hr.tencent.com/"
    start_urls = ["https://hr.tencent.com/position.php?&start0#a="]

    # 3. Called once each initial URL has finished downloading, with the
    #    response object returned for that URL as its only argument.
    # Its two jobs:
    #   - parse the returned page data (response.body) and extract structured data (items)
    #   - generate the requests for the next pages to crawl
    def parse(self, response):
        """Handle the response."""
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")

        for node in node_list:
            item = TencentItem()

            # 4. Extract each job field as a Unicode string.
            #    The XPath expressions can be verified with the XPath Helper Chrome extension.
            item["positionName"] = node.xpath("./td[1]/a/text()").extract()[0]
            item["positionLink"] = node.xpath("./td[1]/a/@href").extract()[0]
            if len(node.xpath("./td[2]/text()")):
                item["positionType"] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item["positionType"] = ""

            item["positionNameNumber"] = node.xpath("./td[3]/text()").extract()[0]
            item["workLocation"] = node.xpath("./td[4]/text()").extract()[0]
            item["publishTime"] = node.xpath("./td[5]/text()").extract()[0]

            # yield hands the item back, then execution resumes after the yield
            yield item

        # # 3. Pagination by concatenating URLs: needed when the next-page link
        # # cannot be read from the page and the URL must be built by hand.
        # if self.offset < 3920:
        #     self.offset += 10
        #     url = self.baseURL + str(self.offset)
        #
        #     # 4. The request's domain conflicts with allowed_domains and would be
        #     #    filtered out; dont_filter=True disables the filter.
        #     yield scrapy.Request(url, callback=self.parse, dont_filter=True)

        # 5. Read the next-page link directly from the response and follow it
        #    until all links have been exhausted.
        # Check whether this is the last page (matching two attributes on the same tag)
        if not len(response.xpath("//a[@class='noactive' and @id='next']")):
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            # 6. The request goes back to the engine, which hands it to the
            #    scheduler; the scheduler enqueues it.
            yield scrapy.Request(self.baseURL + url, callback=self.parse, dont_filter=True)
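One caveat about the spider above: extract()[0] raises IndexError whenever an XPath matches nothing. Scrapy's SelectorList provides extract_first(default=...) (and, in later versions, get(default=...)) for exactly this case; the pattern it implements is simply:

```python
def first_or_default(values, default=""):
    """Return the first extracted value, or a default when nothing matched."""
    return values[0] if values else default


# mirrors the positionType handling above, without the explicit if/else
assert first_or_default(["技術類"]) == "技術類"
assert first_or_default([]) == ""
```

With that, the if/else around positionType collapses to node.xpath("./td[2]/text()").extract_first("").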

pipelines.py  

  • The pipeline file stores the scraped items.
  • [Code example] (writes to a local file)
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # 1. Convert the scrapy Item into a plain Python dict,
        # 2. then serialize that dict to JSON
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)

        return item

    def close_spider(self, spider):
        self.f.close()
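The two conversion steps inside process_item can be seen in isolation (the sample dict below stands in for dict(item) on one scraped row):

```python
import json

# stands in for dict(item) on one scraped row
row = {"positionName": "工程師", "workLocation": "深圳"}

# ensure_ascii=False keeps CJK characters readable in the output file
line = json.dumps(row, ensure_ascii=False) + ",\n"
assert "工程師" in line

# the default (ensure_ascii=True) would escape them to \uXXXX sequences
assert "\\u" in json.dumps(row)
```

Since json.dumps returns a str, it can be written to a text-mode file directly; no extra encode/decode round-trip is needed.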

settings.py   

  • Enables the pipeline component and related settings.
  • [Code example]
# -*- coding: utf-8 -*-

# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Tencent'

SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tencent (+http://www.yourdomain.com)'

# 1. Whether to obey the robots.txt protocol (set to False here)
ROBOTSTXT_OBEY = False

# 2. Maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# 3. Pipeline priority: the lower the number, the earlier the pipeline runs
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'Tencent.pipelines.TencentPipeline': 100,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

start.py   

  • Launches the whole crawler (so it can be run from PyCharm).
  • [Code example]
# -*- coding:utf-8 -*-
from scrapy import cmdline


cmdline.execute("scrapy crawl tencent".split())

9. Sample results

{"positionName": "15618-角色原畫設計(上海)", "positionLink": "position_detail.php?id=39412&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-美術特效設計師(上海)", "positionLink": "position_detail.php?id=39414&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-資深技術美術(上海)", "positionLink": "position_detail.php?id=39415&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告QA質量管理工程師(深圳)", "positionLink": "position_detail.php?id=39404&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告平臺項目經理(深圳)", "positionLink": "position_detail.php?id=39406&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "20503-優圖Web資深前端開發工程師(上海)", "positionLink": "position_detail.php?id=39407&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "SNG04-內容合作-創作者運營(北京)", "positionLink": "position_detail.php?id=39408&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "2", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "23675-Android客戶端開發(北京)", "positionLink": "position_detail.php?id=39409&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "23675-WEB前端開發工程師(北京)", "positionLink": "position_detail.php?id=39410&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "OMG064-應用開發工程師(上海)", "positionLink": "position_detail.php?id=39411&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-3D場景設計師(上海)", "positionLink": "position_detail.php?id=39413&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15583-棋牌遊戲運營(深圳)", "positionLink": "position_detail.php?id=39397&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SNG03-遊戲社交產品經理(深圳)", "positionLink": "position_detail.php?id=39400&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SNG03-語音社區運營經理(深圳)", "positionLink": "position_detail.php?id=39401&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告系統後臺開發高級工程師(北京)", "positionLink": "position_detail.php?id=39402&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "3", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告商業分析經理(深圳)", "positionLink": "position_detail.php?id=39403&keywords=&tid=0&lid=0", "positionType": "市場類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告高級系統測試工程師(深圳)", "positionLink": "position_detail.php?id=39405&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "23674-語音資訊後臺開發高級工程師(北京)", "positionLink": "position_detail.php?id=39391&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "21529-高級法律顧問(深圳)", "positionLink": "position_detail.php?id=39392&keywords=&tid=0&lid=0", "positionType": "職能類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "15579-互娛3D角色模型師(深圳)", "positionLink": "position_detail.php?id=39394&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},

 
