Table of Contents
1. Crawler project tools:
- PyCharm IDE
- Google Chrome
- Chrome extensions: XPath Helper, a JSON viewer plugin
2. Key scrapy command-line subcommands:
- check : check the spiders' contracts
- crawl : run a spider and scrape data
- edit : open a spider file in the editor
- fetch : fetch a URL using the Scrapy downloader
- genspider : generate a new spider file
- list : list the spiders available in the project
- shell : launch an interactive shell for the given URL
- startproject : create a new Scrapy project
3. Steps to build a crawler:
- Create a new project (scrapy startproject Tencent)
- Generate a spider file (scrapy genspider tencent "https://hr.tencent.com/")
- Define the data to scrape (edit items.py)
- Write the spider (edit spiders/spider.py)
- Write the pipeline (edit pipelines.py to store the scraped content)
- Configure the project (edit settings.py to enable the pipeline and other settings)
- Run the spider (command line: scrapy crawl tencent)
- (running inside PyCharm requires a start.py helper, introduced below)
4. Scrapy spider class source-code analysis:
Covered in other articles on this blog.
5. How the crawler works:
6. Output formats for scraped data:
- JSON, Unicode-escaped by default (scrapy crawl tencent -o tencent.json)
- JSON Lines, Unicode-escaped by default (scrapy crawl tencent -o tencent.jsonl)
- CSV, comma-separated, can be opened in Excel (scrapy crawl tencent -o tencent.csv)
- XML (scrapy crawl tencent -o tencent.xml)
7. Python 3 type-conversion functions:
- int(x, base=10): construct an integer; base gives the radix when x is a string
- float(x): construct a floating-point number from a number or string (float takes no base parameter)
- complex(real, imag=0): construct a complex number
- str(object, encoding='utf-8', errors='strict'): construct a string; when encoding is given, decodes a bytes object
- repr(object): built-in function; returns the canonical string representation of an object
- eval(expression, globals=None, locals=None): built-in function; the source may be a Python string or a code object returned by compile(). globals must be a dict, while locals can be any mapping
- tuple(iterable=()): built-in function; converts an iterable to a tuple (a tuple is returned unchanged)
- list(iterable=()): converts an iterable into a list
- chr(x): for 0 <= x <= 0x10FFFF, returns the Unicode character for code point x
- ord(c): built-in function; returns the Unicode code point of the single character c
- hex(x): built-in function; converts an integer to a hexadecimal string
- oct(x): built-in function; converts an integer to an octal string
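A quick, runnable tour of the conversion functions listed above:

```python
# Demonstrate the built-in conversion functions from the list above.
assert int("ff", base=16) == 255               # parse a hex string
assert float("3.14") == 3.14                   # string to float
assert complex(1, 2) == 1 + 2j                 # build a complex number
assert str(b"abc", encoding="utf-8") == "abc"  # decode a bytes object
assert repr("hi") == "'hi'"                    # canonical representation
assert eval("1 + 2") == 3                      # evaluate a Python expression
assert tuple([1, 2]) == (1, 2) and list((1, 2)) == [1, 2]
assert chr(0x4E2D) == "中" and ord("中") == 0x4E2D
assert hex(255) == "0xff" and oct(8) == "0o10"
print("all conversions OK")
```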
8. Scraping and saving page data
items.py
- The goal of scraping is to extract structured data from unstructured sources, and Scrapy provides the Item class for this. An Item object is a simple container for scraped data: it offers a dictionary-like API plus a simple syntax for declaring the available fields.
- ① A Field object holds each field's metadata (any kind of metadata); there are no restrictions on the values a Field object accepts.
- ② The main purpose of Field objects is to define all the field metadata in one place.
- ③ Note that the Field objects used to declare an item are not kept as class attributes (they can be accessed through item.fields).
- ④ The Field class is just an alias of the built-in dict and adds no extra methods or attributes; it exists to support the class-attribute-based item declaration syntax.
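To make points ③ and ④ concrete, here is a minimal sketch (not Scrapy's actual source code) of how a metaclass can collect Field declarations into a `fields` dict while removing them from the class attributes; `JobItem` is a hypothetical item class for illustration:

```python
class Field(dict):
    """Like scrapy.Field: just a dict alias used to hold arbitrary metadata."""

class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        fields = {}
        for key, value in list(attrs.items()):
            if isinstance(value, Field):
                # Move Field declarations out of the class attributes...
                fields[key] = attrs.pop(key)
        cls = super().__new__(mcs, name, bases, attrs)
        # ...and expose them through the `fields` mapping instead (point ③).
        cls.fields = fields
        return cls

class Item(dict, metaclass=ItemMeta):
    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class JobItem(Item):
    positionName = Field()  # declared field; not kept as a class attribute

job = JobItem()
job["positionName"] = "engineer"   # dictionary-like API
print(JobItem.fields)              # {'positionName': {}}
```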
[Code example]
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

# Declare the target fields to scrape
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # 1. Position name
    positionName = scrapy.Field()
    # 2. Position detail link
    positionLink = scrapy.Field()
    # 3. Position category
    positionType = scrapy.Field()
    # 4. Number of openings
    positionNameNumber = scrapy.Field()
    # 5. Work location
    workLocation = scrapy.Field()
    # 6. Publish date
    publishTime = scrapy.Field()
tencent.py
- The main spider file: defines the spider and crawls the pages.
[Code example]
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem

class TencentSpider(scrapy.Spider):
    # 1. Required. A string naming the spider; Scrapy uses it to locate and
    #    instantiate the spider. Conventionally the site name without suffix.
    name = 'tencent'
    # 2. Optional. Domains the spider is allowed to crawl.
    allowed_domains = ['www.tencent.com']

    # Alternative: build the listing URLs from a base address plus an offset
    # baseURL = "https://hr.tencent.com/position.php?&start="
    # URL offset
    # offset = 0
    # start_urls = [baseURL + str(offset)]
    baseURL = "https://hr.tencent.com/"
    start_urls = ["https://hr.tencent.com/position.php?&start0#a="]

    # 3. Parsing callback: called once each initial URL has been downloaded,
    #    with the response object for that URL as its only argument.
    # Responsibilities:
    #   - parse the returned page data (response.body) and extract structured data (items)
    #   - generate the requests for the next pages
    def parse(self, response):
        """Process the response."""
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            item = TencentItem()
            # 4. Extract each job field as a Unicode string. The XPath
            #    expressions can be verified with the Chrome XPath Helper plugin.
            item["positionName"] = node.xpath("./td[1]/a/text()").extract()[0]
            item["positionLink"] = node.xpath("./td[1]/a/@href").extract()[0]
            if len(node.xpath("./td[2]/text()")):
                item["positionType"] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item["positionType"] = ""
            item["positionNameNumber"] = node.xpath("./td[3]/text()").extract()[0]
            item["workLocation"] = node.xpath("./td[4]/text()").extract()[0]
            item["publishTime"] = node.xpath("./td[5]/text()").extract()[0]
            # yield hands the item back; execution then resumes after the yield
            yield item

        # # Pagination by concatenating URLs: needed when the next-page link
        # # cannot be extracted from the page and must be built manually.
        # if self.offset < 3920:
        #     self.offset += 10
        #     url = self.baseURL + str(self.offset)
        #
        #     # The request URL conflicts with allowed_domains and would be
        #     # filtered out; disable the filter with dont_filter=True.
        #     yield scrapy.Request(url, callback=self.parse, dont_filter=True)

        # 5. Extract the next-page link directly from the response and follow it
        #    until all links have been processed.
        # Check whether this is the last page (combining several attributes of
        # the same tag in one XPath predicate)
        if not len(response.xpath("//a[@class='noactive' and @id='next']")):
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            # 6. Send the request back to the engine, which hands it to the
            #    scheduler for enqueueing.
            yield scrapy.Request(self.baseURL + url, callback=self.parse, dont_filter=True)
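The commented-out offset pagination can be sketched in isolation as plain Python; the limit of 3920 and the step of 10 come from the spider above, and `page_urls` is a hypothetical helper added for illustration:

```python
# Standalone sketch of offset-based pagination: build successive
# listing-page URLs by appending an increasing offset to a base address.
BASE_URL = "https://hr.tencent.com/position.php?&start="

def page_urls(limit=3920, step=10):
    """Yield listing-page URLs for offsets 0, step, 2*step, ... up to limit."""
    offset = 0
    while offset <= limit:
        yield BASE_URL + str(offset)
        offset += step

# With a small limit for demonstration: offsets 0, 10, 20, 30.
urls = list(page_urls(limit=30, step=10))
print(urls)
```

In the spider itself each `parse` call advances the offset by one step and yields a single `scrapy.Request` for the next page, rather than generating all URLs up front.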
pipelines.py
- The pipeline file: stores the scraped items.
- [Code example] (write to a local file on disk)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # 1. Convert the scrapy item into a plain Python dict
        # 2. Serialize the dict to JSON, keeping non-ASCII characters readable
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
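The `ensure_ascii=False` flag in the pipeline is what keeps the Chinese text readable in the output file; with the default `ensure_ascii=True`, `json.dumps` escapes every non-ASCII character as a `\uXXXX` sequence:

```python
import json

item = {"positionName": "後臺開發工程師", "workLocation": "深圳"}

escaped = json.dumps(item)                      # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)

assert "\\u" in escaped    # Chinese escaped as \uXXXX sequences
assert "深圳" in readable   # Chinese kept verbatim
print(readable)
```

Both strings parse back to the same dict; the flag only affects how the file looks when opened in an editor.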
settings.py
- Enables the pipeline component and other settings.
- [Code example]
# -*- coding: utf-8 -*-
# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Tencent'
SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
# 1. Whether to obey the site's robots.txt rules (set to False here)
ROBOTSTXT_OBEY = False
# 2. Maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# 3. Pipeline priority: the lower the number, the higher the priority
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'Tencent.pipelines.TencentPipeline': 100,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
start.py
- Launches the whole crawl (needed for running inside PyCharm).
- [Code example]
# -*- coding:utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl tencent".split())
9. Sample scraped results
{"positionName": "15618-角色原畫設計(上海)", "positionLink": "position_detail.php?id=39412&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-美術特效設計師(上海)", "positionLink": "position_detail.php?id=39414&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-資深技術美術(上海)", "positionLink": "position_detail.php?id=39415&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告QA質量管理工程師(深圳)", "positionLink": "position_detail.php?id=39404&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告平臺項目經理(深圳)", "positionLink": "position_detail.php?id=39406&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "20503-優圖Web資深前端開發工程師(上海)", "positionLink": "position_detail.php?id=39407&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "SNG04-內容合作-創作者運營(北京)", "positionLink": "position_detail.php?id=39408&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "2", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "23675-Android客戶端開發(北京)", "positionLink": "position_detail.php?id=39409&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "23675-WEB前端開發工程師(北京)", "positionLink": "position_detail.php?id=39410&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "OMG064-應用開發工程師(上海)", "positionLink": "position_detail.php?id=39411&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-3D場景設計師(上海)", "positionLink": "position_detail.php?id=39413&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15583-棋牌遊戲運營(深圳)", "positionLink": "position_detail.php?id=39397&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SNG03-遊戲社交產品經理(深圳)", "positionLink": "position_detail.php?id=39400&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SNG03-語音社區運營經理(深圳)", "positionLink": "position_detail.php?id=39401&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告系統後臺開發高級工程師(北京)", "positionLink": "position_detail.php?id=39402&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "3", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告商業分析經理(深圳)", "positionLink": "position_detail.php?id=39403&keywords=&tid=0&lid=0", "positionType": "市場類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告高級系統測試工程師(深圳)", "positionLink": "position_detail.php?id=39405&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "23674-語音資訊後臺開發高級工程師(北京)", "positionLink": "position_detail.php?id=39391&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "21529-高級法律顧問(深圳)", "positionLink": "position_detail.php?id=39392&keywords=&tid=0&lid=0", "positionType": "職能類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "15579-互娛3D角色模型師(深圳)", "positionLink": "position_detail.php?id=39394&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},