Table of Contents
1. Crawler project tools:
- PyCharm IDE
- Google Chrome
- Chrome extensions: XPath Helper, a JSON viewer plugin
2. Key scrapy command-line subcommands:
- check : check the spiders' contracts
- crawl : run a spider and scrape data
- edit : open a spider file in the editor
- fetch : fetch a URL using the Scrapy downloader
- genspider : generate a new spider file
- list : list the spiders available in the project
- shell : launch an interactive shell for the given URL
- startproject : create a new Scrapy project
3. Steps to build a crawler:
- Create a new project (scrapy startproject Tencent)
- Generate a spider file (scrapy genspider tencent "https://hr.tencent.com/")
- Define the data to scrape (edit items.py)
- Write the spider (edit spiders/spider.py)
- Write the pipeline (edit pipelines.py to store the scraped content)
- Configure the project (edit settings.py to enable the pipeline and other settings)
- Run the spider (command line: scrapy crawl tencent)
- (running inside PyCharm requires a start.py helper, introduced below)
4. Scrapy spider class source-code analysis:
Covered in other articles on this blog.
5. How the crawler works:
6. Output formats for scraped data:
- JSON, Unicode-escaped by default (scrapy crawl tencent -o tencent.json)
- JSON Lines, Unicode-escaped by default (scrapy crawl tencent -o tencent.jsonl)
- CSV, comma-separated, can be opened in Excel (scrapy crawl tencent -o tencent.csv)
- XML (scrapy crawl tencent -o tencent.xml)
7. Python 3 type-conversion functions:
- int(x, base=10): construct an integer; base gives the radix when x is a string
- float(x): construct a floating-point number from a number or string (float takes no base parameter)
- complex(real, imag=0): construct a complex number
- str(object, encoding='utf-8', errors='strict'): construct a string; when encoding is given, decodes a bytes object
- repr(object): built-in function; returns the canonical string representation of an object
- eval(expression, globals=None, locals=None): built-in function; the source may be a Python string or a code object returned by compile(). globals must be a dict, while locals can be any mapping
- tuple(iterable=()): built-in function; converts an iterable to a tuple (a tuple is returned unchanged)
- list(iterable=()): converts an iterable into a list
- chr(x): for 0 <= x <= 0x10FFFF, returns the Unicode character for code point x
- ord(c): built-in function; returns the Unicode code point of the single character c
- hex(x): built-in function; converts an integer to a hexadecimal string
- oct(x): built-in function; converts an integer to an octal string
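A quick, runnable tour of the conversion functions listed above:

```python
# Demonstrate the built-in conversion functions from the list above.
assert int("ff", base=16) == 255               # parse a hex string
assert float("3.14") == 3.14                   # string to float
assert complex(1, 2) == 1 + 2j                 # build a complex number
assert str(b"abc", encoding="utf-8") == "abc"  # decode a bytes object
assert repr("hi") == "'hi'"                    # canonical representation
assert eval("1 + 2") == 3                      # evaluate a Python expression
assert tuple([1, 2]) == (1, 2) and list((1, 2)) == [1, 2]
assert chr(0x4E2D) == "中" and ord("中") == 0x4E2D
assert hex(255) == "0xff" and oct(8) == "0o10"
print("all conversions OK")
```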
8. Scraping and saving page data
items.py
- The goal of scraping is to extract structured data from unstructured sources, and Scrapy provides the Item class for this. An Item object is a simple container for scraped data: it offers a dictionary-like API plus a simple syntax for declaring the available fields.
- ① A Field object holds each field's metadata (any kind of metadata); there are no restrictions on the values a Field object accepts.
- ② The main purpose of Field objects is to define all the field metadata in one place.
- ③ Note that the Field objects used to declare an item are not kept as class attributes (they can be accessed through item.fields).
- ④ The Field class is just an alias of the built-in dict and adds no extra methods or attributes; it exists to support the class-attribute-based item declaration syntax.
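To make points ③ and ④ concrete, here is a minimal sketch (not Scrapy's actual source code) of how a metaclass can collect Field declarations into a `fields` dict while removing them from the class attributes; `JobItem` is a hypothetical item class for illustration:

```python
class Field(dict):
    """Like scrapy.Field: just a dict alias used to hold arbitrary metadata."""

class ItemMeta(type):
    def __new__(mcs, name, bases, attrs):
        fields = {}
        for key, value in list(attrs.items()):
            if isinstance(value, Field):
                # Move Field declarations out of the class attributes...
                fields[key] = attrs.pop(key)
        cls = super().__new__(mcs, name, bases, attrs)
        # ...and expose them through the `fields` mapping instead (point ③).
        cls.fields = fields
        return cls

class Item(dict, metaclass=ItemMeta):
    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError(f"{key!r} is not a declared field")
        super().__setitem__(key, value)

class JobItem(Item):
    positionName = Field()  # declared field; not kept as a class attribute

job = JobItem()
job["positionName"] = "engineer"   # dictionary-like API
print(JobItem.fields)              # {'positionName': {}}
```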
[Code example]
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy

# Declare the target fields to scrape
class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # 1. Position name
    positionName = scrapy.Field()
    # 2. Position detail link
    positionLink = scrapy.Field()
    # 3. Position category
    positionType = scrapy.Field()
    # 4. Number of openings
    positionNameNumber = scrapy.Field()
    # 5. Work location
    workLocation = scrapy.Field()
    # 6. Publish date
    publishTime = scrapy.Field()
tencent.py
- The main spider file: defines the spider and crawls the pages.
[Code example]
# -*- coding: utf-8 -*-
import scrapy
from Tencent.items import TencentItem

class TencentSpider(scrapy.Spider):
    # 1. Required. A string naming the spider; Scrapy uses it to locate and
    #    instantiate the spider. Conventionally the site name without suffix.
    name = 'tencent'
    # 2. Optional. Domains the spider is allowed to crawl.
    allowed_domains = ['www.tencent.com']

    # Alternative: build the listing URLs from a base address plus an offset
    # baseURL = "https://hr.tencent.com/position.php?&start="
    # URL offset
    # offset = 0
    # start_urls = [baseURL + str(offset)]
    baseURL = "https://hr.tencent.com/"
    start_urls = ["https://hr.tencent.com/position.php?&start0#a="]

    # 3. Parsing callback: called once each initial URL has been downloaded,
    #    with the response object for that URL as its only argument.
    # Responsibilities:
    #   - parse the returned page data (response.body) and extract structured data (items)
    #   - generate the requests for the next pages
    def parse(self, response):
        """Process the response."""
        node_list = response.xpath("//tr[@class='even'] | //tr[@class='odd']")
        for node in node_list:
            item = TencentItem()
            # 4. Extract each job field as a Unicode string. The XPath
            #    expressions can be verified with the Chrome XPath Helper plugin.
            item["positionName"] = node.xpath("./td[1]/a/text()").extract()[0]
            item["positionLink"] = node.xpath("./td[1]/a/@href").extract()[0]
            if len(node.xpath("./td[2]/text()")):
                item["positionType"] = node.xpath("./td[2]/text()").extract()[0]
            else:
                item["positionType"] = ""
            item["positionNameNumber"] = node.xpath("./td[3]/text()").extract()[0]
            item["workLocation"] = node.xpath("./td[4]/text()").extract()[0]
            item["publishTime"] = node.xpath("./td[5]/text()").extract()[0]
            # yield hands the item back; execution then resumes after the yield
            yield item

        # # Pagination by concatenating URLs: needed when the next-page link
        # # cannot be extracted from the page and must be built manually.
        # if self.offset < 3920:
        #     self.offset += 10
        #     url = self.baseURL + str(self.offset)
        #
        #     # The request URL conflicts with allowed_domains and would be
        #     # filtered out; disable the filter with dont_filter=True.
        #     yield scrapy.Request(url, callback=self.parse, dont_filter=True)

        # 5. Extract the next-page link directly from the response and follow it
        #    until all links have been processed.
        # Check whether this is the last page (combining several attributes of
        # the same tag in one XPath predicate)
        if not len(response.xpath("//a[@class='noactive' and @id='next']")):
            url = response.xpath("//a[@id='next']/@href").extract()[0]
            # 6. Send the request back to the engine, which hands it to the
            #    scheduler for enqueueing.
            yield scrapy.Request(self.baseURL + url, callback=self.parse, dont_filter=True)
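The commented-out offset pagination can be sketched in isolation as plain Python; the limit of 3920 and the step of 10 come from the spider above, and `page_urls` is a hypothetical helper added for illustration:

```python
# Standalone sketch of offset-based pagination: build successive
# listing-page URLs by appending an increasing offset to a base address.
BASE_URL = "https://hr.tencent.com/position.php?&start="

def page_urls(limit=3920, step=10):
    """Yield listing-page URLs for offsets 0, step, 2*step, ... up to limit."""
    offset = 0
    while offset <= limit:
        yield BASE_URL + str(offset)
        offset += step

# With a small limit for demonstration: offsets 0, 10, 20, 30.
urls = list(page_urls(limit=30, step=10))
print(urls)
```

In the spider itself each `parse` call advances the offset by one step and yields a single `scrapy.Request` for the next page, rather than generating all URLs up front.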
pipelines.py
- The pipeline file: stores the scraped items.
- [Code example] (write to a local file on disk)
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TencentPipeline(object):
    def __init__(self):
        self.f = open("tencent.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # 1. Convert the scrapy item into a plain Python dict
        # 2. Serialize the dict to JSON, keeping non-ASCII characters readable
        content = json.dumps(dict(item), ensure_ascii=False) + ",\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
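The `ensure_ascii=False` flag in the pipeline is what keeps the Chinese text readable in the output file; with the default `ensure_ascii=True`, `json.dumps` escapes every non-ASCII character as a `\uXXXX` sequence:

```python
import json

item = {"positionName": "後臺開發工程師", "workLocation": "深圳"}

escaped = json.dumps(item)                      # default: ensure_ascii=True
readable = json.dumps(item, ensure_ascii=False)

assert "\\u" in escaped    # Chinese escaped as \uXXXX sequences
assert "深圳" in readable   # Chinese kept verbatim
print(readable)
```

Both strings parse back to the same dict; the flag only affects how the file looks when opened in an editor.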
settings.py
- Enables the pipeline component and other settings.
- [Code example]
# -*- coding: utf-8 -*-
# Scrapy settings for Tencent project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Tencent'
SPIDER_MODULES = ['Tencent.spiders']
NEWSPIDER_MODULE = 'Tencent.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tencent (+http://www.yourdomain.com)'
# 1. Whether to obey the site's robots.txt rules (set to False here)
ROBOTSTXT_OBEY = False
# 2. Maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tencent.middlewares.TencentDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# 3. Pipeline priority: the lower the number, the higher the priority
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'Tencent.pipelines.TencentPipeline': 100,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
start.py
- Launches the whole crawl (needed for running inside PyCharm).
- [Code example]
# -*- coding:utf-8 -*-
from scrapy import cmdline

cmdline.execute("scrapy crawl tencent".split())
9. Sample scraped results
{"positionName": "15618-角色原畫設計(上海)", "positionLink": "position_detail.php?id=39412&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-美術特效設計師(上海)", "positionLink": "position_detail.php?id=39414&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-資深技術美術(上海)", "positionLink": "position_detail.php?id=39415&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告QA質量管理工程師(深圳)", "positionLink": "position_detail.php?id=39404&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告平臺項目經理(深圳)", "positionLink": "position_detail.php?id=39406&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "20503-優圖Web資深前端開發工程師(上海)", "positionLink": "position_detail.php?id=39407&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "SNG04-內容合作-創作者運營(北京)", "positionLink": "position_detail.php?id=39408&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "2", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "23675-Android客戶端開發(北京)", "positionLink": "position_detail.php?id=39409&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "23675-WEB前端開發工程師(北京)", "positionLink": "position_detail.php?id=39410&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "OMG064-應用開發工程師(上海)", "positionLink": "position_detail.php?id=39411&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15618-3D場景設計師(上海)", "positionLink": "position_detail.php?id=39413&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "1", "workLocation": "上海", "publishTime": "2018-04-18"},
{"positionName": "15583-棋牌遊戲運營(深圳)", "positionLink": "position_detail.php?id=39397&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SNG03-遊戲社交產品經理(深圳)", "positionLink": "position_detail.php?id=39400&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SNG03-語音社區運營經理(深圳)", "positionLink": "position_detail.php?id=39401&keywords=&tid=0&lid=0", "positionType": "產品/項目類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告系統後臺開發高級工程師(北京)", "positionLink": "position_detail.php?id=39402&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "3", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告商業分析經理(深圳)", "positionLink": "position_detail.php?id=39403&keywords=&tid=0&lid=0", "positionType": "市場類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "SA-騰訊社交廣告高級系統測試工程師(深圳)", "positionLink": "position_detail.php?id=39405&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "23674-語音資訊後臺開發高級工程師(北京)", "positionLink": "position_detail.php?id=39391&keywords=&tid=0&lid=0", "positionType": "技術類", "positionNameNumber": "1", "workLocation": "北京", "publishTime": "2018-04-18"},
{"positionName": "21529-高級法律顧問(深圳)", "positionLink": "position_detail.php?id=39392&keywords=&tid=0&lid=0", "positionType": "職能類", "positionNameNumber": "1", "workLocation": "深圳", "publishTime": "2018-04-18"},
{"positionName": "15579-互娛3D角色模型師(深圳)", "positionLink": "position_detail.php?id=39394&keywords=&tid=0&lid=0", "positionType": "設計類", "positionNameNumber": "2", "workLocation": "深圳", "publishTime": "2018-04-18"},