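Introduction

The Scrapy framework

Scrapy is an application framework written in pure Python for crawling websites and extracting structured data.
Users only need to customize a few modules to build a crawler that fetches page content and images.
Scrapy handles network communication with Twisted, an asynchronous networking framework (its role is comparable to that of aiohttp). This improves request efficiency and speeds up downloads. Scrapy also exposes a range of middleware interfaces, so it can be adapted flexibly to different requirements.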
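To make the benefit of asynchronous networking concrete, here is a minimal sketch. It uses the standard library's asyncio rather than Twisted, and the downloads are simulated with a sleep, so it is only an analogy for what Scrapy does internally. The function names and URLs are hypothetical.

```python
import asyncio
import time


async def fake_download(url, delay=0.1):
    """Stand-in for a network request: waits without blocking other tasks."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def crawl(urls):
    # All downloads wait concurrently, so the total time is close to one
    # delay rather than the sum of all delays.
    return await asyncio.gather(*(fake_download(u) for u in urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(5)]  # hypothetical URLs
    start = time.perf_counter()
    bodies = asyncio.run(crawl(urls))
    print(len(bodies), "pages in", round(time.perf_counter() - start, 2), "s")
```

Five simulated downloads of 0.1 s each finish in roughly 0.1 s in total, because the waiting overlaps.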

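Installation

It is best to upgrade pip first and then install the dependencies below; otherwise you may run into a number of errors:
pip install --upgrade pip

pip install twisted
Installing twisted may fail with an error like this:
building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
This means that the C++ build tools are missing.

In that case it is easier to install a prebuilt twisted wheel (.whl) file.
Download page:
https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

Download the whl file that matches your environment: the number after cp is the Python version (cp36 is Python 3.6) and amd64 means 64-bit. Then run:
pip install C:\Users\CR\Downloads\Twisted-17.5.0-cp36-cp36m-win_amd64.whl
(The last argument is the absolute path of the whl file.)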

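lxml was probably installed earlier; if so, skip this step:
pip install lxml

This one should install without problems (pywin32 is needed on Windows only):
pip install pywin32

Install the Scrapy framework:
pip install Scrapy

If the installation fails partway with an error that mentions a timeout, it is most likely a network problem; retrying a few times usually works.

The least troublesome option is to switch to a mirror index directly (replace package-name with the package to install):
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple package-name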
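After installing, it can help to confirm that the packages are importable. The following is a small sketch using only the standard library; the helper name missing_packages is made up for this example, and pywin32 is left out because its import names differ from the package name.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the packages installed in this section.
    required = ["twisted", "lxml", "scrapy"]
    missing = missing_packages(required)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```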

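Concepts

Overall architecture

Scrapy Engine: handles communication between the Spider, Item Pipeline, Downloader, and Scheduler, including passing signals and data.

Scheduler: receives the Requests sent by the engine, orders and queues them in a defined way, and hands them back to the engine when the engine asks for them.

Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them to the Spider for processing.

Spider: processes all Responses, analyzes them and extracts the data needed for the Item fields, and submits the URLs that need to be followed to the engine, which sends them to the Scheduler again.

Item Pipeline: processes the Items obtained by the Spider and carries out post-processing (detailed analysis, filtering, storage, and so on).

Downloader Middlewares: components that let you customize and extend the download functionality.

Spider Middlewares: components that let you extend and manipulate the communication between the engine and the Spider.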

A complete request flow

--> Spider
--> Spider Middlewares
--> Scrapy Engine
--> Scheduler
--> Scrapy Engine
--> Downloader Middlewares
--> Downloader
--> Downloader Middlewares
--> Scrapy Engine
--> Spider Middlewares
--> Spider
--> Spider Middlewares
--> Scrapy Engine
--> Item Pipeline
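The flow above can be sketched as a toy, synchronous loop in plain Python. This is not how Scrapy is implemented (the real engine is asynchronous and includes the middleware layers); it only shows how the components hand work to each other. All function names and the two-page "site" are hypothetical.

```python
from collections import deque


def toy_crawl(start_urls, download, parse, pipeline):
    """Run a simplified, synchronous version of the Scrapy data flow."""
    scheduler = deque(start_urls)      # Scheduler: queue of pending requests
    seen = set(start_urls)             # avoid scheduling the same URL twice
    stored = []
    while scheduler:                   # Engine: keeps asking for the next request
        url = scheduler.popleft()
        response = download(url)       # Downloader: request -> response
        for result in parse(url, response):      # Spider: response -> items / URLs
            if isinstance(result, dict):
                stored.append(pipeline(result))  # Item Pipeline: post-process item
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)         # new request returns to the Scheduler
    return stored


# Hypothetical two-page site used only for this demonstration.
PAGES = {"page1": ["job A", "job B"], "page2": ["job C"]}


def download(url):
    return PAGES[url]


def parse(url, response):
    for title in response:
        yield {"name": title}
    if url == "page1":
        yield "page2"                  # follow-up URL


def pipeline(item):
    item["name"] = item["name"].upper()
    return item


if __name__ == "__main__":
    print(toy_crawl(["page1"], download, parse, pipeline))
```

The spider yields both items and follow-up URLs; the loop routes items to the pipeline and URLs back into the queue, which is the same split the engine performs.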

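Files and steps involved in building a crawler with Scrapy

Create a project (scrapy startproject xxx): create a new crawler project.
Define the target (write items.py): specify what you want to scrape.
Write the spider (spiders/xxspider.py): write the spider and start crawling pages.
Store the content (pipelines.py): design pipelines to store the scraped content.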

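Getting-started example

Creating the project

Run:
In a terminal, enter: scrapy startproject JobSpider

Result:
The command creates a JobSpider project directory containing scrapy.cfg and a JobSpider package with items.py, middlewares.py, pipelines.py, settings.py, and a spiders/ directory.

Defining the target

Decide which site to crawl and the URL of the information.

We will collect Python job postings in the Shanghai area, including job title, company name, location, salary, and publication date:

Target URL: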
https://search.51job.com/list/020000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare

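1. Open the items.py file in the project.

2. Create a JobspiderItem class that inherits from scrapy.Item to build the item model. Define class attributes of type scrapy.Field to hold the scraped data.

import scrapy


class JobspiderItem(scrapy.Item):
    name = scrapy.Field()      # job title
    city = scrapy.Field()      # work location
    pub_date = scrapy.Field()  # publication date
    salary = scrapy.Field()    # salary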

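Writing the spider

Generating the spider script template

Run:
In a terminal, enter:
1. cd JobSpider

2. scrapy genspider pythonPosition 51job.com

Result:
Scrapy creates the spider file pythonPosition.py in the spiders directory.

About the script

Open the template generated in the previous step: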

import scrapy

class PythonpositionSpider(scrapy.Spider):
    name = 'pythonPosition'
    allowed_domains = ['51job.com']
    start_urls = ['http://51job.com/',]

    def parse(self, response):
        pass

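Notes:

1. name = "": the spider's identifying name. It must be unique; different spiders must have different names.

2. allowed_domains = []: the domains the spider may crawl, i.e. its boundary. The spider only follows pages under these domains, and URLs outside them are ignored.

3. start_urls = (): a tuple or list of URLs to crawl. The spider starts here, so the first pages downloaded are these URLs.

4. parse(self, response): the parsing method, called after each start URL has been downloaded. The Response object returned for that URL is passed in as its only argument. Its main jobs are:
to parse the returned page data (response.body) and extract structured data (build items);
to generate the request for the next page URL that needs to be crawled.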
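To illustrate note 2, the sketch below shows roughly how a domain restriction of this kind can be evaluated: a URL passes if its host is one of the allowed domains or a subdomain of one. This is a simplified stand-in written with the standard library, not Scrapy's own code, and is_allowed is a hypothetical helper name.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    allowed = ["51job.com"]
    print(is_allowed("https://search.51job.com/list/x.html", allowed))  # subdomain
    print(is_allowed("https://www.example.com/", allowed))              # other site
```

This is why allowed_domains = ['51job.com'] also covers search.51job.com, the host used by the target URL.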

Modifying the code

import scrapy
from JobSpider.items import JobspiderItem


class PythonpositionSpider(scrapy.Spider):
    name = 'pythonPosition'
    allowed_domains = ['51job.com']

    # 1. Set start_urls to the first URL to crawl.
    start_urls = ("https://search.51job.com/list/020000,000000,0000,00,9,99,python,2,1.html?lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&confirmdate=9&fromType=&dibiaoid=0&address=&line=&specialarea=00&from=&welfare=",)

    # 2. Override parse and extract the data with XPath.
    def parse(self, response):
        # print(response.body)
        job_list = response.xpath("//div[@class='dw_table']/div[@class='el']")

        for each in job_list:
            # get(default='') returns '' instead of raising when nothing matches.
            name = each.xpath("normalize-space(./p/span/a/text())").get(default='')
            city = each.xpath("./span[@class='t3']/text()").get(default='')
            pub_date = each.xpath(".//span[@class='t5']/text()").get(default='')
            salary = each.xpath(".//span[@class='t4']/text()").get(default='')

            item = JobspiderItem()
            item['name'] = name
            item['city'] = city
            item['pub_date'] = pub_date
            item['salary'] = salary
            # Hand the scraped data over to the pipeline.
            yield item

yield can return not only data objects (items) but also request objects.
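As a sketch of how a follow-up request might be built, the helper below derives the next page's URL from the current one. It assumes that the number directly before ".html" in the target URL is the page index; that is a guess based on the URL's shape, not something the site documents. next_page_url is a hypothetical helper name.

```python
import re

# Assumption: the last number before ".html" is the page index.
PAGE_PATTERN = re.compile(r",(\d+)\.html")


def next_page_url(url):
    """Return `url` with the page number before '.html' increased by one."""
    def bump(match):
        return ",{}.html".format(int(match.group(1)) + 1)
    return PAGE_PATTERN.sub(bump, url, count=1)


if __name__ == "__main__":
    first = "https://search.51job.com/list/020000,000000,0000,00,9,99,python,2,1.html?lang=c"
    print(next_page_url(first))
    # Inside parse(), after yielding the items, the request could be issued as:
    #     yield scrapy.Request(next_page_url(response.url), callback=self.parse)
```

A real spider also needs a stop condition, for example ending when a page yields no job rows; otherwise it would keep requesting pages indefinitely.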

Storing the content

Open the pipeline file, pipelines.py:

process_item is called to handle every item object.
close_spider is called when the spider closes.

# Write the items to a CSV file:
import csv


class JobspiderPipeline(object):

    def __init__(self):
        # newline='' stops the csv module from writing blank rows on Windows.
        self.file = open('51job.csv', 'w', newline='', encoding='utf-8')
        self.wr = csv.writer(self.file, dialect="excel")
        self.wr.writerow(['name', 'pub_date', 'city', 'salary'])

    def process_item(self, item, spider):
        self.wr.writerow([item['name'], item['pub_date'], item['city'], item['salary']])
        return item

    def close_spider(self, spider):
        self.file.close()
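Because a pipeline is an ordinary class, its logic can be tried outside Scrapy. The sketch below is a variant of the pipeline above in which the output path is a constructor parameter (an assumption made here so that the example can write to a temporary file), fed with plain dicts instead of real items. CsvPipeline, run_pipeline, and the sample record are all made up for illustration.

```python
import csv
import os
import tempfile


class CsvPipeline:
    """Same shape as JobspiderPipeline, but the output path is a parameter."""

    FIELDS = ["name", "pub_date", "city", "salary"]

    def __init__(self, path):
        # newline='' lets the csv module control line endings itself.
        self.file = open(path, "w", newline="", encoding="utf-8")
        self.writer = csv.writer(self.file)
        self.writer.writerow(self.FIELDS)

    def process_item(self, item, spider):
        self.writer.writerow([item[field] for field in self.FIELDS])
        return item

    def close_spider(self, spider):
        self.file.close()


def run_pipeline(items, path):
    """Push items through the pipeline and return the rows written to disk."""
    pipeline = CsvPipeline(path)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    sample = [{"name": "Python Developer", "pub_date": "05-20",
               "city": "Shanghai", "salary": "15-20k"}]  # made-up record
    path = os.path.join(tempfile.mkdtemp(), "jobs.csv")
    print(run_pipeline(sample, path))
```

Reading the file back with csv.reader is a quick way to check that values containing commas are quoted correctly.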

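Enabling the pipeline in settings.py

Enabling an Item Pipeline component

To enable an Item Pipeline component, add its class path to the ITEM_PIPELINES setting in settings.py, for example:

ITEM_PIPELINES = {
    # 'JobSpider.pipelines.SomePipeline': 300,
    'JobSpider.pipelines.JobspiderPipeline': 300,
}

The integer assigned to each class determines the order in which the pipelines run: items pass through them from the lowest number to the highest. These numbers are customarily kept in the 0-1000 range; within that range the choice is free, and the lower the value, the higher the component's priority.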
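The ordering rule can be checked with a few lines of plain Python. The sketch sorts a settings-style dict by its integer values; apart from JobspiderPipeline, the pipeline names are hypothetical and do not exist in this tutorial's project.

```python
def pipeline_order(item_pipelines):
    """Return pipeline paths in the order items would pass through them."""
    ordered = sorted(item_pipelines.items(), key=lambda pair: pair[1])
    return [path for path, _priority in ordered]


if __name__ == "__main__":
    # Hypothetical pipelines; only JobspiderPipeline exists in this tutorial.
    ITEM_PIPELINES = {
        "JobSpider.pipelines.JobspiderPipeline": 300,
        "JobSpider.pipelines.DedupPipeline": 100,
        "JobSpider.pipelines.CleanPipeline": 200,
    }
    for path in pipeline_order(ITEM_PIPELINES):
        print(path)
```

With these values an item would go through DedupPipeline first, then CleanPipeline, and reach JobspiderPipeline last.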

Running the test

1. Create a run.py file in the same directory as settings.py.

2. Add this code:

from scrapy import cmdline

name = 'pythonPosition'
cmd = 'scrapy crawl {0}'.format(name)

#scrapy crawl pythonPosition
cmdline.execute(cmd.split())

Here the name variable is the spider's name attribute.

3. Right-click inside run.py and choose Run.

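Exercise

Use Scrapy to build a crawler for the Douban Movie Top 250 that collects each movie's title, synopsis, rating, quote, and related information: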
https://movie.douban.com/top250?start=0&filter=2

