Trying a crawl with the Scrapy shell
$ scrapy shell https://www.zhipin.com/c101280100/
2020-02-07 10:42:20 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 10:42:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-02-07 10:42:20 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2020-02-07 10:42:20 [scrapy.extensions.telnet] INFO: Telnet Password: e782852fc4dca748
2020-02-07 10:42:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 10:42:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-07 10:42:20 [scrapy.core.engine] INFO: Spider opened
2020-02-07 10:42:21 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101280100/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x1048ed820>
[s] item {}
[s] request <GET https://www.zhipin.com/c101280100/>
[s] response <403 https://www.zhipin.com/c101280100/>
[s] settings <scrapy.settings.Settings object at 0x1048ed3d0>
[s] spider <DefaultSpider 'default' at 0x104db0490>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
- Available Scrapy objects: the objects you can use inside the Scrapy shell
- scrapy: the scrapy module itself (contains scrapy.Request, scrapy.Selector, etc.)
- crawler: the current Crawler object
- item: the scraped item
- request: the request sent to the site
- response: the response received from the server (the 403 here means we did not get a real page back; a 403 from the target site indicates it has anti-crawling measures in place, whereas a successful response would be 200)
- settings: the settings of the current project
- spider: the default spider
Solution:
The usual remedies are:
- Browser impersonation
- Simulated login
Since we are only debugging in the shell here, browser impersonation is the way to go:
- To make Scrapy impersonate a browser, set the User-Agent header when sending the request, giving it the same value a real browser would send.
- Look up that value in a real browser (e.g. in the developer tools' Network panel).
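In a full Scrapy project (as opposed to the one-off shell flag used below), the same impersonation is usually configured once in settings.py. A minimal sketch; the UA string below is only an illustrative example, not one taken from this session:

```python
# settings.py (project-wide)
# Example UA string only -- replace it with the one your own browser sends,
# copied from the developer tools' Network panel.
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/79.0.3945.130 Safari/537.36'
)
```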
Exit the current shell with Control + D (or exit()); note that Control + Z merely suspends the process rather than terminating it.
Then run:
$ scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/c101280100/
2020-02-07 10:52:48 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 10:52:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-02-07 10:52:48 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0'}
2020-02-07 10:52:48 [scrapy.extensions.telnet] INFO: Telnet Password: 0a8453c96491ca13
2020-02-07 10:52:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 10:52:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-02-07 10:52:48 [scrapy.core.engine] INFO: Spider opened
2020-02-07 10:52:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=> from <GET https://www.zhipin.com/c101280100/>
2020-02-07 10:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x105c93640>
[s] item {}
[s] request <GET https://www.zhipin.com/c101280100/>
[s] response <200 https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=>
[s] settings <scrapy.settings.Settings object at 0x105c93400>
[s] spider <DefaultSpider 'default' at 0x1061578b0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
request <GET https://www.zhipin.com/c101280100/>
response <200 https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=>
The 200 status means the fetch succeeded,
but the response URL does not match the request URL: we were redirected to a security-check page, another anti-crawling measure.
We will leave that alone for now and switch to a different URL:
$ scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/
[s] request <GET https://www.zhipin.com/>
[s] response <200 https://www.zhipin.com/>
Extracting data
Extraction with XPath
response.xpath('//div[@class="text"]/a/text()').extract()
['後端開發', 'Java', 'C++', 'PHP', '數據挖掘', 'C', 'C#', '.NET', 'Hadoop', 'Python', 'Delphi', 'VB', 'Perl', 'Ruby', 'Node.js', '搜索算法', 'Golang', '推薦算法', 'Erlang', '算法工程師', '語音/視頻/圖形開發', '數據採集', ...]
The list is too long to write out in full.
This is the most practical shorthand form of an XPath extraction.
Extraction with CSS selectors
response.css('div.menu-sub div.text').extract()
['<div class="text">\n <a ka="search_100199" href="/c101010100-p100199/">後端開發</a>\n
<a ka="search_100101" href="/c101010100-p100101/">Java</a>\n<a ka="search_100102" href="/c101010100-p100102/">C++</a>\n <a ka="search_100103" href="/c101010100-p100103/">PHP</a>\n
<a ka="search_100104" href="/c101010100-p100104/">數據挖掘</a>\n
<a ka="search_100105" href="/c101010100-p100105/">C</a>\n
... (omitted)