Debugging Scrapy with the shell

Trying a crawl from the shell

$ scrapy shell https://www.zhipin.com/c101280100/
2020-02-07 10:42:20 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 10:42:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-02-07 10:42:20 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2020-02-07 10:42:20 [scrapy.extensions.telnet] INFO: Telnet Password: e782852fc4dca748
2020-02-07 10:42:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 10:42:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-07 10:42:20 [scrapy.core.engine] INFO: Spider opened
2020-02-07 10:42:21 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101280100/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x1048ed820>
[s]   item       {}
[s]   request    <GET https://www.zhipin.com/c101280100/>
[s]   response   <403 https://www.zhipin.com/c101280100/>
[s]   settings   <scrapy.settings.Settings object at 0x1048ed3d0>
[s]   spider     <DefaultSpider 'default' at 0x104db0490>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
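  • Available Scrapy objects: the objects you can use inside the Scrapy shell
  • scrapy: the scrapy module, which contains scrapy.Request, scrapy.Selector, and so on
  • item: the scraped item
  • request: the request sent to the site
  • response: the response returned by the server. A 403 (Forbidden) status means the server refused to serve the page, which usually indicates that the site has anti-scraping measures in place; a successful fetch returns 200
  • settings: the settings of the current project
  • spider: the default spider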
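
To make the status codes above easier to read while debugging, here is a minimal sketch built only on Python's standard library. The helper name describe_status is hypothetical and not part of Scrapy; in the shell it could be called with response.status.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short, human-readable summary of an HTTP status code."""
    phrase = HTTPStatus(code).phrase  # e.g. "Forbidden" for 403
    if code < 200:
        kind = "informational"
    elif code < 300:
        kind = "success"
    elif code < 400:
        kind = "redirect"
    elif code < 500:
        kind = "client error, request refused or invalid"
    else:
        kind = "server error"
    return f"{code} {phrase}: {kind}"


print(describe_status(403))
print(describe_status(200))
```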

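Solution:

The usual approaches are:

  1. Disguise the crawler as a browser
  2. Simulate a login

Because we are debugging in the shell here, we go with the browser disguise:

  1. To make Scrapy look like a browser, set the User-Agent header on outgoing requests to the value a real browser sends.
  2. Look up that value in a real browser (for example, in the request headers shown by its developer tools).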

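Exit the current shell session (Ctrl+D or exit(); note that on macOS and Linux, Ctrl+Z only suspends the process rather than terminating it),
then run: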

$ scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/c101280100/
2020-02-07 10:52:48 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 10:52:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d  10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-02-07 10:52:48 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0'}
2020-02-07 10:52:48 [scrapy.extensions.telnet] INFO: Telnet Password: 0a8453c96491ca13
2020-02-07 10:52:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 10:52:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-02-07 10:52:48 [scrapy.core.engine] INFO: Spider opened
2020-02-07 10:52:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=> from <GET https://www.zhipin.com/c101280100/>
2020-02-07 10:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x105c93640>
[s]   item       {}
[s]   request    <GET https://www.zhipin.com/c101280100/>
[s]   response   <200 https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=>
[s]   settings   <scrapy.settings.Settings object at 0x105c93400>
[s]   spider     <DefaultSpider 'default' at 0x1061578b0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
request    <GET https://www.zhipin.com/c101280100/>
response   <200 https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=>

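A 200 status means the request itself succeeded,
but the response URL does not match the request URL: the request was redirected (302) to a security-check page, so the site is most likely applying an anti-scraping measure.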
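
The URL comparison described above can be sketched as a small check using only the standard library. The function name redirected_elsewhere is hypothetical, and the security-check URL below is abbreviated from the log (only the callbackUrl parameter is kept).

```python
from urllib.parse import urlsplit


def redirected_elsewhere(request_url: str, response_url: str) -> bool:
    """Return True when the response came from a different host or path."""
    req, resp = urlsplit(request_url), urlsplit(response_url)
    return (req.netloc, req.path) != (resp.netloc, resp.path)


requested = "https://www.zhipin.com/c101280100/"
# Abbreviated form of the redirect target seen in the log output.
received = (
    "https://www.zhipin.com/web/common/security-check.html"
    "?callbackUrl=%2Fc101280100%2F"
)

print(redirected_elsewhere(requested, received))
print(redirected_elsewhere("https://www.zhipin.com/", "https://www.zhipin.com/"))
```

In the shell, the same check would be redirected_elsewhere(request.url, response.url).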

We will not deal with that here; let's switch to another URL:

$ scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/
[s]   request    <GET https://www.zhipin.com/>
[s]   response   <200 https://www.zhipin.com/>

Extracting data

Extracting with XPath

response.xpath('//div[@class="text"]/a/text()').extract()
['後端開發', 'Java', 'C++', 'PHP', '數據挖掘', 'C', 'C#', '.NET', 'Hadoop', 'Python', 'Delphi', 'VB', 'Perl', 'Ruby', 'Node.js', '搜索算法', 'Golang', '推薦算法', 'Erlang', '算法工程師', '語音/視頻/圖形開發', '數據採集', ... (truncated)]

The full output is too long to list here.

[Figure: the most practical XPath shorthand forms]

Extracting with CSS selectors

response.css('div.menu-sub div.text').extract()
['<div class="text">\n                                      <a ka="search_100199" href="/c101010100-p100199/">後端開發</a>\n                                                
<a ka="search_100101" href="/c101010100-p100101/">Java</a>\n<a ka="search_100102" href="/c101010100-p100102/">C++</a>\n <a ka="search_100103" href="/c101010100-p100103/">PHP</a>\n
<a ka="search_100104" href="/c101010100-p100104/">數據挖掘</a>\n                                                
<a ka="search_100105" href="/c101010100-p100105/">C</a>\n
... (truncated)