Trying a crawl with the Scrapy shell
$ scrapy shell https://www.zhipin.com/c101280100/
2020-02-07 10:42:20 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 10:42:20 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-02-07 10:42:20 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0}
2020-02-07 10:42:20 [scrapy.extensions.telnet] INFO: Telnet Password: e782852fc4dca748
2020-02-07 10:42:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 10:42:20 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-02-07 10:42:20 [scrapy.core.engine] INFO: Spider opened
2020-02-07 10:42:21 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101280100/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x1048ed820>
[s] item {}
[s] request <GET https://www.zhipin.com/c101280100/>
[s] response <403 https://www.zhipin.com/c101280100/>
[s] settings <scrapy.settings.Settings object at 0x1048ed3d0>
[s] spider <DefaultSpider 'default' at 0x104db0490>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
- Available Scrapy objects: the objects you can use inside the Scrapy shell
- scrapy: the scrapy module itself (contains scrapy.Request, scrapy.Selector, etc.)
- crawler: the current Crawler object
- item: the scraped item
- request: the request sent to the site
- response: the response received from the server (the 403 here means we did not get a real page back; a 403 from the target site indicates it has anti-crawling measures in place, whereas a successful response would be 200)
- settings: the settings of the current project
- spider: the default spider
Solution:
The usual remedies are:
- Browser impersonation
- Simulated login
Since we are only debugging in the shell here, browser impersonation is the way to go:
- To make Scrapy impersonate a browser, set the User-Agent header when sending the request, giving it the same value a real browser would send.
- Look up that value in a real browser (e.g. in the developer tools' Network panel).
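In a full Scrapy project (as opposed to the one-off shell flag used below), the same impersonation is usually configured once in settings.py. A minimal sketch; the UA string below is only an illustrative example, not one taken from this session:

```python
# settings.py (project-wide)
# Example UA string only -- replace it with the one your own browser sends,
# copied from the developer tools' Network panel.
USER_AGENT = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/79.0.3945.130 Safari/537.36'
)
```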
Exit the current shell with Control + D (or exit()); note that Control + Z merely suspends the process rather than terminating it.
Then run:
$ scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/c101280100/
2020-02-07 10:52:48 [scrapy.utils.log] INFO: Scrapy 1.8.0 started (bot: scrapybot)
2020-02-07 10:52:48 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.10.0, Python 3.8.0 (v3.8.0:fa919fdf25, Oct 14 2019, 10:23:27) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform macOS-10.15.2-x86_64-i386-64bit
2020-02-07 10:52:48 [scrapy.crawler] INFO: Overridden settings: {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0'}
2020-02-07 10:52:48 [scrapy.extensions.telnet] INFO: Telnet Password: 0a8453c96491ca13
2020-02-07 10:52:48 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-02-07 10:52:48 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2020-02-07 10:52:48 [scrapy.core.engine] INFO: Spider opened
2020-02-07 10:52:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=> from <GET https://www.zhipin.com/c101280100/>
2020-02-07 10:52:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x105c93640>
[s] item {}
[s] request <GET https://www.zhipin.com/c101280100/>
[s] response <200 https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=>
[s] settings <scrapy.settings.Settings object at 0x105c93400>
[s] spider <DefaultSpider 'default' at 0x1061578b0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
request <GET https://www.zhipin.com/c101280100/>
response <200 https://www.zhipin.com/web/common/security-check.html?seed=5SNeLJwqAobUjABFPCWdhH%2FqcVhQYmT7DvOmFXMjb%2B8%3D&name=a38cd86c&ts=1581043939571&callbackUrl=%2Fc101280100%2F&srcReferer=>
The 200 status means the fetch succeeded,
but the response URL does not match the request URL: we were redirected to a security-check page, another anti-crawling measure.
We will leave that alone for now and switch to a different URL:
$ scrapy shell -s USER_AGENT='Mozilla/5.0' https://www.zhipin.com/
[s] request <GET https://www.zhipin.com/>
[s] response <200 https://www.zhipin.com/>
Extracting data
Extraction with XPath
response.xpath('//div[@class="text"]/a/text()').extract()
['後端開發', 'Java', 'C++', 'PHP', '數據挖掘', 'C', 'C#', '.NET', 'Hadoop', 'Python', 'Delphi', 'VB', 'Perl', 'Ruby', 'Node.js', '搜索算法', 'Golang', '推薦算法', 'Erlang', '算法工程師', '語音/視頻/圖形開發', '數據採集', ...]
The list is too long to write out in full.
This is the most practical shorthand form of an XPath extraction.
Extraction with CSS selectors
response.css('div.menu-sub div.text').extract()
['<div class="text">\n <a ka="search_100199" href="/c101010100-p100199/">後端開發</a>\n
<a ka="search_100101" href="/c101010100-p100101/">Java</a>\n<a ka="search_100102" href="/c101010100-p100102/">C++</a>\n <a ka="search_100103" href="/c101010100-p100103/">PHP</a>\n
<a ka="search_100104" href="/c101010100-p100104/">數據挖掘</a>\n
<a ka="search_100105" href="/c101010100-p100105/">C</a>\n
... (omitted)