[Crawler] (Scrapy) Notes on concepts and problems from learning Scrapy

Overview

XPath

Using XPath in the browser

F12 -> Console:

> $x("<xpath>")

Example:

<div class="grade-box clearfix">
    <dl>...omitted...</dl>
    <dl>
        <dd title="60852">
                 6萬+         </dd>
    </dl>
    <dl>...omitted...</dl>
</div>

Get the element's raw content (for example "6萬+", i.e. "60k+"), surrounding whitespace included:
> $x('//div[@class="grade-box clearfix"]//dl[2]//dd')[0]["innerHTML"]
```
"
            6萬+            "
```

Get the element's text (for example "6萬+", but without the surrounding whitespace):

> $x('//div[@class="grade-box clearfix"]//dl[2]//dd')[0]["innerText"]
"6萬+"

Get an attribute of the element:

> $x('//div[@class="grade-box clearfix"]//dl[2]//dd/@title')[0]["textContent"]
"60852"
> $x('//div[@class="grade-box clearfix"]//dl[2]//dd/@title')[0]["value"]
"60852"

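Summary:

When a step in the XPath matches several nodes, pick one with an index such as [1], [2], [3], ... (indices start at 1!).
To get an attribute, append /@ and the attribute name to the element's XPath; the full form is //*//<xpath>/@<attribute>.
When the XPath selects nodes, $x() always returns an array.
If the XPath is precise, the node we want is usually at [0], and it is the only item in the array.
$x()[0] is a DOM node whose properties can be read much like dictionary entries: $x()[0]["key"].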

Using XPath on a Scrapy `response` - n/a

n/a

Common problems

AttributeError: 'str' object has no attribute 'iter'

I ran into this error when using rules + Rule(LinkExtractor(...), ...).

Cause 1:

restrict_xpaths in LinkExtractor expects an XPath that points to elements, which means it must not end in an attribute step such as ***/@<attr>.
To extract an attribute, name it in LinkExtractor's attrs parameter instead.

Deploying to Scrapyd

$ pip install scrapyd
$ pip install scrapyd-client
$ pwd
<path>/<to>/<project>
$ vim ./scrapy.cfg
#### uncomment url
[deploy]
url = http://localhost:6800/
project = posts
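
Once the project is deployed, a run can be requested through Scrapyd's schedule.json endpoint. The sketch below only reads the [deploy] settings shown above and builds the request; it sends nothing. The spider name "example_spider" is a placeholder, since these notes do not name a spider.

```python
import configparser
from urllib.parse import urlencode, urljoin
from urllib.request import Request

# Same [deploy] section as in scrapy.cfg above, inlined so the sketch is self-contained.
CFG_TEXT = """
[deploy]
url = http://localhost:6800/
project = posts
"""


def read_deploy_target(cfg_text):
    """Return (url, project) from the [deploy] section of a scrapy.cfg."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.get("deploy", "url"), parser.get("deploy", "project")


def build_schedule_request(base_url, project, spider):
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    return Request(urljoin(base_url, "schedule.json"), data=data, method="POST")


url, project = read_deploy_target(CFG_TEXT)
# "example_spider" is a hypothetical spider name.
request = build_schedule_request(url, project, "example_spider")
print(request.get_method(), request.full_url, request.data)
```

Passing the request to urllib.request.urlopen() would actually schedule the run; that needs a running Scrapyd and an already deployed project.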

Problems

builtins.AttributeError: 'int' object has no attribute 'splitlines'

Reference: https://blog.csdn.net/qq_29719097/article/details/89431234

Install the versions that scrapyd (scrapyd-client) supports:

Scrapy==1.6.0
Twisted==18.9.0

$ pip uninstall twisted
$ pip uninstall scrapy
$ pip install twisted==18.9.0
$ pip install scrapy==1.6.0
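
After reinstalling, it can help to confirm that the environment really has the pinned versions. A small sketch using only the standard library (importlib.metadata, Python 3.8+); the helper name find_mismatches is my own.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the list above.
PINNED = {"Scrapy": "1.6.0", "Twisted": "18.9.0"}


def find_mismatches(pinned, lookup=version):
    """Return {name: (installed, expected)} for packages that differ from the pin.

    `lookup` maps a package name to its installed version; it defaults to
    importlib.metadata.version and can be swapped out for testing.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (installed, expected)
    return mismatches


if __name__ == "__main__":
    # An empty dict means both packages match the pinned versions.
    print(find_mismatches(PINNED))
```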

scrapydweb – enhancing Scrapyd

n/a

Unit tests - TODO

n/a

Distributed crawling - TODO

n/a
