Python Web Crawlers (6): lxml

XPath

XPath (XML Path Language) is a language for addressing parts of an XML document. It can be used to traverse and look up elements in XML/HTML documents.

Sample XML snippet

<?xml version="1.0" encoding="UTF-8"?>
 
<bookstore>
 
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
 
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
 
</bookstore>

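Nodes

XPath has seven kinds of nodes: element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes. An XML document is treated as a tree of nodes, and the root of the tree is called the document node or root node.

In the XML document above: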

<bookstore> (root element node; the document node is its parent)

<title lang="eng">Harry Potter</title> (element node)

lang="eng" (attribute node)

Harry Potter (text node)
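
To make the node kinds concrete, the following sketch loads the sample document with lxml and picks out one node of each kind listed above. The variable names are only illustrative, and the XML is passed as bytes because lxml does not accept a Unicode string that carries an encoding declaration.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)                          # root element node
title = root.xpath('/bookstore/book[1]/title')[0]     # element node
lang = title.xpath('@lang')[0]                        # attribute node (returned as a string)
text = title.xpath('text()')[0]                       # text node (returned as a string)

print(root.tag)    # bookstore
print(title.tag)   # title
print(lang)        # eng
print(text)        # Harry Potter
```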

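Selecting nodes

Expression	Description	Path expression	Result
nodename	Selects the child elements of the current node that are named nodename	bookstore	Selects all bookstore child elements of the current node
/	At the start of a path, selects from the root node	/bookstore	Selects the root element bookstore
/	Elsewhere, selects the children of the preceding node	bookstore/book	Selects all book elements that are children of bookstore
//	At the start of a path, selects matching nodes anywhere in the document	//book	Selects all book elements, wherever they are in the document
//	Elsewhere, selects matching descendants of the preceding node at any depth	bookstore//book	Selects all book elements that are descendants of bookstore, wherever they sit below it
@	Selects attributes	//@lang	Selects all attributes named lang
.	Selects the current node	./a	Selects the a elements that are children of the current node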
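
As a quick check of the table, the expressions can be run against the sample bookstore document. This is a minimal sketch that assumes lxml is installed; relative paths are evaluated from the root element, which is the current node here.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)

books = root.xpath('book')          # nodename: book children of the current node
top = root.xpath('/bookstore')      # absolute path starting at the root
all_books = root.xpath('//book')    # book elements anywhere in the document
langs = root.xpath('//@lang')       # every lang attribute
titles = [t.text for t in root.xpath('./book/title')]  # path relative to the current node

print(len(books), top[0].tag, len(all_books))
print(langs)
print(titles)
```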

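Predicates

A predicate finds a specific node, or a node that contains a specific value. Predicates are written inside square brackets.

Path expression	Result
/bookstore/book[1]	Selects the first book element that is a child of bookstore
/bookstore/book[last()]	Selects the last book element that is a child of bookstore
/bookstore/book[last()-1]	Selects the second-to-last book element that is a child of bookstore
/bookstore/book[position()<3]	Selects the first two book elements that are children of bookstore
//title[@lang]	Selects all title elements that have an attribute named lang
//title[@lang='eng']	Selects all title elements whose lang attribute has the value eng
/bookstore/book[price>35.00]	Selects all book children of bookstore whose price element has a value greater than 35.00
/bookstore/book[price>35.00]//title	Selects all title elements inside the book children of bookstore whose price element has a value greater than 35.00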
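
The predicates above can be tried on the sample document as follows. This is an illustrative sketch; note that XPath positions start at 1, not 0.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)

first = root.xpath('/bookstore/book[1]/title/text()')
last = root.xpath('/bookstore/book[last()]/title/text()')
second_last = root.xpath('/bookstore/book[last()-1]/title/text()')
first_two = root.xpath('/bookstore/book[position()<3]')
with_lang = root.xpath("//title[@lang='eng']")
expensive = root.xpath('/bookstore/book[price>35.00]/title/text()')

print(first)        # ['Harry Potter']
print(last)         # ['Learning XML']
print(second_last)  # ['Harry Potter']
print(len(first_two), len(with_lang))
print(expensive)    # ['Learning XML']
```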

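Wildcards

Wildcard	Description
*	Matches any element node
@*	Matches any attribute node
node()	Matches a node of any kind

Path expression	Result
/bookstore/*	Selects all child elements of bookstore
//*	Selects all elements in the document
//title[@*]	Selects all title elements that have at least one attribute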
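
A short sketch of the wildcards on the sample document. It assumes lxml's usual behaviour of returning text nodes as strings and element nodes as element objects, which is why node() yields a mixed list.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)

children = [e.tag for e in root.xpath('/bookstore/*')]   # child elements of bookstore
all_tags = {e.tag for e in root.xpath('//*')}            # every element in the document
with_attr = root.xpath('//title[@*]')                    # title elements with any attribute
attr_values = root.xpath('//@*')                         # every attribute value
nodes = root.xpath('/bookstore/book[1]/node()')          # elements and whitespace text nodes

print(children)
print(sorted(all_tags))
print(len(with_attr), attr_values)
```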

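Selecting several paths

The | operator selects several paths at once.

Path expression	Result
//book/title | //book/price	Selects all title and price elements of book elements
//title | //price	Selects all title and price elements in the document
/bookstore/book/title | //price	Selects all title elements of the book elements under bookstore, plus all price elements in the document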
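
For illustration, a union can be evaluated on the sample document like this. The sketch collects tag and text pairs and sorts them before printing, so it does not rely on any particular result order.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)

# Union of two paths: every title and every price under a book element
both = root.xpath('//book/title | //book/price')
pairs = sorted((e.tag, e.text) for e in both)

for tag, text in pairs:
    print(tag, text)
```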

Operators

The operators available in XPath are:

Operator	Description	Example	Return value
|	Computes the union of two node-sets	//book | //cd	Returns a node-set containing all book and cd elements
+	Addition	6 + 4	10
-	Subtraction	6 - 4	2
*	Multiplication	6 * 4	24
div	Division	8 div 4	2
=	Equal	price=9.80	true if price is 9.80; false if price is 9.90
!=	Not equal	price!=9.80	true if price is 9.90; false if price is 9.80
<	Less than	price<9.80	true if price is 9.00; false if price is 9.90
<=	Less than or equal to	price<=9.80	true if price is 9.00; false if price is 9.90
>	Greater than	price>9.80	true if price is 9.90; false if price is 9.80
>=	Greater than or equal to	price>=9.80	true if price is 9.90; false if price is 9.70
or	Logical or	price=9.80 or price=9.70	true if price is 9.80; false if price is 9.50
and	Logical and	price>9.00 and price<9.90	true if price is 9.80; false if price is 8.50
mod	Remainder of a division	5 mod 2	1
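
The operators can also be evaluated directly through lxml. In this sketch, arithmetic expressions come back as Python floats and comparisons as booleans; the prices are those of the sample bookstore document rather than the 9.80 values used in the table.

```python
from lxml import etree

xml = b'''<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
<book>
  <title lang="eng">Harry Potter</title>
  <price>29.99</price>
</book>
<book>
  <title lang="eng">Learning XML</title>
  <price>39.95</price>
</book>
</bookstore>'''

root = etree.fromstring(xml)

total = root.xpath('6 + 4')          # arithmetic yields a float
quotient = root.xpath('8 div 4')
remainder = root.xpath('5 mod 2')
is_pricey = root.xpath('/bookstore/book[2]/price > 35')   # comparison yields a bool
mid_range = root.xpath('//book[price>29 and price<35]/title/text()')
either = root.xpath('//book[price=29.99 or price=39.95]')

print(total, quotient, remainder)    # 10.0 2.0 1.0
print(is_pricey)                     # True
print(mid_range)                     # ['Harry Potter']
print(len(either))                   # 2
```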

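lxml

The lxml library parses XML/HTML documents and extracts data from them. When the HTML is malformed, lxml fills in the missing pieces automatically.

Reading XML/HTML data from a string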

from lxml import etree

text = '''
<div>
    <ul>
         <li class="a"><a href="aaa.html">aaa</a></li>
         <li class="b"><a href="bbb.html">bbb</a></li>
         <li class="c"><a href="ccc.html">ccc</a></li>
         <li class="b"><a href="ddd.html">ddd</a></li>
         <li class="a"><a href="eee.html">eee</a>
     </ul>
 </div>
'''

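# etree.HTML parses the string with the HTML parser and returns the root element
html = etree.HTML(text)
# tostring returns bytes, so decode before printing
result = etree.tostring(html)
print(result.decode('utf-8'))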

Output:

<html><body><div>
    <ul>
         <li class="a"><a href="aaa.html">aaa</a></li>
         <li class="b"><a href="bbb.html">bbb</a></li>
         <li class="c"><a href="ccc.html">ccc</a></li>
         <li class="b"><a href="ddd.html">ddd</a></li>
         <li class="a"><a href="eee.html">eee</a>
     </li></ul>
 </div>
</body></html>

The code prints the original markup and fills in what was missing: it closes the last li tag and adds the body and html tags.
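
The repair can also be checked programmatically rather than by reading the printed markup. The sketch below uses a shortened version of the list (an assumption made to keep the example small) and shows that etree.tostring returns bytes by default and a str when encoding='unicode' is passed.

```python
from lxml import etree

# Shortened sample: the second li is deliberately left unclosed
text = '''
<div>
    <ul>
         <li class="a"><a href="aaa.html">aaa</a></li>
         <li class="a"><a href="eee.html">eee</a>
     </ul>
 </div>
'''

html = etree.HTML(text)
raw = etree.tostring(html)                           # bytes
as_text = etree.tostring(html, encoding='unicode')   # str

print(type(raw).__name__, type(as_text).__name__)    # bytes str
print(html.tag, html[0].tag)                         # html body
print(as_text.count('</li>'))                        # 2
```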

Reading XML data from a file

Save the earlier text to a file named text.html, then read the data back from that file with lxml.

from lxml import etree

html = etree.parse('text.html')
result = etree.tostring(html)
print(result.decode('utf-8'))

Output:

<div>
    <ul>
         <li class="a"><a href="aaa.html">aaa</a></li>
         <li class="b"><a href="bbb.html">bbb</a></li>
         <li class="c"><a href="ccc.html">ccc</a></li>
         <li class="b"><a href="ddd.html">ddd</a></li>
         <li class="a"><a href="eee.html">eee</a></li>
     </ul>
 </div>

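This reads XML data from a file, but etree.parse uses the XML parser by default, so the missing closing tag has to be added first and the file must be well-formed XML. Otherwise an XMLSyntaxError is raised.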
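
If repairing the file by hand is not practical, one option is to pass an etree.HTMLParser to etree.parse so that the same lenient HTML parsing is applied. The sketch below is not part of the original walkthrough; it uses an in-memory io.BytesIO object in place of text.html so that it runs without creating a file.

```python
import io
from lxml import etree

# Malformed markup: the li element is never closed
broken = b'<div><ul><li class="a"><a href="eee.html">eee</a></ul></div>'

# The default XML parser rejects the mismatched tags
try:
    etree.parse(io.BytesIO(broken))
    strict_ok = True
except etree.XMLSyntaxError:
    strict_ok = False

# The HTML parser repairs the markup instead of raising
parser = etree.HTMLParser()
tree = etree.parse(io.BytesIO(broken), parser)
hrefs = tree.xpath('//li/a/@href')

print(strict_ok)             # False
print(tree.getroot().tag)    # html
print(hrefs)                 # ['eee.html']
```

For a real file, the same call would read etree.parse('text.html', parser).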

Using XPath syntax

from lxml import etree

html = etree.parse('text.html')

# Get all li tags
result = html.xpath('//li')
for label in result:
    print(etree.tostring(label).decode('utf-8').strip())
print('**********************')

# Get the class attribute value of every li tag
result = html.xpath('//li/@class')
for label in result:
    print(label)
print('**********************')

# Get the a tags under li whose href is ccc.html
result = html.xpath("//li/a[@href = 'ccc.html']")
for label in result:
    print(etree.tostring(label).decode('utf-8'))
print('**********************')

# Get all span tags under li (there are none, so nothing is printed)
result = html.xpath("//li//span")
for label in result:
    print(etree.tostring(label).decode('utf-8'))
print('**********************')

# Get every href of the a tags under li
result = html.xpath("//li/a//@href")
for label in result:
    print(label)
print('**********************')

# Get the href of the a tag under the last li tag
result = html.xpath("//li[last()]/a/@href")
for label in result:
    print(label)
print('**********************')

# Get the text of the second-to-last li element via the .text attribute
result = html.xpath("//li[last()-1]/a")
for label in result:
    print(label.text)
print('**********************')

# Get the same text with text() in the expression; the result is a list
result = html.xpath("//li[last()-1]/a/text()")
print(result)

Output:

<li class="a"><a href="aaa.html">aaa</a></li>
<li class="b"><a href="bbb.html">bbb</a></li>
<li class="c"><a href="ccc.html">ccc</a></li>
<li class="b"><a href="ddd.html">ddd</a></li>
<li class="a"><a href="eee.html">eee</a></li>
**********************
a
b
c
b
a
**********************
<a href="ccc.html">ccc</a>
**********************
**********************
aaa.html
bbb.html
ccc.html
ddd.html
eee.html
**********************
eee.html
**********************
ddd
**********************
['ddd']

Note the difference between / and //: / selects direct children only, while // selects descendants at any depth.
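
The difference shows up as soon as an element is nested one level deeper. In this small sketch the markup is made up for the purpose: the second link is wrapped in a span, so it is a descendant of its li but not a direct child.

```python
from lxml import etree

# Hypothetical markup: the second a element sits inside a span
text = ('<div><ul>'
        '<li><a href="x.html">x</a></li>'
        '<li><span><a href="y.html">y</a></span></li>'
        '</ul></div>')
html = etree.HTML(text)

direct = html.xpath('//li/a/@href')      # a must be a direct child of li
any_depth = html.xpath('//li//a/@href')  # a may be at any depth below li

print(direct)      # ['x.html']
print(any_depth)   # ['x.html', 'y.html']
```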

Example

import requests
from lxml import etree

MAX_PAGES = 50
MAJOR_URL = "https://www.ygdy8.net"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36',
    'Referer': 'https://www.ygdy8.net/html/gndy/dyzz/index.html'
}

# XPath expressions for each film's link, link text, font text and detail text
film_url_xpath = "//div[@class = 'co_content8']//tr/td[2]//a/@href"
film_text_xpath = "//div[@class = 'co_content8']//tr/td[2]//a/text()"
film_data_xpath = "//div[@class = 'co_content8']//tr/td[2]//font/text()"
film_detail_xpath = "//div[@class = 'co_content8']//tr/td[@colspan]//text()"


def index_url(index):
    # Build the URL of list page number `index`, e.g. list_23_1.html
    base_url = "https://www.ygdy8.net/html/gndy/dyzz/list_23_.html"
    return base_url.replace('.html', str(index) + '.html')


if __name__ == '__main__':
    # The with block closes the file on exit, so no explicit close() is needed
    with open('dyttfilm.txt', 'w', encoding='utf-8') as fp:
        for i in range(MAX_PAGES):
            url = index_url(i + 1)
            response = requests.get(url, headers=headers)
            # Decode the page as GB2312 and drop bytes that cannot be decoded
            html = etree.HTML(response.content.decode('gb2312', 'ignore'))
            film_url = html.xpath(film_url_xpath)
            film_text = html.xpath(film_text_xpath)
            film_data = html.xpath(film_data_xpath)
            film_detail = html.xpath(film_detail_xpath)
            # Assumes the four lists line up one-to-one on every page
            for j in range(len(film_url)):
                item = MAJOR_URL + film_url[j] + '    ' + film_text[j] + '    ' + film_data[j].split('\n')[0] + '    ' + film_detail[j] + '\n\n'
                fp.write(item)

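The program above scrapes film information from the 電影天堂 (www.ygdy8.net) listing pages. Only the first 50 pages are saved here. On page 97, one film has no film_detail entry, so the four lists no longer line up one-to-one; keep this in mind when scraping more pages.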
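
One possible way to cope with a missing field is to query each film's container separately with relative XPath expressions and fall back to a default when a field is absent. The sketch below runs on a made-up HTML snippet: the table-per-film layout, the sample values and the helper first() are assumptions for illustration, not the site's confirmed structure.

```python
from lxml import etree

# Hypothetical markup: one table per film, the second one has no detail row
sample = '''
<div class="co_content8">
  <table>
    <tr><td>1</td><td><b><a href="/film/1.html">Film A</a></b></td></tr>
    <tr><td>1</td><td><font>date A</font></td></tr>
    <tr><td colspan="2">Detail A</td></tr>
  </table>
  <table>
    <tr><td>2</td><td><b><a href="/film/2.html">Film B</a></b></td></tr>
    <tr><td>2</td><td><font>date B</font></td></tr>
  </table>
</div>
'''


def first(nodes, default=''):
    # Return the first XPath string result, stripped, or a default if there is none
    return nodes[0].strip() if nodes else default


html = etree.HTML(sample)
records = []
for table in html.xpath("//div[@class='co_content8']//table"):
    # The leading dot keeps each query inside the current table
    records.append({
        'url': first(table.xpath(".//tr/td[2]//a/@href")),
        'title': first(table.xpath(".//tr/td[2]//a/text()")),
        'date': first(table.xpath(".//tr/td[2]//font/text()")),
        'detail': first(table.xpath(".//tr/td[@colspan]//text()")),
    })

for record in records:
    print(record)
```

Because each record is built from its own container, a film without a detail row gets an empty string instead of shifting the details of every later film.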
