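Web Scraping Series: Parsing Libraries

Overview

In the earlier examples we used regular expressions to extract information, but regular expressions are complex and easy to get wrong, and a single mistake can mean the pattern no longer matches what we want. So in this post I will introduce another way to extract information: parsing libraries.
A node in a web page can define id, class, or other attributes, and nodes are arranged in a hierarchy, so one or more nodes can be located with XPath or CSS selectors. When parsing a page, we can therefore use XPath or a CSS selector to pick out a node and then call the appropriate method to get its text content or attributes, which lets us extract any information we want.
Python has many powerful parsing libraries. The commonly used ones are lxml, Beautiful Soup, and pyquery. I mainly use lxml and Beautiful Soup, so this post covers those two; I will write a separate post on pyquery when the need arises.

1. XPath

XPath, short for XML Path Language, is a language for finding information in XML documents. It was originally designed for searching XML documents, but it works equally well on HTML documents. You can think of the web page as a map and the information you need as a point on that map, which you find by searching for its location.
So when writing a scraper we can use XPath for information extraction. Let's start with the basic usage of XPath.

Common XPath rules

Here are some commonly used matching rules:

nodename  selects all child nodes of the named node
/         selects direct children of the current node
//        selects descendants of the current node
.         selects the current node
..        selects the parent of the current node
@         selects attributes

Here is a concrete example of how an XPath expression is written: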

//title[@class='item']

This XPath rule selects all nodes named title whose class attribute has the value item.
With that rough picture in mind, we can now try it out with a Python parsing library.
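
To make the rule concrete, here is a minimal sketch that runs the same expression with lxml. The XML snippet and its contents are invented for illustration; they are not part of the examples used later in this post.

```python
from lxml import etree

# Hypothetical XML document, invented for this illustration.
doc = etree.fromstring(
    '<root>'
    '<title class="item">kept</title>'
    '<title class="other">skipped</title>'
    '<section><title class="item">nested</title></section>'
    '</root>'
)

# // searches at any depth; [@class='item'] keeps only nodes whose
# class attribute is exactly "item".
matches = doc.xpath("//title[@class='item']/text()")
print(matches)
```

Both the top-level and the nested title with class item are returned, while the title with class other is filtered out.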

Example

The library commonly used for XPath parsing in Python is lxml. Below I declare a piece of HTML text to use for testing:

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

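We first import the etree module from lxml, declare a piece of HTML text, and pass it to etree.HTML() to initialize it, which gives us an object that can be queried with XPath. Note that the last li node in the HTML text is not closed; the etree module repairs the HTML automatically.

Calling tostring() outputs the repaired HTML, but the result is of type bytes, so we use decode() to convert it to str. In practice this repair rarely comes into play, because most web pages are already complete. Still, here is the repaired result: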

<html><body><div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </li></ul>
 </div>
</body></html>

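Only after this repair can the document be parsed. We can also parse a text file directly, as shown below; you can download any web page to use as a test file: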

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())
result = etree.tostring(html)
print(result.decode('utf-8'))

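I won't show the output here: it is just a block of HTML text with one extra DOCTYPE declaration compared with the result above, which has no effect on parsing.

All nodes

Using the same HTML text as above (result is the serialized HTML from the first example), we can select all nodes like this: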

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//*')
print(result1)

The returned result is:

[<Element html at 0x162425e3e48>, <Element body at 0x1624272ad48>, <Element div at 0x162426ad508>, <Element ul at 0x16242758488>, <Element li at 0x16242758288>, <Element a at 0x16242758848>, <Element li at 0x16242758888>, <Element a at 0x162427588c8>, <Element li at 0x16242758908>, <Element a at 0x16242758808>, <Element li at 0x16242758948>, <Element a at 0x16242758988>, <Element li at 0x162427589c8>, <Element a at 0x16242758a08>]

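Here * matches every node, so all nodes in the HTML text are retrieved. The return value is a list, and every node is included in it.

Specific nodes

The match can also name a specific node. For example, to get all a tags: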

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//a')
print(result1)

The output is:

[<Element a at 0x16242758c48>, <Element a at 0x16242758c88>, <Element a at 0x16242758cc8>, <Element a at 0x16242758d08>, <Element a at 0x16242758d48>]

Child nodes

We can use / or // to find an element's child nodes or descendant nodes. Suppose we want to select all direct a children of the li nodes:

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//li/a')
print(result1)

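The output is again all the direct a child nodes. //li selects all li nodes and /a selects every direct child a of those li nodes; combined, they select all a nodes that are direct children of an li node.
The / here selects direct children. To get all descendants, use //. For example, to get all a nodes that are descendants of the ul node: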

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//ul//a')
print(result1)

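The result is the same as above. But //ul/a returns nothing, because ul has no direct a child nodes.
So pay attention to the difference between / and //: / gets direct children, while // gets descendants.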
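
The difference can be checked side by side. The sketch below uses a small HTML fragment made up for this purpose, in which one a sits directly under its li and another is wrapped in a span:

```python
from lxml import etree

# Hypothetical fragment: the second link is nested one level deeper.
html = etree.HTML(
    '<ul>'
    '<li><a href="a.html">A</a></li>'
    '<li><span><a href="b.html">B</a></span></li>'
    '</ul>'
)

direct = html.xpath('//li/a/@href')        # only a nodes that are direct children of li
descendants = html.xpath('//li//a/@href')  # a nodes at any depth below li
under_ul = html.xpath('//ul/a')            # ul has no direct a children

print(direct, descendants, under_ul)
```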

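Parent nodes

We know that chaining / and // finds child or descendant nodes. Going the other way, if we have a child node and want its parent, we can use .. to get it.
In the example below we want the parent of an a node and then that parent's class attribute: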

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//a[@href="link4.html"]/../@class')
print(result1)

The result is:

['item-1']

That is exactly what we wanted. We can also get the parent node with parent::, as follows:

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//a[@href="link4.html"]/parent::*/@class')
print(result1)

The result is the same.

Attribute matching

When selecting, we can also filter by attribute with the @ symbol. For example, to select the li nodes whose class is item-0:

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//li[@class="item-0"]')
print(result1)

The output is:

[<Element li at 0x20fc35e9948>, <Element li at 0x20fc35e9308>]

There are exactly two matches, just as expected.

Getting text

The XPath text() function gets the text inside a node. Let's try to get the text of the li nodes above:

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//li[@class="item-0"]/text()')
print(result1)

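But the output contains only a line break (with its trailing indentation). The direct child of each li is an a tag, so the only text that belongs directly to an li is the \n before the repaired closing tag. To get the text we actually want, write: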

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//li[@class="item-0"]/a/text()')
print(result1)

The output is exactly what we wanted:

['first item', 'fifth item']

There is another way to write this, which again comes down to the difference between // and /:

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//li[@class="item-0"]//text()')
print(result1)

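This has a drawback, though: the result is less tidy, because it also includes characters such as line breaks. If you want cleaner data, the first approach is recommended; otherwise the result looks something like this: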

['first item', 'fifth item', '\n     ']
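
If the // form is still needed, the stray whitespace can be cleaned up afterwards. The following is a small sketch of two common ways to do that, on a shortened copy of the test HTML; it is one possible approach, not the only one:

```python
from lxml import etree

# Shortened version of the test HTML; the last li is left unclosed on purpose.
text = '''<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>'''
html = etree.HTML(text)

raw = html.xpath('//li[@class="item-0"]//text()')

# Option 1: strip each string in Python and drop the empty ones.
cleaned = [t.strip() for t in raw if t.strip()]

# Option 2: let XPath collapse whitespace per li with normalize-space().
normalized = [li.xpath('normalize-space(.)')
              for li in html.xpath('//li[@class="item-0"]')]

print(cleaned)
print(normalized)
```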

Getting attributes

Sometimes, when scraping links from a page, the URL sits in the href attribute of a tag, and we need to extract the attribute value. Using the same example as before:

from lxml import etree
html = etree.HTML(result)
result1 = html.xpath('//li/a/@href')
print(result1)

The result is:

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']

This extracts the attribute values from several nodes at once.
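
Extracted href values are often relative, so a scraper usually has to resolve them against the page address before requesting them. Below is a minimal sketch using urljoin from the standard library; the base URL is a made-up placeholder.

```python
from urllib.parse import urljoin
from lxml import etree

# Hypothetical page address; replace it with the URL the HTML came from.
base_url = 'https://example.com/list/'

html = etree.HTML(
    '<ul>'
    '<li><a href="link1.html">first item</a></li>'
    '<li><a href="/top.html">second item</a></li>'
    '</ul>'
)

# Resolve every extracted href against the base URL.
links = [urljoin(base_url, href) for href in html.xpath('//li/a/@href')]
print(links)
```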

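Matching attributes with multiple values

Sometimes an attribute of a node has multiple values, and then we need the contains() function. The code can be rewritten as follows: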

from lxml import etree
text = '''
<li class="li li-first"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

The result is:

['first item']

This approach is commonly used when an attribute of a node has multiple values.
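
One caveat worth knowing: contains() is a plain substring test, so "li" also matches a class such as list-item. The sketch below contrasts it with a stricter whole-token test built from concat() and normalize-space(); the second li is invented to show the difference.

```python
from lxml import etree

html = etree.HTML(
    '<ul>'
    '<li class="li li-first"><a>first item</a></li>'
    '<li class="list-item"><a>second item</a></li>'  # hypothetical extra node
    '</ul>'
)

# Substring match: "list-item" contains "li", so both nodes match.
loose = html.xpath('//li[contains(@class, "li")]/a/text()')

# Whole-token match: pad the class value with spaces and look for " li ".
strict = html.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/a/text()'
)

print(loose)
print(strict)
```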

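Matching multiple attributes

Sometimes a node has to be identified by several attributes together, because a single attribute is not enough to pin it down. In that case we match several attributes at once, joined with the and operator: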

from lxml import etree
text = '''
<li class="li li-first" name="item"><a href="link.html">first item</a></li>
'''
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

The result is:

['first item']

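The li node here has two attributes, and both must match to identify it.
The and used here is an XPath operator. XPath has many operators; here are a few common ones:

or   logical or               age=19 or age=20    true if age is 19; false if age is 21
and  logical and              age>19 and age<21   true if age is 20; false if age is 18
|    union of two node sets   //book | //cd       returns a node set containing all book and cd elements

For more, see the official documentation.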
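
The or and | operators from the table can be tried out in the same way as and. This is a small sketch on a made-up list; the class names mirror the ones used elsewhere in this post.

```python
from lxml import etree

html = etree.HTML(
    '<ul>'
    '<li class="item-0">a</li>'
    '<li class="item-1">b</li>'
    '<li class="item-inactive">c</li>'
    '</ul>'
)

# or: a single predicate that accepts either class value.
either = html.xpath('//li[@class="item-0" or @class="item-inactive"]/text()')

# |: the union of two separate node sets.
union = html.xpath(
    '//li[@class="item-1"]/text() | //li[@class="item-inactive"]/text()'
)

print(either)
print(union)
```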

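Selecting by position

Sometimes the condition we select on matches several nodes but we want only one of them, such as the last one. For example, when looking for a "next page" link we may need to locate the last match. That is when selecting by position is needed.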

from lxml import etree
text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

The result is:

['first item']
['fifth item']
['first item', 'second item']
['third item']

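In the first selection we pick the first li node by putting 1 in the brackets. Note that positions start at 1, not at 0 as list indexes do.
The last node is selected with last().
last()-2 selects the third node from the end; likewise, the second from the end would be last()-1.
To select the nodes whose position is less than 3, use position()<3.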
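
A subtlety of positional predicates is that the position is counted among siblings under the same parent, not over the whole document. The sketch below, on a made-up page with two lists, shows how //li[1] differs from (//li)[1]:

```python
from lxml import etree

# Hypothetical page with two separate lists.
html = etree.HTML(
    '<div>'
    '<ul><li>a1</li><li>a2</li></ul>'
    '<ul><li>b1</li><li>b2</li></ul>'
    '</div>'
)

first_in_each = html.xpath('//li[1]/text()')      # first li under each parent
first_overall = html.xpath('(//li)[1]/text()')    # first li in the whole document
last_in_each = html.xpath('//li[last()]/text()')  # last li under each parent

print(first_in_each, first_overall, last_in_each)
```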

Additional usage

Besides these basics, XPath offers much more, such as the axes below:

from lxml import etree

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html"><span>first item</span></a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''
html = etree.HTML(text)
result = html.xpath('//li[1]/ancestor::*')
print(result)
result = html.xpath('//li[1]/ancestor::div')
print(result)
result = html.xpath('//li[1]/attribute::*')
print(result)
result = html.xpath('//li[1]/child::a[@href="link1.html"]')
print(result)
result = html.xpath('//li[1]/descendant::span')
print(result)
result = html.xpath('//li[1]/following::*[2]')
print(result)
result = html.xpath('//li[1]/following-sibling::*')

The result is:

[<Element html at 0x24aa34ea448>, <Element body at 0x24aa3500e48>, <Element div at 0x24aa351d248>, <Element ul at 0x24aa351d308>]
[<Element div at 0x24aa351d248>]
['item-0']
[<Element a at 0x24aa351d348>]
[<Element span at 0x24aa351d148>]
[<Element a at 0x24aa3500e48>]

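1. The ancestor axis gets all ancestor nodes. It is followed by two colons and then a node selector. Here we use *, which matches all nodes, so the result is every ancestor of the first li node: html, body, div, and ul. The second call narrows the selector to div, so only the div ancestor is returned.
2. The attribute axis gets attribute values. The selector is again *, which means all attributes of the node, so the return value is every attribute value of the li node.
3. The child axis gets all direct children. Here we add a condition and select the a node whose href attribute is link1.html.
4. The descendant axis gets all descendants. Here we add a condition that selects span nodes, so the result contains only the span node and not the a node.
5. The following axis gets all nodes after the current node. Although we match with *, we also add an index, so only the second following node is returned.
6. The following-sibling axis gets all sibling nodes after the current node. We match with *, so all following siblings are selected. The snippet above assigns this result without printing it, so it does not appear in the output.
For more on XPath syntax, see the official documentation.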
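
Printing Element objects says little about which nodes an axis returned, so it can help to ask for an attribute instead. Here is a short sketch of the sibling axes on a made-up list; preceding-sibling is not used in the example above and is included only for comparison.

```python
from lxml import etree

html = etree.HTML(
    '<ul>'
    '<li class="item-0">a</li>'
    '<li class="item-1">b</li>'
    '<li class="item-inactive">c</li>'
    '</ul>'
)

# Siblings that come after the first li.
after_first = html.xpath('//li[1]/following-sibling::li/@class')

# Siblings that come before the last li.
before_last = html.xpath('//li[last()]/preceding-sibling::li/@class')

print(after_first)
print(before_last)
```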

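2. Beautiful Soup

The previous section showed how to locate one or more nodes with XPath selectors and then call the appropriate method to get their text or attributes. Another tool works in a similar way, relying on the structure and hierarchy of the page: Beautiful Soup.

Overview

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work. (This passage is quoted from the official documentation.)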

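Parsers

Beautiful Soup relies on a parser to do the actual parsing. Besides the HTML parser in the Python standard library, it supports some third-party parsers such as lxml. The parsers Beautiful Soup supports are listed below.

Python standard library
    Usage: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python; moderate speed; tolerant of malformed documents
    Disadvantages: poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser
    Usage: BeautifulSoup(markup, "lxml")
    Advantages: fast; tolerant of malformed documents
    Disadvantages: requires installing a C library
lxml XML parser
    Usage: BeautifulSoup(markup, "xml")
    Advantages: fast; the only parser that supports XML
    Disadvantages: requires installing a C library
html5lib
    Usage: BeautifulSoup(markup, "html5lib")
    Advantages: best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format documents; does not depend on external extensions
    Disadvantages: slow

In practice the lxml parser is the usual choice: it can parse both HTML and XML, it is fast, and it tolerates malformed documents well. To use it, pass "lxml" as the second argument when initializing BeautifulSoup; everything else stays the same.
The rest of this post uses this parser.
Let's look at some concrete usage to get a feel for how it works.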

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

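The output has two parts: the repaired, complete version of the HTML above, printed with standard indentation, and the text of the title tag. The prettify() method outputs the parsed string with standard indentation. Note that the output contains complete body and html nodes even though the input never closed them; in other words, BeautifulSoup automatically corrects non-standard HTML strings. This correction is not done by prettify() but happens when the BeautifulSoup object is initialized.
We then call soup.title.string, which outputs the text content of the title node. So soup.title selects the title node in the HTML, and its string attribute gives the text inside.

Node selectors

Selecting elements

Using the same HTML text as above, we can select a node element simply by using its name as an attribute, and even extract its contents. This kind of selection is very fast. Here is how it is used: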

soup = BeautifulSoup(html, 'lxml')
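print(soup.title)  # print the selected title node
print(type(soup.title))  # print the type of the title node
print(soup.title.string)  # print the text inside the title node
print(soup.p)  # select the p tag; only the first match is returned, later ones are ignored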
print(soup.head)

The result is:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<head><title>The Dormouse's story</title></head>

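Extracting information

We can get a node's text through the string attribute, but sometimes we also need a node's attributes or its name. Node names are not needed very often, but I will cover them briefly, again using the HTML text above.
1. Getting the node name. To get the name of the p node: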

print(soup.p.name)

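The result, unsurprisingly, is p.
2. Getting attributes. A node may have several attributes, commonly class, href, id, and so on. After selecting a node element, use attrs to get all of its attributes: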

print(soup.p.attrs)  # returns a dict mapping attribute names to values
print(soup.p.attrs['name'])  # look up one attribute, just like a dict key

The result:

{'class': ['title'], 'name': 'dromouse'}
dromouse

This is a bit verbose, so we can simplify it. Instead of writing attrs, put square brackets directly after the node element and pass the attribute name to get its value.

print(soup.p['name'])
print(soup.p['class'])

The result:

dromouse
['title']

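This achieves the same thing, but pay attention to the type of the returned value: class comes back as a list while name comes back as a string, so handle the two formats accordingly.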
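
Here is a small sketch of handling the two return types, and of reading an attribute that may be absent. It uses the built-in html.parser only to keep the snippet free of extra dependencies, and the extra class value big is invented for illustration.

```python
from bs4 import BeautifulSoup

# html.parser is chosen here only to avoid a third-party dependency.
soup = BeautifulSoup('<p class="title big" name="dromouse">text</p>', 'html.parser')
p = soup.p

classes = p['class']             # multi-valued attribute: a list of strings
name = p['name']                 # ordinary attribute: a single string
class_text = ' '.join(classes)   # join the list back into one string if needed
missing = p.get('id')            # get() returns None instead of raising KeyError

print(classes, name, class_text, missing)
```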

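Getting content

As the earlier example showed, string extracts the text contained in a node element. For example, to get the text inside the first p node: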

print(soup.p.string)

The result is:

The Dormouse's story

Again, note that only the first match is returned; later ones are not.
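
Another thing to watch with string is that it returns None when a node has more than one child, in which case get_text() is the safer choice. A minimal sketch on two made-up paragraphs, again using html.parser only to avoid extra dependencies:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p>Hello <b>world</b></p><p><b>only child</b></p>',
    'html.parser',
)
first, second = soup.find_all('p')

mixed = first.string         # None: the p has a text child and a b child
single = second.string       # the p has exactly one child, so its text is returned
all_text = first.get_text()  # concatenates the text of every descendant

print(mixed, single, all_text)
```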

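Nested selection

From the above we know that each result is of type bs4.element.Tag, which can itself be used for further selection. For example, after selecting the head node element, we can go on to select the title node element inside it: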

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title)
print(type(soup.head.title))
print(soup.head.title.string)

The result is:

<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
The Dormouse's story

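We first select head and then select title on it, which gives us the title node element. Printing its type shows that it is still bs4.element.Tag. In other words, selecting on a Tag yields another Tag; the result type is the same every time, which is what makes nested selection possible.
Finally, we print its string attribute, which is the text inside the node.

Related selection

Sometimes the node element we want cannot be selected in a single step. We first have to select some node element and then, using it as a reference point, select its children, parent, siblings, and so on. This section shows how to select those node elements.

Selecting child or descendant nodes

After selecting a node element, use the contents attribute to get its direct children: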

from bs4 import BeautifulSoup
html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

The result:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']

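The result shows that everything returned is a direct child of the p tag. The grandchild span is not listed as a separate item (it appears only inside its parent a), which confirms that contents returns only direct children.
The children attribute gives the same result, except that it returns an iterator rather than a list, so we loop over it with for: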

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for child in soup.p.children:
    print(child)

The result is:

<list_iterator object at 0x000001667B0772B0>

            Once upon a time there were three little sisters; and their names were
            
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>


<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
 
            and
            
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

            and they lived at the bottom of a well.

The approach differs, but the end result is the same.

Getting all descendant nodes

To get all descendant nodes, use the descendants attribute:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for child in soup.p.descendants:
    print(child)

The result is:

<generator object descendants at 0x000001667ABFD4C0>

            Once upon a time there were three little sisters; and their names were
            
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>


<span>Elsie</span>
Elsie




<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Lacie
 
            and
            
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
Tillie

            and they lived at the bottom of a well.

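Parent and ancestor nodes

To get a node's parent and ancestors, use the parent and parents attributes. parent returns the node's parent, and parents returns a generator over the node's ancestors.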

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)  # returns only the direct parent
print(list(soup.a.parents))  # parents is a generator over the ancestors, so convert it to a list first

The result is:

<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
[<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>, <body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body>, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
            Once upon a time there were three little sisters; and their names were
            <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
</p>
<p class="story">...</p>
</body></html>]

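Sibling nodes

Sibling nodes are retrieved with four attributes. next_sibling and previous_sibling return the node's next and previous sibling, while next_siblings and previous_siblings return generators over all following and all preceding siblings, respectively.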

html = """
<html>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            Hello
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print('Next Sibling', soup.a.next_sibling)
print('Prev Sibling', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

The result is:

Next Sibling 
            Hello
            
Prev Sibling 
            Once upon a time there were three little sisters; and their names were
            
Next Siblings [(0, '\n            Hello\n            '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
Prev Siblings [(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

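Keep in mind that when a result is a generator, you should convert it to a list before doing further operations such as extracting attributes or text. Those work the same way as before, so I won't repeat them.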
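
As the output above shows, the siblings include plain text nodes as well as tags. When only the tags matter, they can be filtered out, as in this small sketch on a shortened, made-up version of the paragraph (html.parser is used only to avoid extra dependencies):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = ('<p><a id="link1">Elsie</a>\n Hello\n '
        '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, 'html.parser')
first_link = soup.a

# Keep only Tag siblings; text nodes such as "\n Hello\n " are skipped.
tag_ids = [s['id'] for s in first_link.next_siblings if isinstance(s, Tag)]

# find_next_sibling() skips text nodes when given a tag name.
next_link_id = first_link.find_next_sibling('a')['id']

print(tag_ids, next_link_id)
```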

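Method selectors

The selection methods so far all work through attributes. That is very fast, but it becomes cumbersome and inflexible for more complex selections. Fortunately, Beautiful Soup also provides query methods such as find_all() and find(); call them with the right arguments and you can query flexibly.

find_all()

As the name suggests, this method finds all elements that meet the given conditions. We pass it some arguments and get back the matching elements. Its signature is: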

find_all(name , attrs , recursive , text , **kwargs)

Let's start with some HTML text to test on:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

Following the signature, here are some concrete examples of how the parameters are used.
1. Getting nodes directly by name:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

The result is:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

2. Querying by attributes (attrs):

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'class': 'element'}))

The result is:

[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

3. Matching and extracting the text in nodes:

import re
html='''
<div class="panel">
    <div class="panel-body">
        <a>Hello, this is a link</a>
        <a>Hello, this is a link, too</a>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
print(soup.find_all('a')[0].text)

The result is:

['Hello, this is a link', 'Hello, this is a link, too']
Hello, this is a link

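There are two a nodes, each containing text. Here we pass the text parameter to find_all() as a compiled regular expression, and the result is a list of all node text that matches it. (Since Beautiful Soup 4.4.0 the same parameter is also available under the name string.) We can also extract the text directly with .text.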
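
Two more details of find_all() are easy to check with a short sketch: class_ matches a node if any one of its classes equals the given value, and limit caps the number of results. The HTML is a condensed variant of the test text above, and html.parser is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

html = (
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li>'
    '</ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Jay</li>'
    '</ul>'
)
soup = BeautifulSoup(html, 'html.parser')

# class_="list" also matches the ul whose class is "list list-small".
list_ids = [ul['id'] for ul in soup.find_all('ul', class_='list')]

# limit stops the search after the given number of matches.
first_two = [li.get_text() for li in soup.find_all('li', limit=2)]

print(list_ids, first_two)
```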

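find()

Besides find_all() there is find(). The difference is that find() returns a single element, the first match, while find_all() returns a list of all matching elements. For example: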

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find(name='ul'))
print(type(soup.find(name='ul')))
print(soup.find(class_='list'))

The result:

<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>

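The result is no longer a list but the first matching node element, and its type is still Tag.
There are many other related method selectors:
find_parents() and find_parent(): the former returns all ancestor nodes, the latter returns the direct parent.
find_next_siblings() and find_next_sibling(): the former returns all following siblings, the latter returns the first following sibling.
find_previous_siblings() and find_previous_sibling(): the former returns all preceding siblings, the latter returns the first preceding sibling.
find_all_next() and find_next(): the former returns all matching nodes after the current node, the latter returns the first matching node after it.
find_all_previous() and find_previous(): the former returns all matching nodes before the current node, the latter returns the first matching node before it.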
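
A brief sketch of a few of these methods follows, together with the one behaviour of find() that most often trips people up: it returns None when nothing matches. The HTML is a cut-down, made-up variant of the test text, and html.parser is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<div><ul id="list-1"><li>Foo</li><li>Bar</li></ul></div>',
    'html.parser',
)

# find() returns None when there is no match, so guard before chaining.
table = soup.find('table')
rows = table.find_all('tr') if table is not None else []

first_li = soup.find('li')
parent_id = first_li.find_parent('ul')['id']             # nearest ul ancestor
next_text = first_li.find_next_sibling('li').get_text()  # first following li

print(table, rows, parent_id, next_text)
```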
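For more, see the official documentation. This has been only a rough introduction. The selectors above all work through node structure, but there are other kinds as well, such as CSS selectors. I won't cover them here; in general, knowing regular expressions, Beautiful Soup, and XPath is about enough, and anyone interested can look into CSS selectors.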
