Don't reach for the mouse when a few lines of code will do

Last updated: 2020-04-01
Note 1: the bs4 library is Beautiful Soup, version 4.x.
Note 2: this article follows the official bs4 documentation, Beautiful Soup Documentation.

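Contents

I. Introducing the bs4 library

II. First taste: how to use bs4

III. Warming up: going further with bs4

3. How to roam between Tags

IV. One step closer: precise matching

V. Hitting your stride: getting comfortable

VI. Feeling restless: slow down

1. Meet find

VII. Enjoy!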

I. Introducing the bs4 library

1. What is bs4

bs4 is a Python library; you can think of it as a plug-in.

2. What can bs4 do

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

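In the official words: Beautiful Soup is a Python library that pulls data out of HTML or XML files.

What data? The data you want.
And what data do you want? Well, that I can't tell you~

3. How to install bs4

Easy: install it with the pip package manager. The package is published on PyPI as beautifulsoup4 (the name bs4 there is only a thin wrapper that pulls it in).

pip install beautifulsoup4

What? You don't have pip?
It ships with Python when you install it!
If the command is "not found", you probably haven't configured your PATH environment variable.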

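4. How to use bs4

That's for the next lesson. Class dismissed!

II. First taste: how to use bs4

1. Making code easier to read

This library lives up to its name: Beautiful Soup is for making soup. Even if you don't need to extract any data, you can use it to make the HTML you fetched easier to read.

Tags all jammed together are ugly, or rather, hard to read.
Look at the following example: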

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())

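The code above imports the bs4 library, builds an HTML document string, creates a BeautifulSoup object from it, and finally prints the result of prettify().

Note that the key piece is prettify(): it is what formats the markup.
The result is shown in the figure below:

2. Finding tags

Merely making code prettier is not much fun. Code is written to be used, not primarily to be read.
What matters is quickly finding the tags I need, and their contents, in this markup.

The following builds on the code from the previous example, as do the later examples; I won't repeat this again.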

print(soup.title)
print(soup.title.name)
print(soup.title.parent.name)
print(soup.p)
print(soup.p['class'])
print(soup.a)
print(soup.find_all('a'))
print(soup.find_all(id='link3'))

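I haven't commented this code, but one look should tell you what it does.
The result is shown in the figure below:

Note: the seventh line of output is cut off in the screenshot; it is a list. The code finds all a tags and prints them.

3. An example: finding every hyperlink on a page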

for link in soup.find_all('a'):
    print(link.get('href'))

The result is shown in the figure below:

4. Getting all the text on a page

print(soup.get_text())

The result is shown in the figure below:
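Beyond the plain call above, get_text() also takes a separator and a strip flag, which helps when the raw text is full of stray whitespace. A minimal sketch, assuming a small made-up HTML snippet (the markup and variable names here are only for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, not the article's html_doc
html = "<p>Hello, <b>world</b>!</p>\n<p>  Second paragraph </p>"
soup = BeautifulSoup(html, "html.parser")

# Plain call: every text node concatenated exactly as it appears
raw = soup.get_text()

# With a separator and strip=True: each text node is stripped,
# empty ones are dropped, and the rest are joined with the separator
joined = soup.get_text(" ", strip=True)

print(repr(raw))
print(repr(joined))
```

Note that the separator goes between every pair of text nodes, so punctuation that lived in its own node ends up with a space in front of it.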

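5. Summary

See, pretty simple, right?
If this seems all right so far, read on.

Now is a good time to introduce the concept of a parser. A parser is what bs4 uses to parse the HTML document. The code above names html.parser, the parser from Python's standard library; if you don't name one, bs4 picks the best parser it finds installed.

That parser comes with your Python installation, but you can still install more powerful third-party parsers such as lxml and html5lib.

If you want to install a parser such as lxml, you can do this:

# Install the parser with pip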
pip install lxml
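If you aren't sure whether lxml is installed on the machine running your script, one option is to try it first and fall back to the built-in parser. This is only a sketch: make_soup is a hypothetical helper name, and the markup is made up for the demo. bs4 raises FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not available; the standard-library parser always is
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<html><head><title>Demo</title></head><body><p>hi</p></body></html>")
print(soup.title.string)
```

Keep in mind that different parsers can build slightly different trees from broken HTML, so pinning one parser is the safer choice when results must be reproducible.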

Here is a figure borrowed from the official documentation:

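III. Warming up: going further with bs4

1. What do I make the soup from

Good soup needs good ingredients: you can pass in HTML directly, or pass in an open file handle.

# Pass an open file handle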
from bs4 import BeautifulSoup

with open('index.html') as fp:
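    soup = BeautifulSoup(fp, 'html.parser')

# Pass an HTML document string
soup = BeautifulSoup("<html>data</html>", 'html.parser')

Note: the code in the screenshot differs slightly from the code above, but means the same thing.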

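2. What's in the soup

1) Tags

Note: this Tag corresponds to the tag in HTML, but the two are different types!

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

A Tag has many attributes, for example:

Attribute	How to access it
`name`	`tag.name`
`id`	`tag['id']`

Note 1: as the table shows, a Tag has many attributes, and you can read them the way you would read a dictionary (with ['key']).
Note 2: you can add, modify, and delete these attributes, not just read them. (To modify one, simply assign to it.)

Example: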

tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# <b another-attribute="1" class="boldest" id="verybold">Extremely bold</b>

del tag['id']
del tag['another-attribute']
tag
# <b class="boldest">Extremely bold</b>

tag['id']
# KeyError: 'id'
print(tag.get('id'))
# None

2) What if an attribute has multiple values

No problem: BeautifulSoup returns a list containing those values.

css_soup = BeautifulSoup('<p class="body"></p>')
css_soup.p['class']
# ["body"]

css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.p['class']
# ["body", "strikeout"]

But what about attributes that merely look multi-valued and really aren't?
No problem, those are not misjudged:

id_soup = BeautifulSoup('<p id="my id"></p>')
id_soup.p['id']
# 'my id'

But sometimes a single attribute value really is being split in two. What then?
You can make your position clear by passing multi_valued_attributes=None:

no_list_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html', multi_valued_attributes=None)
no_list_soup.p['class']
# u'body strikeout'

You can also call the get_attribute_list() method to say you want the result as a list.
Sure, as you wish:

id_soup.p.get_attribute_list('id')
# ["my id"]

If you tell it to parse as XML (the 'xml' parser needs lxml installed), the value isn't treated as multi-valued either:

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# u'body strikeout'

Here comes another question: what if you ask for XML but still want multi-valued handling?
Easy, just say so (pass multi_valued_attributes=class_is_multi):

class_is_multi= { '*' : 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# [u'body', u'strikeout']

3) Replacing a string

tag.string.replace_with("No longer bold")
tag
# <b class="boldest">No longer bold</b>

3. How to roam between Tags

Suppose we have a document like this:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

1) Just call it by name

Call it what it is. Couldn't be simpler.

soup.head
# <head><title>The Dormouse's story</title></head>

soup.title
# <title>The Dormouse's story</title>

2) Follow the parent downward

soup.body.b
# <b>The Dormouse's story</b>

3) If several Tags match, only the first is returned

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

4) If you want every match

No problem, just use find_all('tag_name'):

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

5) How many children: no nesting dolls!

A Tag's direct children are all recorded in the family register (.contents), which is a list:

head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
# [<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']

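There is also .children, which iterates over the same direct children (text nodes included); the difference is that it is an iterator rather than a list.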
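To see how .contents and .children relate, here is a small sketch on a made-up paragraph (the markup is only for illustration). Both walk the same direct children, text nodes included; one hands you a list, the other something you iterate over.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a paragraph with text, a tag, then more text
soup = BeautifulSoup("<p>Hello <b>bold</b> world</p>", "html.parser")
p = soup.p

as_list = p.contents             # a real list, so it can be indexed
as_iterated = list(p.children)   # the same nodes, produced lazily

print(as_list)
print(as_iterated)
print(as_list[1].name)           # the <b> tag sits between two text nodes
```

Reach for .contents when you need indexing or len(), and .children when you only loop once.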

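6) Iterating descendants recursively: nesting dolls allowed!

.contents and .children only look after the sons; grandsons and below are ignored.
For example:

head_tag.contents
# [<title>The Dormouse's story</title>]

In the code above, the text inside <title> is a child of <title>, and so a grandchild of <head>, but head_tag's .contents and .children don't cover it!

If you want it covered, .descendants is the dependable one: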

for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story
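The difference in reach is easy to count. A rough sketch on a made-up fragment (markup and variable names are assumptions for the demo): .contents stops at the first level, while .descendants walks every tag and text node underneath.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical fragment with two levels of nesting
soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
div = soup.div

children = div.contents              # direct children only
descendants = list(div.descendants)  # everything below, depth first

# Label each node so tags and text nodes are easy to tell apart
kinds = [
    "tag:" + node.name if isinstance(node, Tag) else "text:" + str(node)
    for node in descendants
]

print(len(children), len(descendants))
print(kinds)
```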

7) Where do those at the bottom turn

If a Tag has exactly one child, and that child is a NavigableString, then, out of pity, it gets a .string attribute.

title_tag.string
# u'The Dormouse's story'

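So here's the question: what if a Tag has exactly one child, and that child in turn has exactly one NavigableString child?
Fine, they're all in the same boat, so same treatment:
the grandparent can reach the string directly through .string: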

head_tag.contents
# [<title>The Dormouse's story</title>]

head_tag.string
# u'The Dormouse's story'

But when a Tag contains more than one thing, it's unclear what .string should refer to, right? What then?
Shrug. It returns None; nothing I can do:

print(soup.html.string)
# None

But what if I just want all the strings at once?
Then we need another way. How about a .strings attribute?
Good, happily settled:

for string in soup.strings:
    print(repr(string))
# u"The Dormouse's story"
# u'\n\n'
# u"The Dormouse's story"
# u'\n\n'
# u'Once upon a time there were three little sisters; and their names were\n'
# u'Elsie'
# u',\n'
# u'Lacie'
# u' and\n'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'\n\n'
# u'...'
# u'\n'

Whoa, you're fobbing me off with all that whitespace?
What to do?
Take it on!
Bring out the golden shield, stripped_strings, and wipe out the whitespace~

for string in soup.stripped_strings:
    print(repr(string))
# u"The Dormouse's story"
# u"The Dormouse's story"
# u'Once upon a time there were three little sisters; and their names were'
# u'Elsie'
# u','
# u'Lacie'
# u'and'
# u'Tillie'
# u';\nand they lived at the bottom of a well.'
# u'...'

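8) Hold on, someone is petitioning upward

Enough about sons and grandsons. Why has nobody called for dad?
Look, say dad, da...

You can petition upward through .parent. It's a legitimate right!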

title_tag = soup.title
title_tag
# <title>The Dormouse's story</title>
title_tag.parent
# <head><title>The Dormouse's story</title></head>

The parent of a bottom-level string is the tag that contains it:

title_tag.string.parent
# <title>The Dormouse's story</title>

So who is the parent at the very top?
The Buddha himself!
Behold the golden body!

html_tag = soup.html
type(html_tag.parent)
# <class 'bs4.BeautifulSoup'>

And who is the Buddha's parent?
Now you're making things up out of thin air~~

print(soup.parent)
# None

What? You think that's too few parents?
All right then,
here come the parents!
Through .parents you can visit all of them~

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# .parents stops after the BeautifulSoup object; it never yields None
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
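One handy use of .parents is building a breadcrumb of where a tag sits in the tree. A minimal sketch, assuming a small made-up document (the markup and the " > " separator are just choices for this demo):

```python
from bs4 import BeautifulSoup

# Hypothetical document: a link nested three levels deep
soup = BeautifulSoup(
    "<html><body><p><a href='#'>link</a></p></body></html>", "html.parser"
)
link = soup.a

# Walk upward: nearest parent first, the BeautifulSoup object last
ancestors = [parent.name for parent in link.parents]
print(ancestors)

# Drop the '[document]' entry and flip the order to read top-down
path = " > ".join(reversed(ancestors[:-1])) + " > " + link.name
print(path)
```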

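9) When it counts, turn to your brothers

All this talk of sons and fathers makes it seem as if brothers aren't close.
Suppose we have a document like this: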

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'html.parser')
print(sibling_soup.prettify())
# <a>
#  <b>
#   text1
#  </b>
#  <c>
#   text2
#  </c>
# </a>

You can reach the younger brother through next_sibling and the elder brother through previous_sibling:

sibling_soup.b.next_sibling
# <c>text2</c>

sibling_soup.c.previous_sibling
# <b>text1</b>

So who is the big brother's big brother? The little brother's little brother?
Come on, you're making things up again:

print(sibling_soup.b.previous_sibling)
# None
print(sibling_soup.c.next_sibling)
# None

Let's be clear, though: without the same parent, they don't count as brothers:

sibling_soup.b.string
# u'text1'

print(sibling_soup.b.string.next_sibling)
# None

Remember that summer by the lake?
You thought the .next_sibling of the first <a> tag would be the next <a>.
You were wrong!
Did you think my newline \n didn't exist?

link = soup.a
link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

link.next_sibling
# u',\n'

Looking for your brother?
Go one step further!

link.next_sibling.next_sibling
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
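If you don't feel like hopping over text nodes by hand, bs4 also provides find_next_sibling() and find_next_siblings(), which skip anything that doesn't match. A short sketch on a trimmed-down, made-up version of the sisters paragraph:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: three links separated by text nodes
html = (
    '<p><a id="link1">Elsie</a>,\n'
    '<a id="link2">Lacie</a> and\n'
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
first = soup.a

# The immediate sibling is a text node, not the next link
print(repr(first.next_sibling))

# find_next_sibling skips non-matching siblings and returns the first <a>
nxt = first.find_next_sibling("a")
print(nxt["id"])

# find_next_siblings collects every later sibling that matches
later_ids = [a["id"] for a in first.find_next_siblings("a")]
print(later_ids)
```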

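10) Gathering the whole band of brothers

Where are all the younger and elder brothers?
Iterate over them one by one; nobody gets away today!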

for sibling in soup.a.next_siblings:
    print(repr(sibling))
# u',\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u' and\n'
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
# u';\nand they lived at the bottom of a well.'

for sibling in soup.find(id="link3").previous_siblings:
    print(repr(sibling))
# ' and\n'
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
# u',\n'
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
# u'Once upon a time there were three little sisters; and their names were\n'

IV. One step closer: precise matching

Suppose we have a document like this:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

1. Be a happy lumberjack

Not that kind of tree: this is the DOM tree.

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first match
soup.find('a')

# Find all matches; returns a list
soup.find_all('a')

2. The regex sword

If you pass a regular expression object, bs4 filters tag names with its search() method.

# This example finds all tags whose names start with b
import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b

# This example finds all tags whose names contain t
for tag in soup.find_all(re.compile("t")):
    print(tag.name)
# html
# title

3. Lists match them all

If you pass a list, bs4 matches anything that matches any item in the list.

# This example matches all a tags and all b tags
soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

4. When True descends

If you pass True, every Tag is matched (text strings are not).

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

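5. Functions for the win

If none of the matching methods above suits you, let's define our own pattern.

You can define a function that decides whether a tag meets your criteria and returns a boolean.
For example: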

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
    
soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were…bottom of a well.</p>,
#  <p class="story">...</p>]

You can also define a matching function for a Tag's attribute value:

def not_lacie(href):
    return href and not re.compile("lacie").search(href)
soup.find_all(href=not_lacie)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

One more example: checking whether a Tag is surrounded by string objects:

from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# body
# p
# a
# a
# a
# p

V. Hitting your stride: getting comfortable

find_all() has a lot to it; the examples below all build on it.

1. Searching for Tags

# Signature: find_all(name, attrs, recursive, string, limit, **kwargs)
soup.find_all("title")
# [<title>The Dormouse's story</title>]

soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]

soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# Note that some attributes can't be used directly as keyword arguments, for example data-xxx:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

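# Another exception: name is already find_all()'s first argument (the tag name), so pass the attribute through attrs:
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')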
name_soup.find_all(name="email")
# []
name_soup.find_all(attrs={"name": "email"})
# [<input name="email"/>]
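The same attrs dictionary also gets you around the data-* problem, since dictionary keys can contain hyphens. A minimal sketch, assuming a made-up pair of div tags:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two data-foo values
data_soup = BeautifulSoup(
    '<div data-foo="value">foo!</div><div data-foo="other">bar</div>',
    "html.parser",
)

# A dict key may contain a hyphen, unlike a Python keyword argument
matches = data_soup.find_all(attrs={"data-foo": "value"})
print(matches)
```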

2. `css` classes and selectors

You can use the class_ keyword to find tags that carry a given CSS class.

Because class is a reserved word in Python, bs4 uses class_ instead.

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

In fact, class_ accepts a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6

soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Note: when a Tag has several CSS classes, matching any one of them is enough for a hit!

# When a Tag has several CSS classes, you can also spell out the exact class string:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

# But if you write the classes in a different order, you come up empty-handed
css_soup.find_all("p", class_="strikeout body")
# []

# So to require several CSS classes without coming up empty, use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
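select() understands more than class chains. Here is a rough sketch of a few common selector forms, run on a made-up fragment (the ids, classes, and hrefs are assumptions for the demo); select_one() returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for trying out a few selectors
html = """
<p class="body strikeout">both classes</p>
<p class="strikeout">one class</p>
<div id="links"><a class="sister" href="/a">A</a><a href="/b">B</a></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Class order does not matter in a selector
both = soup.select("p.body.strikeout")
same = soup.select("p.strikeout.body")

# id, child combinator, and class in one selector; first match only
sister = soup.select_one("#links > a.sister")

# Attribute selector: every <a> that has an href
hrefs = [a["href"] for a in soup.select("a[href]")]

print(both, sister, hrefs)
```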

3. Tags aren't your one and only

You can also search for strings directly

soup.find_all(string="Elsie")
# [u'Elsie']

soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']

soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]

def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)

soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

4. limit: capping the number of results

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

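5. recursive: nesting dolls again

You can call mytag.find_all() to search beneath a particular tag; by default it looks at all descendants, and with recursive=False it only looks at direct children.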

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...

soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

6. A shorthand for find_all

find_all is already that handy, yet the maintainers went and made it even shorter.

# The next two lines are equivalent
soup.find_all("a")
soup("a")

# The next two lines are equivalent
soup.title.find_all(string=True)
soup.title(string=True)

VI. Feeling restless: slow down

find_all is extremely handy, but sometimes you only want the first result. Rather than passing limit=1 to find_all every time, just use find().

1. Meet find

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

How it differs from find_all():

find_all() returns a list; find() returns the result itself
When nothing matches, find_all() returns an empty list; find() returns None
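That None is the part that bites in practice: calling a method on a missing result raises AttributeError. A minimal sketch of guarding against it, assuming a made-up document with a title but no h1:

```python
from bs4 import BeautifulSoup

# Hypothetical document: it has a <title> but no <h1>
soup = BeautifulSoup("<html><head><title>Demo</title></head></html>", "html.parser")

title = soup.find("title")
heading = soup.find("h1")        # nothing matches, so this is None
headings = soup.find_all("h1")   # nothing matches, so this is []

# Check for None before touching the result
title_text = title.get_text() if title is not None else ""
heading_text = heading.get_text() if heading is not None else ""

print(repr(title_text), repr(heading_text), headings)
```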

VII. Enjoy!

To be continued…