A Guide to BeautifulSoup - 0x03_Searching the Parse Tree

GitHub@orca-j35. All notes are hosted in the python_notes repository.
Reposting in any form is welcome, but please credit the source.
Reference: https://www.crummy.com/softwa...

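Overview

BeautifulSoup defines many methods for searching the parse tree, and they are all very similar. Most of them take the same arguments as find_all(): name, attrs, string, limit, and **kwargs. Only find() and find_all() support the recursive argument.

This note focuses on find() and find_all(); the other search methods work in much the same way.

Three sisters

This section uses the "three sisters" document as its running example: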

html_doc = """
<html>
<head>
    <title>The Dormouse's story</title>
</head>
<body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
        <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
        <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
        <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
        and they lived at the bottom of a well.
    </p>

    <p class="story">...</p>
"""
from pprint import pprint
from bs4 import BeautifulSoup
import re

soup = BeautifulSoup(html_doc, 'html.parser')

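Filters

A filter selects target nodes in the parse tree and is passed as an argument to a search method.

Strings

A string can be used as a filter. BeautifulSoup uses it to test nodes and keeps the ones that match:

When a string filters on tag names, tag nodes whose name equals the string are kept, and text nodes are always discarded
When a string filters on an HTML attribute, tag nodes whose attribute value equals the string are kept, and text nodes are always discarded
When a string filters on HTML text, text nodes equal to the string are kept

Like a str, a bytes object can also be used as a filter; the difference is that BeautifulSoup assumes it is encoded as UTF-8.

Example: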

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes named b
print([f"{type(i)}::{i.name}" for i in soup.find_all('b')])
print([f"{type(i)}::{i.name}" for i in soup.find_all(b'b')])
# Find tag nodes whose id is link1
print([f"{type(i)}::{i.name}" for i in soup.find_all(id='link1')])
# Find text nodes whose text is Elsie
print([f"{type(i)}::{i.name}" for i in soup.find_all(text='Elsie')])

Output:

["<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None"]

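Regular expressions

A compiled regular expression object can be used as a filter. BeautifulSoup calls the pattern's search() method to test nodes and keeps the ones that match:

When a regular expression filters on tag names, search() is applied to each tag's name and matching tag nodes are kept. Text nodes are always discarded, because their .name is None
When a regular expression filters on an HTML attribute, search() is applied to the value of that attribute and matching tag nodes are kept. Text nodes are always discarded, because they have no HTML attributes
When a regular expression filters on HTML text, search() is applied to each text node and matching text nodes are kept

Example: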

import re

soup = BeautifulSoup(html_doc, 'html.parser')
# Find nodes whose name contains the letter b
print([f"{type(i)}::{i.name}" for i in soup.find_all(re.compile(r'b'))])
# Find tags whose class value starts with t
print(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=re.compile(r'^t'))])
# Find text nodes whose text starts with E
print([f"{type(i)}::{i.name}" for i in soup.find_all(text=re.compile(r'^E'))])

Output:

["<class 'bs4.element.Tag'>::body", "<class 'bs4.element.Tag'>::b"]
["<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.NavigableString'>::None"]
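
Because the pattern is applied with search() rather than match(), an unanchored pattern matches anywhere in the tag name. The sketch below illustrates this on a tiny made-up document (the markup is an assumption for illustration, not the three sisters document), and shows how ^ and $ narrow the match to an exact name:

```python
import re
from bs4 import BeautifulSoup

# A tiny hypothetical document, used only for this illustration.
doc = "<html><head><title>x</title></head><body><p>y</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

# Unanchored: search() finds the letter "t" anywhere in the tag name.
unanchored = [tag.name for tag in soup.find_all(re.compile(r"t"))]

# Anchored with ^ and $: only a tag named exactly "title" matches.
anchored = [tag.name for tag in soup.find_all(re.compile(r"^title$"))]

print(unanchored)  # ['html', 'title']
print(anchored)    # ['title']
```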

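Lists

A list can be used as a filter. Its items can be:

Strings
Regular expression objects
Callables (see the Functions section)

BeautifulSoup tests nodes against the items in the list and keeps the ones that match:

When a list filters on tag names, a tag node is kept if its name matches any item in the list, and text nodes are always discarded
When a list filters on an HTML attribute, a tag node is kept if the attribute value matches any item in the list, and text nodes are always discarded
When a list filters on HTML text, a text node is kept if its text matches any item in the list

Example: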

import re
def func(tag):
    return tag.get('id') == "link1"

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes that match any item in the list
tag = soup.find_all(['title', re.compile('b$'), func])
pprint([f"{type(i)}::{i.name}" for i in tag])
# Find text nodes that match any string in the list
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(text=["Elsie", "Tillie"])])

Output:

["<class 'bs4.element.Tag'>::title",
 "<class 'bs4.element.Tag'>::b",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None"]

True

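The Boolean value True can be used as a filter:

When True filters on tag names, all tag nodes are kept and all text nodes are discarded
When True filters on an HTML attribute, all tag nodes that have that attribute are kept and all text nodes are discarded
When True filters on HTML text, all text nodes are kept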

soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(True)])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(id=True)])
pprint([f"{type(i)}::{i.name}" for i in soup.find_all(text=True)])

Output:

["<class 'bs4.element.Tag'>::html",
 "<class 'bs4.element.Tag'>::head",
 "<class 'bs4.element.Tag'>::title",
 "<class 'bs4.element.Tag'>::body",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::b",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None",
 "<class 'bs4.element.NavigableString'>::None"]

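Functions

A filter can be a function (or any callable):

When filtering tag nodes, the filter function takes a tag node as its only argument. If it returns True the tag node is kept; otherwise it is discarded.

Example - select tag nodes that have a class attribute but no id attribute: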

def has_class_but_no_id(tag):
    # Here’s a function that returns True if a tag defines the “class” attribute but doesn’t define the “id” attribute
    return tag.has_attr('class') and not tag.has_attr('id')


soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(has_class_but_no_id)
pprint([f"{type(i)}::{i.name}" for i in tag])

Output:

["<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p"]

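When filtering on an HTML attribute, the filter function receives the attribute value rather than the whole tag node. If the tag node does not have the target attribute, None is passed to the function; otherwise the actual value is passed. If the function returns True the tag node is kept; otherwise it is discarded.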

def not_lacie(href):
    # Here’s a function that finds all a tags whose href attribute does not match a regular expression
    return href and not re.compile("lacie").search(href)


soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(href=not_lacie)
for i in tag:
    print(f"{type(i)}::{i.name}::{i}")

Output:

<class 'bs4.element.Tag'>::a::<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<class 'bs4.element.Tag'>::a::<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
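
Since an attribute filter function may be handed None for tags that lack the attribute, it is safest to check for None before calling string methods. A minimal sketch, where the markup and the is_https helper are hypothetical names made up for this illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third link has no href attribute at all.
doc = (
    '<a id="a1" href="https://example.com/x">x</a>'
    '<a id="a2" href="http://example.com/y">y</a>'
    '<a id="a3">no href</a>'
)
soup = BeautifulSoup(doc, "html.parser")


def is_https(href):
    # Guard against a missing attribute (None) before using str methods.
    return href is not None and href.startswith("https://")


https_ids = [tag["id"] for tag in soup.find_all(href=is_https)]
print(https_ids)  # ['a1']
```

Without the None check, href.startswith would raise AttributeError whenever the function is called with None.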

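When filtering on HTML text, the filter function receives the text value rather than a whole tag node. If the function returns True the text node is kept; otherwise it is discarded.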

def func(text):
    return text == "Lacie"

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i}" for i in soup.find_all(text=func)])

Output:

["<class 'bs4.element.NavigableString'>::Lacie"]
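
A common use of a text filter function is to skip the whitespace-only text nodes that indentation produces. The sketch below assumes a small made-up fragment; has_visible_text is a hypothetical helper, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with indentation and a trailing newline.
doc = "<p>\n  Hello <b>world</b>\n</p>"
soup = BeautifulSoup(doc, "html.parser")


def has_visible_text(text):
    # Keep only text nodes that contain something other than whitespace.
    return text is not None and text.strip() != ""


visible = [s.strip() for s in soup.find_all(string=has_visible_text)]
print(visible)  # ['Hello', 'world']
```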

Filter functions can be as elaborate as you need, for example:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import NavigableString


def surrounded_by_strings(tag):
    # returns True if a tag is surrounded by string objects
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

soup = BeautifulSoup(html_doc, 'html.parser')
tag = soup.find_all(surrounded_by_strings)
pprint([f"{type(i)}::{i.name}" for i in tag])
# Note how whitespace affects the output

Output:

["<class 'bs4.element.Tag'>::body",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::p"]

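To see how whitespace changes the result of a filter like surrounded_by_strings, the sketch below runs the same function over two made-up fragments that differ only in the newlines between tags. The fragments are assumptions for illustration. Note that previous_element is the node parsed immediately before the tag, which may be a string inside the preceding tag:

```python
from bs4 import BeautifulSoup, NavigableString


def surrounded_by_strings(tag):
    # True if the nodes parsed right before and right after the tag are strings.
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))


# Same structure, with and without newlines between the tags.
compact = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
spaced = BeautifulSoup("<div>\n<p>one</p>\n<p>two</p>\n</div>", "html.parser")

compact_hits = [t.get_text() for t in compact.find_all(surrounded_by_strings)]
spaced_hits = [t.get_text() for t in spaced.find_all(surrounded_by_strings)]

print(compact_hits)  # ['two']
print(spaced_hits)   # ['one', 'two']
```

In the compact fragment the first p is preceded by the div tag, so it is rejected; in the spaced fragment a newline string sits in front of it, so it is accepted.
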
find_all()🔨

🔨find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs)

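This method searches all descendants of the current tag object, collects every node that matches the given criteria, and returns them in a list.

name argument

name is a filter on tag names: find_all() keeps the tag objects that match the name filter. When name is used, text nodes are filtered out automatically, because the .name of a text node is None.

All five kinds of filter described above can be used for name: a string, a regular expression, a list, True, or a function (callable).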

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i.name}" for i in soup.find_all('title')])
#> ["<class 'bs4.element.Tag'>::title"]

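**kwargs arguments

Any keyword argument that is not part of the function signature is treated as a filter on the HTML attribute of the same name: find_all() keeps the tag objects whose attribute value matches the keyword argument. When such a keyword argument is used, text nodes are filtered out automatically, because text nodes have no HTML attributes.

All five kinds of filter described above can be used as the value of a keyword argument: a string, a regular expression, a list, True, or a function (callable).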

soup = BeautifulSoup(html_doc, 'html.parser')
# Find tag nodes whose id is link2
print([f"{type(i)}::{i.name}" for i in soup.find_all(id='link2')])
# Find tag nodes whose href ends with the letter 'e'
print([f"{type(i)}::{i.name}" for i in soup.find_all(href=re.compile(r"e$"))])
# Find tag nodes that have an id attribute
print([f"{type(i)}::{i.name}" for i in soup.find_all(id=True)])
# Filter on several HTML attributes at once
print([
    f"{type(i)}::{i.name}"
    for i in soup.find_all(class_="sister", href=re.compile(r"tillie"))
])

Output:

["<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a"]

string

The string keyword argument is equivalent to the text argument:

soup = BeautifulSoup(html_doc, 'html.parser')
print([f"{type(i)}::{i}" for i in soup.find_all(string=re.compile("sisters"))])
#> ["<class 'bs4.element.NavigableString'>::Once upon a time there were three little sisters; and their names were\n        "]
print([f"{type(i)}::{i}" for i in soup.find_all(text=re.compile("sisters"))])
#> ["<class 'bs4.element.NavigableString'>::Once upon a time there were three little sisters; and their names were\n        "]

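string was added in Beautiful Soup 4.4.0; earlier versions only accept text.

Exceptions

Some HTML5 attribute names are not valid Python identifiers and cannot be used as keyword arguments. Use the attrs argument to filter on those attributes instead: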

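data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
# data-foo is not a valid keyword name, so the call below is a syntax error
# (shown commented out; the exact message depends on the Python version):
# data_soup.find_all(data-foo="value")
#> SyntaxError: keyword can't be an expression

print([
    f"{type(i)}::{i.name}"
    for i in data_soup.find_all(attrs={"data-foo": "value"})
])
#> ["<class 'bs4.element.Tag'>::div"]

Keyword arguments cannot be used to filter on the HTML name attribute, because find_all() already uses name as the name of its first parameter. To filter on the name attribute, use the attrs argument.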

soup = BeautifulSoup(html_doc, 'html.parser')
name_soup = BeautifulSoup('<input name="email"/>', 'html.parser')
print([f"{type(i)}::{i.name}" for i in name_soup.find_all(name="email")])
print([
    f"{type(i)}::{i.name}" for i in name_soup.find_all(attrs={"name": "email"})
])

Output:

[]
["<class 'bs4.element.Tag'>::input"]

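Searching by CSS class

The name of the HTML class attribute, class, is a reserved word in Python. Since BeautifulSoup 4.1.2 you can filter on the class attribute with the keyword argument class_. When it is used, text nodes are filtered out automatically, because text nodes have no HTML attributes.

All five kinds of filter described above can be used as the value of class_: a string, a regular expression, a list, True, or a function (callable).

# Find a tags whose class is sister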
soup = BeautifulSoup(html_doc, 'html.parser')
pprint([f"{type(i)}::{i.name}" for i in soup.find_all("a", class_="sister")])

# Find tags whose class contains the substring itl
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=re.compile("itl"))])

def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
# Find tags whose class value is six characters long
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=has_six_characters)])

# Find tags whose class is title or story
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(class_=['title', "story"])])

Output:

["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::p"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p",
 "<class 'bs4.element.Tag'>::p"]

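The class attribute can hold several values. If class_ matches a single value, every tag that carries that CSS class is selected. If class_ is given a string containing several values, it is compared against the exact class string, in order: the same classes in a different order will not match: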

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
print(css_soup.find_all(class_='body'))
#> [<p class="body strikeout"></p>]
print(css_soup.find_all(class_='strikeout'))
#> [<p class="body strikeout"></p>]

print(css_soup.find_all("p", class_="body strikeout"))
#> [<p class="body strikeout"></p>]
print(css_soup.find_all("p", class_="strikeout body"))
#> []

So when you need to find tags by several CSS classes at once, use a CSS selector to avoid failing on a different class order:

print(css_soup.select("p.strikeout.body"))
#> [<p class="body strikeout"></p>]
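
If you would rather stay with find_all(), an order-independent match can also be written as a tag filter function. This is an alternative sketch, not the approach the note recommends; has_classes is a hypothetical helper, and it relies on class being returned as a list of values, which is what html.parser produces for HTML:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")


def has_classes(*wanted):
    """Build a tag filter that requires every class in `wanted`, in any order."""
    required = set(wanted)

    def matcher(tag):
        # class is a multi-valued attribute, so tag.get returns a list of classes.
        return required.issubset(tag.get("class", []))

    return matcher


hits = css_soup.find_all(has_classes("strikeout", "body"))
misses = css_soup.find_all(has_classes("body", "missing"))
print(hits)    # [<p class="body strikeout"></p>]
print(misses)  # []
```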

Before BeautifulSoup 4.1.2 the class_ argument was not available; the same search can be done through the attrs argument:

soup = BeautifulSoup(html_doc, 'html.parser')
pprint(
    [f"{type(i)}::{i.name}" for i in soup.find_all(attrs={"class": "sister"})])

pprint([f"{type(i)}::{i.name}" for i in soup.find_all(attrs="sister")])

Output:

["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]
["<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a",
 "<class 'bs4.element.Tag'>::a"]

attrs argument

Two kinds of value can be passed to attrs:

A filter - .find_all() looks for tags whose CSS class matches the filter. All five kinds of filter described above can be used.

soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all("p", "title"))
#> [<p class="title"><b>The Dormouse's story</b></p>]

print([f"{type(i)}::{i.name}" for i in soup.find_all(attrs="sister")])
#> ["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]

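A mapping - .find_all() treats each key-value pair in the mapping as an HTML attribute name and value, and finds the tags whose attributes match. All five kinds of filter described above can be used as values in the mapping.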

soup = BeautifulSoup(html_doc, 'html.parser')

pprint([
    f"{type(i)}::{i.name}" for i in soup.find_all(attrs={
        "class": "sister",
        "id": "link1",
    })
])
#> ["<class 'bs4.element.Tag'>::a"]
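
The values in the mapping do not have to be plain strings. The sketch below mixes a regular expression and True in one attrs mapping; the data-id and data-kind attributes and the markup are made up for this illustration:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup using data-* attributes, which cannot be keyword arguments.
doc = (
    '<div data-id="42" data-kind="item">a</div>'
    '<div data-id="x7" data-kind="item">b</div>'
    '<div data-id="99">c</div>'
)
soup = BeautifulSoup(doc, "html.parser")

hits = soup.find_all(attrs={
    "data-id": re.compile(r"^\d+$"),  # regex filter: value must be all digits
    "data-kind": True,                # True filter: attribute must be present
})
texts = [tag.get_text() for tag in hits]
print(texts)  # ['a']
```

A tag is kept only if every entry in the mapping matches, so the second div fails the regex and the third lacks data-kind.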

text/string argument

The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.

text is a filter on text nodes: find_all() keeps the text nodes that match the text filter. All five kinds of filter described above can be used for text.

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.find_all(string="Elsie"))
print(soup.find_all(string=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(string=re.compile("Dormouse")))


def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)


print(soup.find_all(string=is_the_only_string_within_a_tag))

Output:

['Elsie']
['Elsie', 'Lacie', 'Tillie']
["The Dormouse's story", "The Dormouse's story"]
["The Dormouse's story", "The Dormouse's story", 'Elsie', 'Lacie', 'Tillie', '...']

When searching for tags, text acts as an additional condition: find_all() selects the tags whose .string matches the text filter:

soup = BeautifulSoup(html_doc, 'html.parser')

print([f'{type(i)}::{i}' for i in soup.find_all("a", string="Elsie")])
#> ['<class \'bs4.element.Tag\'>::<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>']
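
Because the comparison is made against .string, a tag whose text is split across several children is not matched, since its .string is None. A minimal sketch on made-up markup (the ids plain and nested are hypothetical):

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the second link's text is split by a nested b tag.
doc = '<a id="plain">Elsie</a><a id="nested"><b>El</b>sie</a>'
soup = BeautifulSoup(doc, "html.parser")

plain_string = soup.find(id="plain").string    # the single child string
nested_string = soup.find(id="nested").string  # None: more than one child

hits = soup.find_all("a", string=re.compile("sie"))
ids = [tag["id"] for tag in hits]
print(ids)
```

To match on the combined text of a tag instead, a tag filter function that tests tag.get_text() is one option.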

limit argument

By default find_all() returns every matching object. If you do not need all of them, use limit to cap the number of results: BeautifulSoup stops searching once it has found limit matches.

soup = BeautifulSoup(html_doc, 'html.parser')
# There are three links in the “three sisters” document,
# but this code only finds the first two
print([f'{type(i)}::{i.name}' for i in soup.find_all("a", limit=2)])
#> ["<class 'bs4.element.Tag'>::a", "<class 'bs4.element.Tag'>::a"]

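recursive argument

By default find_all() searches all descendants of the current tag object, collects every node that matches the given criteria, and returns them in a list. To avoid searching all descendants recursively, restrict the search with recursive: with recursive=False, only direct children are examined: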

soup = BeautifulSoup(html_doc, 'html.parser')

print([f'{type(i)}::{i.name}' for i in soup.find_all("title")])
#> ["<class 'bs4.element.Tag'>::title"]
print(
    [f'{type(i)}::{i.name}' for i in soup.find_all("title", recursive=False)])
#> []

Calling a `Tag` object

find_all() is the most commonly used search method in BeautifulSoup, so the developers provide a shortcut for it: calling a Tag object is the same as calling its find_all() method. The source code is:

def __call__(self, *args, **kwargs):
    """Calling a tag like a function is the same as calling its
        find_all() method. Eg. tag('a') returns a list of all the A tags
        found within this tag."""
    return self.find_all(*args, **kwargs)

Example:

soup("a") # same as soup.find_all("a")
soup.title(string=True) # same as soup.title.find_all(string=True)
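
The equivalence can be checked directly. A small sketch on a made-up fragment, showing that the call syntax forwards its arguments (including limit and string) to find_all():

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two links inside a paragraph.
doc = '<p><a href="/1">one</a> <a href="/2">two</a></p>'
soup = BeautifulSoup(doc, "html.parser")

# Calling the soup forwards positional and keyword arguments to find_all().
via_call = soup("a", limit=1)
via_find_all = soup.find_all("a", limit=1)
same = via_call == via_find_all

# Any Tag can be called the same way, not only the BeautifulSoup object.
strings = [str(s) for s in soup.p(string=True)]

print(same)     # True
print(strings)  # ['one', ' ', 'two']
```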

find()🔨

🔨find(name, attrs, recursive, string, **kwargs)

find() returns only the first matching object, or None if nothing matches. Navigating the parse tree by tag name (for example soup.head.title) is really a call to find().
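
The sketch below, on a small made-up document, compares find() with find_all(limit=1), shows the None result for a missing tag, and checks that navigating by tag name reaches the same node as chained find() calls:

```python
from bs4 import BeautifulSoup

# Hypothetical document for this illustration.
doc = "<html><head><title>T</title></head><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.find("p")
# find() gives the same node as the single item of find_all(limit=1).
same_as_limit = soup.find_all("p", limit=1)[0] is first_p

# No match: find() returns None, while find_all() returns an empty list.
missing = soup.find("table")
missing_all = soup.find_all("table")

# Navigating by tag name reaches the same node as chained find() calls.
by_name = soup.head.title is soup.find("head").find("title")

print(first_p.get_text(), same_as_limit, missing, missing_all, by_name)
```

Because find() can return None, chaining attribute access on its result (for example soup.find("table").get_text()) raises AttributeError when nothing matches.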

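Other search methods

To follow the methods below, cross-reference the "Navigating the parse tree" section of the note ﹝BeautifulSoup - 解析樹.md﹞ so that the structure of the parse tree is clear.
This section does not explain each method in detail; it only gives the signatures and links to the documentation.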

find_parents()&find_parent()🔨

🔨find_parents(name, attrs, string, limit, **kwargs)

🔨find_parent(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...

find_next_siblings()&find_next_sibling()🔨

🔨find_next_siblings(name, attrs, string, limit, **kwargs)

🔨find_next_sibling(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...

find_previous_siblings()&find_previous_sibling()🔨

🔨find_previous_siblings(name, attrs, string, limit, **kwargs)

🔨find_previous_sibling(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...

find_all_next()&find_next()🔨

🔨find_all_next(name, attrs, string, limit, **kwargs)

🔨find_next(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...

find_all_previous()&find_previous()🔨

🔨find_all_previous(name, attrs, string, limit, **kwargs)

🔨find_previous(name, attrs, string, **kwargs)

See: https://www.crummy.com/softwa...
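
As a rough orientation, the sketch below calls one method from each family, starting from the middle link of a shortened, made-up version of the three sisters markup. It only illustrates the direction in which each method searches:

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical variant of the three sisters markup.
doc = (
    '<p class="story"><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and <a id="link3">Tillie</a></p>'
    '<p class="end">...</p>'
)
soup = BeautifulSoup(doc, "html.parser")
link2 = soup.find(id="link2")

# Upwards: the nearest enclosing p.
parent_class = link2.find_parent("p")["class"]
# Sideways: siblings that come after or before link2 under the same parent.
next_id = link2.find_next_sibling("a")["id"]
prev_id = link2.find_previous_sibling("a")["id"]
# Document order: everything parsed after or before link2.
later_ids = [a["id"] for a in link2.find_all_next("a")]
earlier_ids = [a["id"] for a in link2.find_all_previous("a")]
following_p = link2.find_next("p")["class"]

print(parent_class, next_id, prev_id, later_ids, earlier_ids, following_p)
```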

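CSS selectors

Starting with version 4.7.0, BeautifulSoup supports most CSS4 selectors through the SoupSieve project. If you install BeautifulSoup with pip, SoupSieve is installed automatically.

The official SoupSieve documentation describes its API and the CSS selectors currently supported in detail. Besides the .select() covered in this section, the API also includes: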

.select_one()
.iselect()
.closest()
.match()
.filter()
.comments()
.icomments()
.escape()
.compile()
.purge()

In short, for complete information about SoupSieve, see its official documentation.
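
A few of these functions can be called directly from the soupsieve module, passing the selector first and the tag (or an iterable of tags) second. The sketch below is a minimal illustration on made-up markup and assumes soupsieve is installed alongside BeautifulSoup:

```python
import soupsieve as sv
from bs4 import BeautifulSoup

# Hypothetical markup for this illustration.
doc = '<p class="story"><a class="sister" id="link1">Elsie</a></p>'
soup = BeautifulSoup(doc, "html.parser")
link = soup.find(id="link1")

# match: does this tag match the selector?
is_sister = sv.match("a.sister", link)
# closest: the nearest ancestor (or the tag itself) matching the selector.
nearest_p = sv.closest("p", link)
# filter: keep only the tags in an iterable that match the selector.
sisters = sv.filter(".sister", soup.find_all(True))
# select_one: the first match under the given tag.
one = sv.select_one("#link1", soup)

print(is_sister, nearest_p["class"], [t["id"] for t in sisters], one["id"])
```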

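When learning CSS, the "jQuery selector tester" is a handy way to see what different selectors do. You can also cross-reference the note ﹝PyQuery.md﹞ and the following links:

select()🔨

.select() is available on both BeautifulSoup objects and Tag objects.

From 4.7.0 onward, .select() uses SoupSieve to collect every node that matches the CSS selector and returns them in a list.

Before 4.7.0, .select() was also available, but older versions only supported the most common CSS selectors.

Element selectors: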

print(soup.select("title"))
#> [<title>The Dormouse's story</title>]

print(soup.select("p:nth-of-type(3)"))
#> [<p class="story">...</p>]

Descendant selectors:

print(soup.select("body a"))
#> [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#>  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#>  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

print(soup.select("html head title"))
#> [<title>The Dormouse's story</title>]
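
Class, id, attribute, and child selectors work the same way, and .select_one() returns only the first match (or None). A short sketch on a reduced, made-up version of the three sisters markup:

```python
from bs4 import BeautifulSoup

# A reduced, hypothetical variant of the three sisters markup.
doc = (
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '</p>'
)
soup = BeautifulSoup(doc, "html.parser")

by_class = [a["id"] for a in soup.select("a.sister")]         # class selector
by_id = [a.get_text() for a in soup.select("#link2")]         # id selector
by_attr = [a["id"] for a in soup.select('a[href$="elsie"]')]  # attribute suffix
children = [a["id"] for a in soup.select("p.story > a")]      # direct children

first = soup.select_one("a.sister")  # first match only
nothing = soup.select_one("table")   # no match

print(by_class, by_id, by_attr, children, first["id"], nothing)
```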

For more examples, see: https://www.crummy.com/softwa...
