Chinese documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id49
Basic usage
Example 1:
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = 'title' name = "dromouse"><b>The Dormouse's story</b></p>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(type(soup))
print(soup.title.string)
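Result:
Note 1:
As you can see, the input is not a complete HTML string: closing tags such as </body> are missing. BeautifulSoup repaired the markup automatically when the object was initialized.
Note 2:
The soup.prettify() method formats the parsed XML/HTML so that each tag and each string sits on its own line: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id49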
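The automatic repair described in Note 1 is easier to see on a much smaller input. The following is a minimal sketch rather than part of the original example: the fragment in `broken` is made up for illustration, and it assumes Python's built-in html.parser instead of lxml so that it runs without extra dependencies. With html.parser the open tags are closed, but no <html> or <body> wrapper is added the way lxml adds one.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: neither <p> nor <b> is closed.
broken = "<p>Hello<b>world"

# html.parser is an assumption made for this sketch; the post itself uses lxml.
soup = BeautifulSoup(broken, "html.parser")

repaired = str(soup)
print(repaired)        # the missing closing tags have been added
print(soup.b.string)   # the text inside <b> is still reachable
```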
Node selectors
Example 2: extracting information and nested selection
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = 'title' name = "dromouse"><b>The Dormouse's story</b></p>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
# Get nodes
print('soup.title:',soup.title) # select an element by tag name
print('soup.a:',soup.a) # returns only the first a node
# Get the name
print('soup.title.name:',soup.title.name) # .name returns the tag name
# Get attributes
print('soup.p.attrs:',soup.p.attrs) # .attrs returns all attributes as a dict
print('soup.p.attrs["name"]:',soup.p.attrs['name']) # get one specific attribute
print('soup.p["name"]:',soup.p['name']) # shorter form
# Get content
print('soup.p.string:',soup.p.string)
# Nested selection
print('soup.head.title:',soup.head.title)
Result:
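Two details of attribute and text access are easy to miss in Example 2, so here is a small supplementary sketch. The markup is a shortened, made-up variant of the example above, html.parser is assumed instead of lxml, and the variable names are illustrative only. It shows that class comes back as a list, that .get() avoids a KeyError for a missing attribute, and that .string is None when a tag has more than one child.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical markup: the second <p> has two children.
html = ('<p class="title" name="dromouse"><b>The Dormouse\'s story</b></p>'
        '<p>one<b>two</b></p>')
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption

first_p = soup.p
second_p = soup.find_all("p")[1]

classes = first_p["class"]     # class is multi-valued, so this is a list
name = first_p["name"]         # ordinary attributes are plain strings
missing = first_p.get("id")    # .get() returns None instead of raising KeyError
single = first_p.string        # one child chain, so the text is returned
ambiguous = second_p.string    # two children, so .string is None

print(classes, name, missing, single, ambiguous)
```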
Example 3: child nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
for i,content in enumerate(soup.p.contents):
    print(i,content)
for i,child in enumerate(soup.p.children):
    print(i,child)
# Both loops print the same items.
print('contents:',soup.p.contents)
print('children:',soup.p.children)
print('type of contents:',type(soup.p.contents))
print('type of children:',type(soup.p.children))
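Result:
Note: both the contents attribute and the children attribute give the direct child nodes, but they differ in type: contents is a list, while children is an iterator.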
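The difference between the two attributes can be sketched as follows. This is an illustration under assumptions: the one-line markup is invented, and html.parser is used instead of lxml. A list can be indexed and reused, whereas the iterator behind children is consumed as it is read.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup with three direct children of <p>.
soup = BeautifulSoup("<p>one<a>two</a>three</p>", "html.parser")

contents = soup.p.contents   # a list: supports len() and indexing
children = soup.p.children   # an iterator: must be looped over

print(len(contents), contents[1])

first = next(children)       # consumes the first child
rest = list(children)        # consumes the remaining children
exhausted = list(children)   # nothing is left on a second pass

print(first, rest, exhausted)
```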
Example 4: descendant nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
for i,content in enumerate(soup.p.descendants):
    print(i,content)
print('type:',type(soup.p.descendants))
Result:
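To make the contrast with Example 3 concrete, the sketch below counts direct children and descendants of the same tag. The nested markup is invented for illustration and html.parser is assumed instead of lxml. The descendants are visited in document order, depth first.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: <p> has two direct children but four descendants.
soup = BeautifulSoup("<p><a><b>x</b></a>y</p>", "html.parser")

children = list(soup.p.children)
descendants = list(soup.p.descendants)

# Label tags by name and text nodes by their text.
labels = [node.name if isinstance(node, Tag) else str(node)
          for node in descendants]

print(len(children), len(descendants))
print(labels)
```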
Example 5: parent and ancestor nodes
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters;and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
and they lived at the bottom of a well.</p>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.p.parents)))
print(soup.p.parent)
print(type(soup.p.parents))
print(type(soup.p.children))
print(type(soup.p.contents))
print(type(soup.p.descendants))
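Result:
Note 1:
The parent attribute returns the direct parent node; the parents attribute yields all ancestor nodes.
Note 2:
Pay attention to the type of each attribute: contents is a list, while children, parents, and descendants are iterators (parents and descendants are generators).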
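The relationship between parent and parents can be sketched on a small tree. The markup here is invented and html.parser is assumed instead of lxml. The last ancestor is the BeautifulSoup object itself, whose name is '[document]'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a single chain of nested tags.
soup = BeautifulSoup("<html><body><p><a>x</a></p></body></html>", "html.parser")

a = soup.a
parent_name = a.parent.name                            # direct parent only
ancestor_names = [ancestor.name for ancestor in a.parents]  # walks up to the root

print(parent_name)
print(ancestor_names)
```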
Example 6: sibling selectors
from bs4 import BeautifulSoup
html ="""
<html><head><title>The Dormouse's story</title></head>
<body>
<p>hahaha</p>
Learning
<a>C++</a>
HELLO
<a>Java</a>
World
<a>Python</a>
<a>JS</a>
<p class = "story">...<p>
"""
soup = BeautifulSoup(html,'lxml')
print('Next Sibling:',soup.a.next_sibling)
print('Prev Sibling:',soup.a.previous_sibling)
print('Next Siblings',list(enumerate(soup.a.next_siblings)))
print('Prev Siblings',list(enumerate(soup.a.previous_siblings)))
Result:
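In Example 6 the siblings of a tag include the bare text between tags, not only other tags. When only tags are wanted, find_next_sibling() and find_next_siblings() can filter by name. The sketch below uses invented markup and assumes html.parser instead of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text sits between the first and second <a>.
soup = BeautifulSoup("<p><a>C++</a> and <a>Java</a><a>Python</a></p>", "html.parser")

first = soup.a
raw_next = first.next_sibling                     # the text node, not a tag
next_tag = first.find_next_sibling("a")           # skips the text node
later_tags = [t.string for t in first.find_next_siblings("a")]
nothing_before = first.previous_sibling           # None: first child of <p>

print(repr(raw_next), next_tag, later_tags, nothing_before)
```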
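Example 7: method selectors
The find_all() API is: find_all(name, attrs, recursive, text, **kwargs). Since Beautiful Soup 4.4.0 the text argument is named string; text is the older name.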
from bs4 import BeautifulSoup
html ="""
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""
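soup = BeautifulSoup(html,'lxml')
print(soup.find_all(name='ul')) # search by tag name
print()
print(soup.find_all(attrs={'class':'element'})) # search by attributes; attrs takes a dict
print()
print(soup.find_all(id='list2')) # alternative: pass the attribute as a keyword argument
print()
print(soup.find_all(class_='element')) # class is a Python keyword, so remember the trailing underscore
Result:
Supplement: besides find_all(), there is also find(), which returns the first matching element (or None if nothing matches).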
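The difference in return values between find() and find_all() can be sketched as follows. The markup is a cut-down, made-up version of the example above, html.parser is assumed instead of lxml, and the variable names are illustrative. The limit argument of find_all() is also shown, since it stops the search after the given number of matches.

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical version of the list markup used above.
html = ('<ul id="list1"><li class="element">C++</li>'
        '<li class="element">Java</li></ul>')
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption

first_li = soup.find("li", class_="element")     # a single Tag
all_li = soup.find_all("li", class_="element")   # a list-like ResultSet
limited = soup.find_all("li", limit=1)           # stop after one match

no_match = soup.find("table")        # None when nothing matches
no_matches = soup.find_all("table")  # empty result when nothing matches

print(first_li.string, [li.string for li in all_li], len(limited))
print(no_match, len(no_matches))
```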
Example 8: CSS selectors
from bs4 import BeautifulSoup
html ="""
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""
soup = BeautifulSoup(html,'lxml')
print(soup.select('.U1')) # select by class
print()
print(soup.select('ul li')[0]) # descendant selector: li inside ul
print()
print(soup.select('ul li')[0].attrs['class']) # get an attribute
print()
print(soup.select('li')[2].string) # get the text
print()
print(soup.select('li')[2].get_text()) # another way to get the text
Result:
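A few more selector forms can be sketched on the same kind of markup. This is a supplementary illustration, not part of the original example: html.parser is assumed instead of lxml, and the variable names are made up. It uses select_one(), which returns the first match or None, together with an id selector and the child combinator.

```python
from bs4 import BeautifulSoup

# Same structure as the example above, trimmed to the two lists.
html = """
<div class="C3">
<ul class="U1" id="list1">
<li class="element">C++</li>
<li class="element">Java</li>
</ul>
<ul class="U2" id="list2">
<li class="element">Python</li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")  # parser choice is an assumption

# Child combinator: only li elements directly under ul.U1.
first_list = [li.get_text() for li in soup.select("ul.U1 > li")]

# Id selector combined with a descendant selector; first match only.
second = soup.select_one("#list2 li")

# select_one() returns None when no element matches.
absent = soup.select_one("ul.U3")

print(first_list, second.get_text(), absent)
```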