Python 3 Web Scraping from Scratch: Using Beautiful Soup

Chinese documentation: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id49

Basic usage

Example 1:

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = 'title' name = "dromouse"><b>The Dormouse's story</b></p>
<p class = "story">Once upon a time there were three little sisters; and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')  # parse the string with the lxml parser
print(soup.prettify())              # formatted markup with the missing tags filled in
print(type(soup))                   # the BeautifulSoup object type
print(soup.title.string)            # text inside the <title> node

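Note 1:

The string passed in is not a complete HTML document: closing tags such as </body> are missing. BeautifulSoup repairs the markup automatically when the object is initialized, which is why the prettify() output is well-formed.

Note 2:

soup.prettify() returns the parse tree as a formatted string with each XML/HTML tag on its own line; see https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/#id49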
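
To make the two notes concrete, here is a minimal sketch of the automatic repair on a deliberately broken fragment. The fragment and variable names are made up for illustration, and the sketch assumes Python's built-in html.parser so that it runs without lxml installed (with 'lxml', as in the example above, the result would additionally be wrapped in html and body tags).

```python
from bs4 import BeautifulSoup

# Deliberately incomplete markup: neither <b> nor <p> is closed.
broken = "<p>Hello<b>world"

# Assumption: html.parser is used here instead of the article's 'lxml'.
soup = BeautifulSoup(broken, "html.parser")

# Converting the soup back to a string shows the closing tags that were added.
repaired = str(soup)
print(repaired)  # <p>Hello<b>world</b></p>

# prettify() returns the same tree as an indented, multi-line string.
pretty = soup.prettify()
print(pretty)
```

The repair happens while the object is constructed; prettify() only changes how the already repaired tree is printed.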

Node selectors

Example 2: Extracting information and nested selection

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = 'title' name = "dromouse"><b>The Dormouse's story</b></p>
<p class = "story">Once upon a time there were three little sisters; and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

# Select nodes
print('soup.title:', soup.title)  # select an element by tag name
print('soup.a:', soup.a)          # only the first <a> node is returned

# Get the tag name
print('soup.title.name:', soup.title.name)  # .name returns the tag name

# Get attributes
print('soup.p.attrs:', soup.p.attrs)                  # .attrs returns all attributes as a dict
print('soup.p.attrs["name"]:', soup.p.attrs['name'])  # look up a single attribute
print('soup.p["name"]:', soup.p['name'])              # shorter equivalent

# Get the text content
print('soup.p.string:', soup.p.string)

# Nested selection
print('soup.head.title:', soup.head.title)

Example 3: Child nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters; and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

for i, content in enumerate(soup.p.contents):
    print(i, content)

for i, child in enumerate(soup.p.children):
    print(i, child)

# Both loops print the same items.
print('contents:', soup.p.contents)
print('children:', soup.p.children)
print('type of contents:', type(soup.p.contents))
print('type of children:', type(soup.p.children))

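Note: both the contents attribute and the children attribute return the direct child nodes, but they differ in type: contents is a list, while children is an iterator.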
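
The practical consequence of that difference can be sketched as follows. The one-line markup and the variable names are invented for this illustration, and the built-in html.parser is assumed so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with three direct children: text, <b>, text.
soup = BeautifulSoup("<p>one<b>two</b>three</p>", "html.parser")

contents = soup.p.contents  # a real list: supports len() and indexing
children = soup.p.children  # an iterator: must be consumed by looping or list()

print(type(contents).__name__, len(contents))

first_pass = list(children)   # consumes the iterator
second_pass = list(children)  # nothing left the second time
print(len(first_pass), len(second_pass))
```

Use contents when the children need to be indexed or counted, and children when a single pass over them is enough.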

Example 4: Descendant nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters; and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
<a href = "http://example.com/lacie" class = "sister" id = "link2">Lacie</a> and
<a href = "http://example.com/tillie" class = "sister" id = "link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class = "story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

# descendants walks children, grandchildren, and so on, in document order
for i, content in enumerate(soup.p.descendants):
    print(i, content)

print('type:', type(soup.p.descendants))

Example 5: Parent and ancestor nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class = "story">Once upon a time there were three little sisters; and
their names were
<a href = "http://example.com/elsie" class = "sister" id = "link1"><!--Elsie--></a>,
and they lived at the bottom of a well.</p>
<p class = "story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

print(list(enumerate(soup.p.parents)))  # every ancestor of the first <p>
print(soup.p.parent)                    # the direct parent only

# Compare the types of the navigation attributes
print(type(soup.p.parents))
print(type(soup.p.children))
print(type(soup.p.contents))
print(type(soup.p.descendants))

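Note 1:

The parent attribute returns the direct parent node; the parents attribute yields all ancestor nodes.

Note 2:

Pay attention to the type of each attribute: contents is a list, whereas children, descendants, and parents are iterators.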
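
A minimal sketch of both notes, using a small made-up fragment so the ancestor chain is easy to read. The markup and variable names are illustrative, and the built-in html.parser is assumed rather than the article's 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical three-level fragment: <div> contains <p> contains <a>.
soup = BeautifulSoup("<div><p><a>link</a></p></div>", "html.parser")
a = soup.a

parent_name = a.parent.name                       # direct parent only
ancestor_names = [tag.name for tag in a.parents]  # every ancestor, nearest first

print(parent_name)
print(ancestor_names)

# contents is a list; parents behaves as a one-shot iterator.
parents_iter = a.parents
print(isinstance(soup.div.contents, list))
print(iter(parents_iter) is parents_iter)
```

The last entry produced by parents is the BeautifulSoup object itself, whose name is "[document]".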

Example 6: Sibling nodes

from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p>hahaha</p>
Learning
<a>C++</a>
HELLO
<a>Java</a>
World
<a>Python</a>
<a>JS</a>
<p class = "story">...</p>
"""

soup = BeautifulSoup(html, 'lxml')

# soup.a is the first <a> node; text between tags also counts as a sibling
print('Next Sibling:', soup.a.next_sibling)
print('Prev Sibling:', soup.a.previous_sibling)
print('Next Siblings', list(enumerate(soup.a.next_siblings)))
print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

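Example 7: Method selectors

The signature of find_all() is: find_all(name, attrs, recursive, text, **kwargs). Since Beautiful Soup 4.4.0 the text argument is also available under the name string.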

from bs4 import BeautifulSoup

html = """
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.find_all(name='ul'))  # query elements by tag name
print()
print(soup.find_all(attrs={'class': 'element'}))  # query by attributes; attrs takes a dict
print()
print(soup.find_all(id='list2'))  # alternative: pass the attribute as a keyword argument
print()
print(soup.find_all(class_='element'))  # class is a Python keyword, so use class_ with a trailing underscore

Addendum: besides find_all(), there is also a find() method, which returns only the first matching element, or None when nothing matches.
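
The difference between the two methods can be sketched on a cut-down version of the list above. The shortened markup and the variable names are chosen for this illustration, and the built-in html.parser is assumed so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the list used in Example 7.
html = '<ul id="list1"><li class="element">C++</li><li class="element">Java</li></ul>'
soup = BeautifulSoup(html, "html.parser")

all_items = soup.find_all("li", class_="element")  # every match, as a list
first_item = soup.find("li", class_="element")     # the first match only
missing = soup.find("table")                       # no match: None
no_matches = soup.find_all("table")                # no match: empty list

print([li.string for li in all_items])  # ['C++', 'Java']
print(first_item.string)                # C++
print(missing, no_matches)
```

Because find() can return None, check its result before reading attributes such as .string from it.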

Example 8: CSS selectors

from bs4 import BeautifulSoup

html = """
<div class = "C1">
<div class = "C2">
<h1>Hello</h1>
</div>
<div class = "C3">
<ul class = "U1" id = "list1">
<li class = "element">C++</li>
<li class = "element">Java</li>
</ul>
<ul class ="U2" id = "list2">
<li class = "element">Python</li>
</ul>
</div>
</div>
"""

soup = BeautifulSoup(html, 'lxml')

print(soup.select('.U1'))  # select by CSS class
print()
print(soup.select('ul li')[0])  # descendant selector: first <li> inside a <ul>
print()
print(soup.select('ul li')[0].attrs['class'])  # read an attribute
print()
print(soup.select('li')[2].string)  # get the text
print()
print(soup.select('li')[2].get_text())  # another way to get the text
