BeautifulSoup is a very handy third-party Python library for parsing HTML.

Installation

pip install beautifulsoup4

Import

from bs4 import BeautifulSoup

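Parsers

BeautifulSoup supports the HTML parser in Python's standard library (html.parser) by default, and it also supports several third-party parsers, such as lxml and html5lib.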
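
As a minimal sketch of choosing a parser: the second argument to the BeautifulSoup constructor names the parser to use. The example tries lxml, which is a separate package and may not be installed, and falls back to the built-in html.parser. The helper name make_soup and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse with lxml when it is available, otherwise use html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>world</b></p>")
print(soup.b.string)
```

Naming the parser explicitly keeps results consistent across machines, since different parsers can build slightly different trees from the same broken HTML.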

Usage tips

Suppose we have the following HTML page:

<!DOCTYPE html>
<!--STATUS OK-->
<html>
 <head>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta content="always" name="referrer"/>
  <link href="https://ss1.bdstatic.com/5eN1bjq8AAUYm2zgoY3K/r/www/cache/bdorz/baidu.min.css" rel="stylesheet" type="text/css"/>
  <title>
   百度一下，你就知道 </title>
 </head>
 <body link="#0000cc">
  <div id="wrapper">
   <div id="head">
    <div class="head_wrapper">
     <div id="u1">
      <a class="mnav" href="http://news.baidu.com" name="tj_trnews">
       新聞 </a>
      <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">
       hao123 </a>
      <a class="mnav" href="http://map.baidu.com" name="tj_trmap">
       地圖 </a>
      <a class="mnav" href="http://v.baidu.com" name="tj_trvideo">
       視頻 </a>
      <a class="mnav" href="http://tieba.baidu.com" name="tj_trtieba">
       貼吧 </a>
      <a class="bri" href="//www.baidu.com/more/" name="tj_briicon" style="display: block;">
       更多產品 </a>
     </div>
    </div>
   </div>
  </div>
 </body>
</html>
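
The snippets that follow assume the page source is stored in a string variable named html. Here is a minimal sketch using a shortened copy of the page. The triple-quoted literal is only a stand-in for however you actually obtain the page, for example by reading a file or an HTTP response.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample page; whitespace inside the tags is trimmed.
html = """
<html>
 <head><title>百度一下，你就知道</title></head>
 <body>
  <div id="u1">
   <a class="mnav" href="http://news.baidu.com" name="tj_trnews">新聞</a>
   <a class="mnav" href="https://www.hao123.com" name="tj_trhao123">hao123</a>
  </div>
 </body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")
print(bs.title.string)
```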

# Create the BeautifulSoup object (html holds the page source above as a string)
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

# Pretty-print with indentation
print(bs.prettify())

# Get the whole title tag
print(bs.title)

# Get the name of the title tag
print(bs.title.name)

# Get the text inside the title tag
print(bs.title.string)

# Get the whole head tag
print(bs.head)

# Get the first div tag and everything inside it
print(bs.div)

# Get the id value of the first div tag
print(bs.div["id"])

# Get the first a tag
print(bs.a)

# Get all a tags
print(bs.find_all("a"))

# Get the tag with id="u1"
print(bs.find(id="u1"))

# Get all a tags and print each one's href value
for item in bs.find_all("a"):
    print(item.get("href"))

# Get all a tags and print each one's text
for item in bs.find_all("a"):
    print(item.get_text())  # same as item.string here, since each a tag contains only text
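
get_text() and .string agree only when a tag has a single text child. The sketch below, on a made-up snippet, shows where they differ.

```python
from bs4 import BeautifulSoup

snippet = '<p>Visit <a href="http://map.baidu.com">地圖</a> now</p>'
soup = BeautifulSoup(snippet, "html.parser")

# <a> has exactly one text child, so both accessors agree.
print(soup.a.string)      # 地圖
print(soup.a.get_text())  # 地圖

# <p> has three children (text, <a>, text), so .string is None,
# while get_text() joins all the text underneath it.
print(soup.p.string)      # None
print(soup.p.get_text())  # Visit 地圖 now
```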

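The four kinds of BeautifulSoup4 objects

BeautifulSoup4 converts a complex HTML document into a tree in which every node is a Python object. All of these objects fall into four kinds:

Tag, NavigableString, BeautifulSoup, Comment

Tag: put simply, a Tag is one of the tags in the HTML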

from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")

# Get the whole title tag
print(bs.title)

# Get the whole head tag
print(bs.head)

# Get the first a tag
print(bs.a)

# Its type
print(type(bs.a))

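Writing bs followed by a tag name is an easy way to get these tags; the objects returned are of type bs4.element.Tag.
Note, however, that this returns only the first matching tag in the whole document.

A Tag has two important attributes, name and attrs: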
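
To illustrate the "first match only" behaviour, here is a small sketch on a made-up snippet. Attribute access such as soup.a returns the same object as find("a"), while find_all("a") returns every match.

```python
from bs4 import BeautifulSoup

snippet = '<div><a href="/one">1</a><a href="/two">2</a></div>'
soup = BeautifulSoup(snippet, "html.parser")

first = soup.a                    # attribute access: first <a> only
print(first["href"])              # /one
print(first is soup.find("a"))    # True

hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)                      # ['/one', '/two']
```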

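# [document]
# The bs object itself is special: its name is [document]
print(bs.name)
# head
# For any other tag, the value is the name of the tag itself
print(bs.head.name)

# Print all attributes of the first a tag; the result is a dict
print(bs.a.attrs)  # commonly used

# Read one attribute by name; bs.a.get('class') returns the same value,
# but yields None instead of raising KeyError when the attribute is missing
print(bs.a['class'])

# Attributes and content can also be modified
bs.a['class'] = "newClass"
print(bs.a)

# An attribute can also be deleted
del bs.a['class']
print(bs.a)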
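
A short sketch of the two ways to read an attribute, using a made-up tag. It also shows that class comes back as a list, because an element can carry several classes.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a class="mnav top" href="/news">News</a>', "html.parser")
link = soup.a

print(link["class"])      # ['mnav', 'top']  (class is multi-valued)
print(link["href"])       # /news
print(link.get("title"))  # None: get() tolerates a missing attribute

try:
    link["title"]         # [] raises KeyError for a missing attribute
except KeyError:
    print("no title attribute")
```

Prefer get() when an attribute may be absent, as href often is on anchor tags in real pages.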

NavigableString: use .string to get the text inside a tag (commonly used)

print(bs.title.string)
print(type(bs.title.string))

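BeautifulSoup: represents the document as a whole

Most of the time it can be treated as a special kind of Tag. We can get its type, name, and attributes, for example:

print(type(bs))  # <class 'bs4.BeautifulSoup'>
print(bs.name)   # [document]
print(bs.attrs)  # {}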

Comment: a special type of NavigableString whose output does not include the comment markers

print(bs.a)  # for this example the a tag must contain no spaces or line breaks, like this:
# <a class="mnav" href="http://news.baidu.com" name="tj_trnews"><!--新聞--></a>

print(bs.a.string) # 新聞
print(type(bs.a.string)) # <class 'bs4.element.Comment'>
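
Because .string prints a comment without its markers, commented-out text can be mistaken for visible text. One way to guard against that is an isinstance check, sketched here; the helper name visible_text and the markup are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

soup = BeautifulSoup("<a><!--新聞--></a><b>貼吧</b>", "html.parser")


def visible_text(tag):
    """Return tag.string unless it is missing or an HTML comment."""
    s = tag.string
    if s is None or isinstance(s, Comment):
        return None
    return str(s)


print(visible_text(soup.a))  # None
print(visible_text(soup.b))  # 貼吧
```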

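Traversing the document tree

.contents: gets all direct children of a Tag and returns a list

# The .contents attribute of a tag returns its children as a list
print(bs.head.contents)

# Use a list index to get one of its elements
print(bs.head.contents[1])

.children: gets all direct children of a Tag and returns an iterator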

for child in bs.body.children:
    print(child)
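
Both .contents and .children stop at direct children. The sketch below uses a made-up list to compare them with the related .descendants attribute, which also walks into nested tags.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>a</li><li><b>b</b></li></ul>", "html.parser")
ul = soup.ul

print(len(ul.contents))                       # 2 direct children
print([child.name for child in ul.children])  # ['li', 'li']

# .descendants also visits nested tags and text nodes;
# text nodes have name None, so they are filtered out here.
names = [node.name for node in ul.descendants if node.name]
print(names)                                  # ['li', 'li', 'b']
```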

Searching the document tree

find_all(name, attrs, recursive, text, **kwargs)

The name argument:

String filter: finds tags whose name matches the string exactly and returns those tags

a_list = bs.find_all("a")
print(a_list)

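Regular expression filter: if you pass a compiled regular expression, BeautifulSoup4 matches tag names against it using search()

import re

# Matches every tag whose name contains "a", such as a, head and meta
t_list = bs.find_all(re.compile("a"))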
for item in t_list:
    print(item)

List: if you pass a list, BeautifulSoup4 returns the nodes that match any element of the list

t_list = bs.find_all(["meta","link"])
for item in t_list:
    print(item)

Function: pass a function, and tags are matched according to what it returns

def name_is_exists(tag):
    return tag.has_attr("name")

t_list = bs.find_all(name_is_exists)
for item in t_list:
    print(item)

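Keyword arguments (kwargs)

import re

# Find tags with id="head"
t_list = bs.find_all(id="head")
print(t_list)

# Find tags whose href matches http://news.baidu.com
t_list = bs.find_all(href=re.compile("http://news.baidu.com"))
print(t_list)

# Find all tags that have a class attribute (class is a Python keyword, so the argument is spelled class_)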
t_list = bs.find_all(class_=True)
for item in t_list:
    print(item)

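The attrs argument

Not every attribute can be searched with keyword arguments. HTML data-* attributes are one example, because their names are not valid Python argument names. For those, pass a dict through the attrs argument to find tags with such attributes: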

t_list = bs.find_all(attrs={"data-foo":"value"})
for item in t_list:
    print(item)
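
The sample page has no data-* attributes, so here is a self-contained sketch with made-up markup that shows the attrs lookup returning a match.

```python
from bs4 import BeautifulSoup

snippet = '<div data-foo="value">match</div><div data-foo="other">skip</div>'
soup = BeautifulSoup(snippet, "html.parser")

# data-foo is not a valid Python keyword-argument name, so use attrs.
matches = soup.find_all(attrs={"data-foo": "value"})
print([tag.get_text() for tag in matches])  # ['match']
```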

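The text argument

The text argument searches the string content of the document instead of tags. Like name, it accepts a string, a regular expression, or a list.
Beautiful Soup 4.4.0 and later name this argument string; text is the older spelling.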

t_list = bs.find_all(text="hao123")
for item in t_list:
    print(item)

t_list = bs.find_all(text=["hao123", "地圖", "貼吧"])
for item in t_list:
    print(item)

t_list = bs.find_all(text=re.compile(r"\d"))  # strings containing a digit; requires import re
for item in t_list:
    print(item)
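
One caveat worth sketching: a plain string filter must equal the whole text node, including any surrounding whitespace, so it can miss text that is padded with spaces or line breaks as in the sample page. The markup below is made up, and the example assumes a Beautiful Soup release that accepts the string spelling of the argument.

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>\n hao123 </a><a>地圖</a>", "html.parser")

exact = soup.find_all(string="hao123")              # padding prevents a match
loose = soup.find_all(string=re.compile("hao123"))  # search() ignores padding

print(len(exact))                  # 0
print([s.strip() for s in loose])  # ['hao123']
```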

The limit argument

Pass limit to cap the number of results returned.
For example, the following returns at most 2 results:

t_list = bs.find_all("a",limit=2)
for item in t_list:
    print(item)

find()

Returns the first Tag that matches the conditions.
Use this method when you only need a single result.

t = bs.div.div

# equivalent to
t = bs.find("div").find("div")
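
find() returns None when nothing matches, so chained lookups can fail on real pages. Here is a minimal sketch with made-up markup that contrasts it with find_all(limit=1) and shows a simple guard.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div id="head"><a href="/x">x</a></div>', "html.parser")

link = soup.find("a")
print(link["href"])                 # /x
print(soup.find_all("a", limit=1))  # [<a href="/x">x</a>]  (a list)

missing = soup.find("table")        # there is no <table> in this document
print(missing)                      # None
href = missing["href"] if missing is not None else None
print(href)                         # None
```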

CSS selectors

# Find by tag name
print(bs.select('title'))
print(bs.select('a'))

# Find by class name
print(bs.select('.mnav'))

# Find by id
print(bs.select('#u1'))

# Combined selectors
print(bs.select('div .bri'))

# Find by attribute
print(bs.select('a[class="bri"]'))
print(bs.select('a[href="http://tieba.baidu.com"]'))

# Get the text content
t_list = bs.select("title")
print(t_list[0].get_text())
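
select() always returns a list. When only the first match is needed, select_one() is the counterpart of find(). The sketch below uses made-up markup modelled on the sample page.

```python
from bs4 import BeautifulSoup

snippet = (
    '<div id="u1">'
    '<a class="mnav" href="/a">A</a>'
    '<a class="bri" href="/b">B</a>'
    "</div>"
)
soup = BeautifulSoup(snippet, "html.parser")

links = soup.select("#u1 > a.mnav")  # list of matches
print([a["href"] for a in links])    # ['/a']

first = soup.select_one("#u1 a")     # first match, or None
print(first.get_text())              # A
print(soup.select_one("table"))      # None
```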
