BeautifulSoup is a Python library for extracting data from HTML or XML. For HTML that is not well-formed, lxml offers two useful options: the lxml.html module and the BeautifulSoup parser.

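Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and hands you the data you want to scrape. Because it is simple, a complete application takes very little code.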

Install command for BeautifulSoup4 (this one uses the Tsinghua PyPI mirror):

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple BeautifulSoup4

Here is a sample file to learn with, named hezhi.html. Its contents are given at the bottom of this post; copy them and save them as hezhi.html.

A minimal BeautifulSoup example:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<p>This is an HTML p tag</p>", "html.parser")
print(soup)

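Now for the demonstration with hezhi.html:

First, put hezhi.html in a fixed location. If you are not comfortable with file paths, just put it on the D drive.

I use the D drive as the example here: the file sits in the root of D:.

Let's start experimenting.

First: the function prettify()

prettify() means "beautify": it formats the output with one tag per line and indentation, instead of printing the contents of hezhi.html as one hard-to-read blob.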

from bs4 import BeautifulSoup

# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Parse the markup and print it in prettified form
soup = BeautifulSoup(demo, "html.parser")
print(soup.prettify())
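
To try prettify() without a file on disk, here is a minimal sketch that parses an inline string instead. The markup in html is made up for illustration and is not part of hezhi.html.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line markup, standing in for a file read from disk
html = "<div><p>Hello</p><a href='#'>link</a></div>"

soup = BeautifulSoup(html, "html.parser")

# prettify() returns a string with each tag and text node on its own line
pretty = soup.prettify()
print(pretty)
```

The input is a single line, while the returned string spans several lines, which is what makes large pages easier to read.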

Second: this is where it starts to feel like data analysis. We look up information tag by tag.

1. Find the first tag of a given type

from bs4 import BeautifulSoup

# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Extract the first tag of each type
soup = BeautifulSoup(demo, "html.parser")
print(soup.a)      # first a tag
print(soup.div)    # first div tag
print(soup.input)  # first input tag
print(soup.h1)     # first h1 tag
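
One detail worth knowing about the soup.tagname shorthand: it behaves like find() with that tag name, and it gives back None when no such tag exists. A small sketch, using made-up inline markup rather than hezhi.html:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two div tags and no a tag
html = "<div class='first'>one</div><div class='second'>two</div>"
soup = BeautifulSoup(html, "html.parser")

first_div = soup.div         # attribute shorthand
same_div = soup.find("div")  # explicit call, same element
missing = soup.a             # no a tag in this markup

print(first_div.string)
print(first_div is same_div)
print(missing)
```

Because a missing tag yields None, chaining something like soup.a.attrs on a page without an a tag raises AttributeError, so it is safer to check the result first.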

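2. To find all tags of one type, use the function find_all(tag type)

The code below extracts input and div tags; other tags work the same way.

The return value behaves like a list and is indexed, so you can quickly print the first or second tag.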

from bs4 import BeautifulSoup

# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Use find_all to extract every tag of one type
soup = BeautifulSoup(demo, "html.parser")

print(soup.find_all("input"))

print(soup.find_all("div"))

# Print the first and second input tags
print(soup.find_all("input")[0])

print(soup.find_all("input")[1])
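
Since the result of find_all() behaves like a list, the usual list operations apply. The sketch below shows len(), iteration, and the limit argument on an inline form modeled on the one in hezhi.html; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Inline form modeled on the one in hezhi.html
html = """
<form>
  <input class="user" type="text">
  <input class="pwd" type="text">
  <input type="submit">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

inputs = soup.find_all("input")
count = len(inputs)                          # the result has a length
types = [tag.get("type") for tag in inputs]  # and can be iterated

first_two = soup.find_all("input", limit=2)  # stop after two matches
nothing = soup.find_all("table")             # no match gives an empty result

print(count, types, len(first_two), len(nothing))
```

Indexing an empty result, as in soup.find_all("table")[0], raises IndexError, so checking the length before indexing avoids surprises on pages you do not control.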

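3. Extract tags by their distinguishing attributes: this uses a dict such as attrs = { "class":"third","name":"three" }, where each attribute name maps to the value it must have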

from bs4 import BeautifulSoup

# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Use specific attribute values to narrow down tags of one type
soup = BeautifulSoup(demo, "html.parser")

# Find div tags by attribute
print(soup.find_all("div", attrs={"class": "first"}))

print(soup.find_all("div", attrs={"class": "third", "name": "three"}))

print(soup.find_all("div", attrs={"class": "third", "name": "three"})[0])

# Find input tags by attribute
print(soup.find_all("input", attrs={"class": "user"})[0])

print(soup.find_all("input", attrs={"type": "text"})[0])

print(soup.find_all("input", attrs={"type": "text"})[1])
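
The attrs dict is not the only way to filter. As a sketch, the class_ keyword argument can replace {"class": ...}, while the HTML attribute called name still needs the dict, because find_all() already uses name for the tag name itself. The markup below is made up for illustration, including the extra "third big" div.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the last div carries two classes
html = (
    '<div class="first">1</div>'
    '<div class="third" name="three">3</div>'
    '<div class="third big">4</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Every key in the dict must match, so only one div qualifies
by_dict = soup.find_all("div", attrs={"class": "third", "name": "three"})

# class_ matches any single class in a multi-class attribute
by_keyword = soup.find_all("div", class_="third")

print(len(by_dict), len(by_keyword))
print(by_dict[0]["name"])
```

The trailing underscore in class_ is there because class is a reserved word in Python.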

4. A few commonly used properties of a tag; other tags work the same way as shown below

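from bs4 import BeautifulSoup

# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Print the properties of a tag
soup = BeautifulSoup(demo, "html.parser")

print(soup.div.attrs)           # the tag's attributes, as a dict
print(soup.div.attrs["class"])  # the class value (a list, since class can hold several names)

print(soup.a.attrs)
print(soup.a.attrs["href"])     # the link target of the a tag

print(soup.div.name)    # prints div; soup.a.name would print a
print(soup.div.string)  # the tag's single child string; a lone comment is returned too
print(soup.div.text)    # all text inside the tag, with comments left out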
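
The difference between .string and .text is easiest to see on tags that contain comments. The following is a small sketch on made-up markup, not on hezhi.html; the names only_comment and mixed are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: one p holds only a comment, the other text plus a comment
html = (
    '<p name="P1"><!--a comment--></p>'
    '<p name="P2">visible<!--hidden--></p>'
)
soup = BeautifulSoup(html, "html.parser")

only_comment = soup.find("p", attrs={"name": "P1"})
mixed = soup.find("p", attrs={"name": "P2"})

# A single child is returned by .string even when it is a comment
print(isinstance(only_comment.string, Comment), only_comment.string)

# .text skips comments, so a comment-only tag has empty text
print(repr(only_comment.text))

# With more than one child, .string gives None while .text still works
print(mixed.string, mixed.text)
```

Note that whitespace counts as a child too: in hezhi.html the P2 paragraph has spaces around its comment, so its .string is None rather than the comment.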

5. A small step up: find the content whose class is second

from bs4 import BeautifulSoup

# Read hezhi.html; the with block closes the file automatically
with open("D:\\hezhi.html", "r", encoding="utf-8") as file:
    demo = file.read()

# Print the text and name of the div whose class is second
soup = BeautifulSoup(demo, "html.parser")

print(soup.find_all("div", attrs={"class": "second"})[0].string)
print(soup.find_all("div", attrs={"class": "second"})[0].name)

# The page has only one div with class second, so the code below is equivalent;
# find() returns the first match
print(soup.find("div", attrs={"class": "second"}).string)
print(soup.find("div", attrs={"class": "second"}).name)
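
find() returns None when nothing matches, so calling .string on the result directly can fail on pages that lack the tag. One possible way to guard against that is sketched below; div_text is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Inline markup modeled on the second div of hezhi.html
html = '<div class="second">I am the second div</div>'
soup = BeautifulSoup(html, "html.parser")


def div_text(soup, css_class):
    """Return the text of the first div with this class, or None if absent."""
    tag = soup.find("div", attrs={"class": css_class})
    if tag is None:
        return None
    return tag.get_text(strip=True)


found = div_text(soup, "second")
absent = div_text(soup, "fourth")  # no such class in this markup

print(found)
print(absent)
```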

A summary of what I have learned about BeautifulSoup:

In soup.Tag, Tag stands for any tag name, for example soup.a, soup.div, soup.h1, soup.input

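Tag expression	Meaning	Example
soup.Tag	The first tag of that type, with everything inside it	soup.div, soup.a
soup.Tag.attrs	The tag's attributes as a dict (attrs is short for attributes)	soup.a.attrs
soup.Tag.string	The tag's text content	a string
soup.prettify()	Pretty-prints the scraped document	soup.prettify()
soup.Tag.name	The tag's name	div, a
soup.find_all("Tag.name")	All tags with that name, with index values	soup.find_all("input")
soup.find_all("input",attrs={"type":"text"})	All tags matching the name and the attribute dict; list-like, with indexes and a length; call form is find_all(name, attrs={attribute dict})	soup.find_all("input",attrs={"type":"text"})
soup.a.children	The direct children as an iterator (list_iterator): no index, no length, can be looped over	soup.a.children
soup.a.parent	The tag's parent node	soup.input.parent
soup.a.parents	Iterator over the ancestors; can only be looped over	soup.a.parents
soup.Tag.contents	The tag's direct children as a list, with indexes and a length	soup.body.contents
soup.Tag.next_sibling	The next node at the same level	soup.div.next_sibling
soup.Tag.next_siblings	All following nodes at the same level: iterator, no index, no length, can be looped over	soup.div.next_siblings
soup.Tag.previous_sibling	The previous node at the same level	soup.div.previous_sibling
soup.Tag.previous_siblings	All preceding nodes at the same level: iterator, no index, no length, can be looped over	soup.div.previous_siblings
soup.p.string	The node's content; None if the tag has more than one child; a comment is returned too	soup.p.string
soup.p.text	The text content with comments excluded	soup.p.text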
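
The navigation properties in the table can be tried on a small tree. This is a minimal sketch on made-up markup with no whitespace between tags, so every sibling is a tag; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical markup written without whitespace between the tags
html = "<body><div id='a'>A</div><div id='b'>B</div><p>C</p></body>"
soup = BeautifulSoup(html, "html.parser")

body = soup.body
first = body.contents[0]  # contents is a list, so it can be indexed

child_names = [child.name for child in body.children]  # children must be looped
following = [sib.name for sib in first.next_siblings]
ancestors = [parent.name for parent in first.parents]

print(child_names)
print(first.next_sibling["id"], first.previous_sibling)
print(first.parent.name, following, ancestors)
```

In a file formatted like hezhi.html, the newlines and tabs between tags are text nodes, so soup.div.next_sibling there is a whitespace string rather than the next div.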

Contents of hezhi.html:

<html>
<head>
	<meta charset="utf-8">
	<title>BeautifulSoup</title>	
	<style>
	*{
		margin:0;
		padding:0;
		text-align:center;
	}
	.first{
		background-color:yellow;
	}
	.second{
		background-color:blue;
	}
	.third{
		background-color:green;
	}
	
	</style>
	
</head>

<body>

	<div class="first">I'am the First Div</div>
	
	<div class="second">I'am the Second Div</div>
	
	<div class="third" name="three">
		<H1>I'am a H1</H1>
		<p name="P1">I'am  a  P</p>
	</div>
	
	<form action="#">
		<input class="user" type="text" /> <br/>
		
		<input class="pwd" type="text"  /> <br/>
		
		<input type="submit">	
	</form>
	
	<a class="Tag_A" href="http://www.baidu.com">百度</a>
	<p name="P2" > <!--我是註釋--> </p>
	
</body>

</html>
