[Python Crawler Journey, Day 6]: Basic BeautifulSoup4 Operations & Common CSS Selectors

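The BeautifulSoup4 library:
It is an HTML/XML parser, similar to the lxml library covered earlier but easier to use. Because it builds a tree from the entire document on every parse, it is slower than lxml.
Installation (the 'lxml' parser used below is a separate package, installed with pip install lxml):
pip install bs4
Basic usage of BeautifulSoup4: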

from bs4 import BeautifulSoup
html="""long HTML source"""
bs=BeautifulSoup(html,'lxml')
print(bs.prettify())

In bs=BeautifulSoup(html,'lxml'), 'lxml' selects one of the parsers that the library supports.
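As a minimal sketch of the basic usage, the snippet below parses a small, made-up HTML string (the markup is hypothetical, not from the original post). It assumes the built-in "html.parser" instead of 'lxml' so that it runs without extra packages; with lxml installed, passing 'lxml' works the same way.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration
html = "<ul><li class='item'><a href='/a'>First</a></li><li><a href='/b'>Second</a></li></ul>"

# "html.parser" ships with Python; swap in "lxml" if it is installed
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as an indented string
pretty = soup.prettify()
print(pretty)
```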
Notes:
1. Using find_all:

soup=BeautifulSoup(html,'lxml')
#print(bs.prettify())
#print("1")

2. The difference between find and find_all
find returns the first tag that matches; find_all returns every tag that matches.
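The difference can be seen on a small example. This is only a sketch: the HTML and the class name "job" are made up for illustration, and "html.parser" is assumed in place of 'lxml' so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = """
<ul>
  <li class="job">Engineer</li>
  <li class="job">Designer</li>
  <li>Other</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("li")                    # a single Tag, the first match
all_lis = soup.find_all("li")              # list-like result with every match
jobs = soup.find_all("li", class_="job")   # filter by class
missing = soup.find("table")               # find returns None when nothing matches

print(first.string)
print(len(all_lis), len(jobs), missing)
```

Note that find gives back None on a miss, while find_all gives back an empty result, so code that chains calls after find should check for None first.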
3. Getting a tag's attributes (two ways):

aList=soup.find_all("a")
for a in aList:
    #1. Subscript the tag directly
    #hers=a["href"]
    #print(hers)
    #2. Go through the attrs dictionary
    her=a.attrs['href']
    print(her)
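Both forms above raise KeyError when a tag has no href attribute. One possible way around that, shown here as a sketch on made-up markup, is Tag.get, which returns None for a missing attribute. "html.parser" is assumed instead of 'lxml' to keep the snippet dependency-free.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href
html = '<a href="/jobs/1">One</a><a name="anchor">No link</a>'
soup = BeautifulSoup(html, "html.parser")

hrefs = []
for a in soup.find_all("a"):
    href = a.get("href")      # None instead of KeyError when href is absent
    if href is not None:
        hrefs.append(href)

print(hrefs)
```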

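4. Differences between string, strings, stripped_strings and get_text()
string: returns the single non-tag string directly under a tag. If the tag has more than one child (see .contents), it returns None.
strings: returns all non-tag strings among the tag's descendants, as a generator.
stripped_strings: like strings, but with surrounding whitespace stripped and whitespace-only strings skipped. It returns a generator, which list() can convert to a list.
get_text(): returns all non-tag strings among the tag's descendants, joined into one string.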
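The four accessors can be compared side by side. The markup below is a made-up example, and "html.parser" is assumed rather than 'lxml'; the comments show what each call returns for this particular input.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div with two <p> children and no whitespace between them
html = "<div><p> Title </p><p>2024-01-01</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")
first_p = div.find("p")

single = first_p.string                 # ' Title ' (one child, so .string works)
multi = div.string                      # None (div has two children)
all_strings = list(div.strings)         # [' Title ', '2024-01-01']
stripped = list(div.stripped_strings)   # ['Title', '2024-01-01']
text = div.get_text()                   # ' Title 2024-01-01'

print(repr(single), multi, all_strings, stripped, repr(text))
```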
Some example code:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
#print(bs.prettify())
#print("1")
#1. Get all li tags
lis=soup.find_all("li")
for li in lis:
    print(li)
    print("*"*30)
#2. Get the second li tag
li=soup.find_all("li",limit=2)[1]
print(li)
#3. Get all div tags with class=industry
lis=soup.find_all("div",class_="industry")  # can also be written as lis=soup.find_all("div",attrs={"class":"industry"})
for li in lis:
    print(li)
#4. Get all a tags with id=test and class=test
#alis=soup.find_all("a",id="test",class_="test") is equivalent to
alis=soup.find_all("a",attrs={"id":"test","class":"test"})  # inside attrs the key is "class", not "class_"
for a in alis:
    print(a)
#5. Get the href attribute of every a tag
aList=soup.find_all("a")
for a in aList:
    #1. Subscript the tag directly
    #hers=a["href"]
    #print(hers)
    #2. Go through the attrs dictionary
    her=a.attrs['href']
    print(her)
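#6. Get the text inside each li tag (assumes every li holds three divs: title, time, job)
lis=soup.find_all("li")
for li in lis:
    # Search inside the current li, not the whole document
    tds=li.find_all("div")
    title=tds[0].string
    time=tds[1].string
    job=tds[2].string

    # Alternative: collect every descendant string of the li in one go
    # infos=li.strings  (generator that keeps whitespace-only strings)
    infos=list(li.stripped_strings)
    movie={}
    movie['title']=infos[0]
    movie['time']=infos[1]
    movie['job']=infos[2]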

#get_text()
lis=soup.find_all("li")
for li in lis:
    # get_text() is a Tag method; the list returned by find_all does not have it
    text=li.get_text()

CSS selectors commonly used in crawlers:
1. Select by tag name, for example:


p{
	background-color:red
}

2. Select by class name by putting "." before the class name, for example:

.linn{
	background-color:red
}

3. Select by id by putting "#" before the id, for example:

#line3{
	background-color:red
}

4. Combined lookup: to find descendant elements, put a space before the descendant selector, for example:

.box p{
	background-color:red
}

5. To find direct children only, put ">" between parent and child, for example:

.box>p{
	background-color:red
}

6. Select by attribute name and value, for example:

input[name='username']{
	background-color:red
}

7. When selecting by class or id, to filter by tag as well, put the tag name in front, for example:

(by id)
div#line{
	background-color:red
}
or (by class)
div.line{
	background-color:red
}

To use CSS selectors in BeautifulSoup, call soup.select('selector string').
1. Get all tr tags:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
trs=soup.select("tr")

2. Get the second tr tag:

soup=BeautifulSoup(html,'lxml')
trs=soup.select("tr")[1]

3. Get all tr tags whose class is even:

soup=BeautifulSoup(html,'lxml')
trs=soup.select("tr.even")
trs=soup.select("tr[class='even']")  # attribute form; matches only when the class attribute is exactly "even"

4. Get the href of every a tag:

soup=BeautifulSoup(html,'lxml')
alist=soup.select("a")
for a in alist:
	href=a['href']
	print(href)
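The seven selector forms listed earlier can all be passed to soup.select. The sketch below tries each one against a small made-up document; the markup reuses the names .box, .linn, #line3 and username from the CSS examples but is otherwise hypothetical, and "html.parser" is assumed instead of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical markup built around the class/id names used in the CSS examples
html = """
<div class="box" id="line3">
  <p>direct child</p>
  <span><p class="linn">nested</p></span>
  <input name="username" value="u">
</div>
<p>outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")                         # 1. tag name
by_class = soup.select(".linn")                   # 2. class name
by_id = soup.select("#line3")                     # 3. id
descendants = soup.select(".box p")               # 4. any descendant of .box
children = soup.select(".box > p")                # 5. direct children only
by_attr = soup.select("input[name='username']")   # 6. attribute value
tag_and_id = soup.select("div#line3")             # 7. tag plus id
tag_and_class = soup.select("p.linn")             # 7. tag plus class

print(len(by_tag), len(descendants), len(children))
```

For this input the descendant selector finds both paragraphs inside .box, while the child selector finds only the one that sits directly under it.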

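Four common object types:
1. Tag: every tag in a BeautifulSoup tree is a Tag object, and the BeautifulSoup object itself is also a Tag. Methods such as find and find_all are defined on Tag rather than on BeautifulSoup.
2. NavigableString: inherits from Python's str and is used the same way as str.
3. BeautifulSoup: inherits from Tag and represents the whole parsed document tree. Methods such as find_all and select are in fact inherited from Tag.
4. Comment: inherits from NavigableString.
".contents" and ".children"
Both return a tag's direct children, strings included. The difference is that .contents returns a list while .children returns an iterator.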
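These relationships can be checked with isinstance. The snippet is a sketch on made-up markup, again assuming "html.parser" in place of 'lxml'.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup: a string, a nested tag, and a comment inside one <p>
html = "<p>Hello<b>bold</b><!-- note --></p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

soup_is_tag = isinstance(soup, Tag)             # BeautifulSoup subclasses Tag
text_is_str = isinstance(p.contents[0], str)    # NavigableString subclasses str
comment = p.contents[2]
comment_is_nav = isinstance(comment, NavigableString)  # Comment subclasses NavigableString

contents = p.contents                           # a plain list of direct children
names = [type(child).__name__ for child in p.children]  # .children is iterated lazily

print(soup_is_tag, text_is_str, comment_is_nav)
print(names)
```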
