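Getting Tag Names, Attributes, Contents, and Comments with the Python Scraping Library BeautifulSoup

This article shows how to use the Python scraping library BeautifulSoup to get an object's (tag's) name, attributes, contents, and comments.
I. Tag objects

1. A Tag object corresponds to a tag in the original XML or HTML document.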

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','lxml')
tag = soup.b
type(tag)

bs4.element.Tag

2. The name attribute of a Tag

Every tag has a name, which you can read through .name

tag.name

'b'

tag.name = "blockquote" # renaming the tag changes the parsed document
tag

<blockquote class="boldest">Extremely bold</blockquote>
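As a small illustration of .name on more than one tag, the sketch below lists the name of every tag in a document and then renames one of them. The sample markup is made up for this example, and it uses the built-in html.parser instead of lxml so that no extra parser has to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, parsed with the built-in html.parser.
soup = BeautifulSoup('<div><p>Hi <b>there</b></p></div>', 'html.parser')

# find_all(True) matches every tag; .name gives each tag's name.
names = [t.name for t in soup.find_all(True)]
print(names)

# Renaming a tag changes the markup that Beautiful Soup generates.
soup.b.name = 'strong'
print(soup)
```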

3. Tag attributes

Get a single attribute

tag['class']

['boldest']

Get all attributes as a dictionary

tag.attrs

{'class': ['boldest']}

Add or change attributes

tag['class'] = 'verybold'
tag['id'] = 1
print(tag)

<blockquote class="verybold" id="1">Extremely bold</blockquote>

Delete attributes

del tag['class']
del tag['id']
tag

<blockquote>Extremely bold</blockquote>
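Once an attribute has been deleted, reading it with tag['class'] raises a KeyError. When scraping real pages it is often unclear whether an attribute is present, so a defensive read can be useful. The sketch below uses a made-up link and the built-in html.parser; link, href, and the other variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a link with href and class, but no title.
soup = BeautifulSoup('<a href="/home" class="nav main">Home</a>', 'html.parser')
link = soup.a

href = link.get('href')                    # value of an existing attribute
missing = link.get('title')                # None when the attribute is absent
fallback = link.get('title', 'untitled')   # default value, like dict.get
has_class = link.has_attr('class')         # presence check without reading

# Square-bracket access raises KeyError for a missing attribute.
try:
    link['title']
    raised = False
except KeyError:
    raised = True

print(href, missing, fallback, has_class, raised)
```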

4. Multi-valued attributes

A multi-valued attribute is returned as a list

css_soup = BeautifulSoup('<p class="body strikeout"></p>','lxml')
print(css_soup.p['class'])

['body', 'strikeout']

rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>','lxml')
print(rel_soup.a['rel'])
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)

['index']
<p>Back to the <a rel="index contents">homepage</a></p>

If the document is parsed as XML, tag attributes are not treated as multi-valued

xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']

'body strikeout'
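The list-versus-string behaviour can be compared side by side. The sketch below assumes a reasonably recent Beautiful Soup 4 release (the multi_valued_attributes argument was added in 4.8) and uses the built-in html.parser; the markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

html = '<p class="body strikeout" id="main"></p>'

# Default HTML behaviour: class is split into a list, id stays a string.
soup = BeautifulSoup(html, 'html.parser')
class_value = soup.p['class']
id_value = soup.p['id']

# get_attribute_list() always returns a list, whatever the attribute.
id_list = soup.p.get_attribute_list('id')

# multi_valued_attributes=None switches the splitting off for HTML too.
raw_soup = BeautifulSoup(html, 'html.parser', multi_valued_attributes=None)
raw_class = raw_soup.p['class']

print(class_value, id_value, id_list, raw_class)
```

get_attribute_list() is convenient when the code that consumes the value should not care whether the attribute happens to be multi-valued.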

II. Navigable strings (NavigableString)

1. Strings usually sit inside tags; Beautiful Soup wraps the string in a tag with the NavigableString class

from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','lxml')
tag = soup.b
print(tag.string)
print(type(tag.string))

Extremely bold
<class 'bs4.element.NavigableString'>

2. A NavigableString behaves like a Python str (it is a subclass of str); calling str() on it converts it directly to a plain str

unicode_string = str(tag.string)
print(unicode_string)
print(type(unicode_string))

Extremely bold
<class 'str'>

3. A string inside a tag cannot be edited in place, but it can be replaced with another string using the replace_with() method

tag.string.replace_with("No longer bold")
tag

<b class="boldest">No longer bold</b>
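The three points above can be pulled together in one short sketch: the string is a str subclass that still knows its position in the tree, and its contents are swapped by replacement, not mutation. Assigning to tag.string is shown here as an alternative to replace_with(); html.parser is used so no extra parser is needed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
text = soup.b.string

# A NavigableString is a str subclass that also remembers its parent tag.
is_str = isinstance(text, str)
parent_name = text.parent.name

# Ordinary str methods work, but they return plain str values.
shouted = text.upper()

# Assigning to .string replaces the tag's contents with the new string.
soup.b.string = "No longer bold"
result = str(soup.b)

print(is_str, parent_name, shouted, result)
```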

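III. The BeautifulSoup object

A BeautifulSoup object represents the entire content of a document.

Most of the time you can treat it as a Tag object: it supports most of the methods used for navigating and searching the document tree.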
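A minimal sketch of what "treat it as a Tag" means in practice, using made-up markup and the built-in html.parser. The special name '[document]' is what Beautiful Soup assigns to the top-level object, since it does not correspond to a real tag.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')

# The BeautifulSoup class is itself a subclass of Tag.
is_tag = isinstance(soup, Tag)

# It has no real tag behind it, so it carries a placeholder name.
doc_name = soup.name

# Searching and navigating work the same way as on any Tag.
paragraphs = [str(p.string) for p in soup.find_all('p')]
first = str(soup.p.string)

print(is_tag, doc_name, paragraphs, first)
```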

IV. Comments and special strings (Comment objects)

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup,'lxml')
comment = soup.b.string
type(comment)

bs4.element.Comment

A Comment object is a special type of NavigableString

comment

'Hey, buddy. Want to buy a used parser?'
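Because a Comment is a NavigableString subclass, it turns up wherever strings do, which can pollute scraped text. One possible way to collect or strip the comments in a document is sketched below; the markup is made up, and the lambda filter is just one approach.

```python
from bs4 import BeautifulSoup, Comment, NavigableString

markup = "<div><!-- first note --><p>Visible text</p><!-- second note --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Pick out only the strings that are Comment instances.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
texts = [c.strip() for c in comments]
all_navigable = all(isinstance(c, NavigableString) for c in comments)

# extract() removes each comment from the tree.
for c in comments:
    c.extract()
cleaned = str(soup)

print(texts, all_navigable, cleaned)
```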


python進步學習者
