Basic Usage of Beautiful Soup

Practice document

<head>
<meta charset="utf-8"/>
<title>python Beautiful soup 測試</title>
<link href="images/bitbug_favicon.ico" rel="icon"/>
<link href="css/in-css.css" rel="stylesheet" type="text/css"/>
<link href="css/theme-style.css" rel="stylesheet" type="text/css"/>
<script src="js/jquery-1.8.3.min.js" type="text/javascript"></script>
<script src="js/in-js.js" type="text/javascript"></script>
<script src="js/jquery.min.js" type="text/javascript"></script>
<p class="world">hello world</p>
<p class="python">hello python</p>
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="brother" id="link2">Lacie</a>
</head>

Basic usage

Understanding BeautifulSoup

First we make the "soup". Here we use 'lxml' as the parser.
Available parsers:

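Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python: 1. Moderate speed 2. Good tolerance of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	1. Fast 2. Good tolerance of malformed documents	Requires installing a C library
lxml XML parser	1. BeautifulSoup(markup, "lxml-xml") 2. BeautifulSoup(markup, "xml")	1. Fast 2. The only parser that supports XML	Requires installing a C library
html5lib	BeautifulSoup(markup, "html5lib")	1. Best tolerance of malformed documents 2. Parses the document the way a browser does 3. Produces HTML5-style documents	1. Slow 2. Requires an external Python package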

from bs4 import BeautifulSoup
soup = BeautifulSoup(doc, 'lxml')  # doc holds the practice document above as a string
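If lxml might not be installed everywhere the script runs, one option is to fall back to the built-in parser. The sketch below assumes a helper named make_soup (hypothetical, not part of Beautiful Soup) and relies on bs4 raising FeatureNotFound when the requested parser is unavailable. Note that the two parsers can build slightly different trees for malformed markup, so this is a convenience rather than a guarantee of identical results.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if available, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; the standard-library parser needs no extras.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup('<p class="world">hello world</p>')
print(soup.p.string)
# hello world
```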

Pretty-printed output

print(soup.prettify())
# Output
"""
<html>
 <head>
  <meta charset="utf-8"/>
  <title>
   python Beautiful soup 測試
  </title>
  <link href="images/bitbug_favicon.ico" rel="icon"/>
  <link href="css/in-css.css" rel="stylesheet" type="text/css"/>
  <link href="css/theme-style.css" rel="stylesheet" type="text/css"/>
  <script src="js/jquery-1.8.3.min.js" type="text/javascript">
  </script>
  <script src="js/in-js.js" type="text/javascript">
  </script>
  <script src="js/jquery.min.js" type="text/javascript">
  </script>
 </head>
 <body>
  <p class="world">
   hello world
  </p>
  <p class="python">
   hello python
  </p>
  <a class="sister" href="http://example.com/elsie" id="link1">
   Elsie
  </a>
  <a class="brother" href="http://example.com/lacie" id="link2">
   Lacie
  </a>
 </body>
</html>
"""
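To see what prettify() adds compared with the plain string form, a small comparison can help. This is a minimal sketch on a made-up one-line snippet; it uses "html.parser" only so that it runs without lxml installed. Because prettify() inserts newlines and indentation, it is probably best treated as a tool for inspecting the tree rather than for producing the final markup.

```python
from bs4 import BeautifulSoup

# html.parser is used here only so the sketch runs without lxml installed.
soup = BeautifulSoup("<p>hello <b>world</b></p>", "html.parser")

compact = str(soup)        # markup on a single line, as parsed
pretty = soup.prettify()   # each tag and string on its own line, indented by depth

print(compact)
print(pretty)
```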

A few simple ways to navigate the structured data

# Find the title tag
print(soup.title)
# Output
# <title>python Beautiful soup 測試</title>
# Find the first link tag
print(soup.link)
# Output
# <link href="images/bitbug_favicon.ico" rel="icon"/>
# Get the string inside a tag
print(soup.title.string)
# Output
# python Beautiful soup 測試
# Read a tag's attributes; attribute values can be looked up dictionary-style.
print(soup.a['class'])
# or
print(soup.a.get('class'))
# Output (same for both calls)
# ['sister']
# Get the link address
print(soup.link['href'])
# Output
# images/bitbug_favicon.ico
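The two lookup styles above differ when an attribute is missing: indexing raises KeyError, while .get() returns None. The sketch below illustrates this on a single tag copied from the practice document; the "title" attribute is simply an example of something the tag does not have. It uses "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.a

css_classes = link["class"]    # class is multi-valued, so this is a list
link_id = link["id"]           # id is single-valued, so this is a string
missing = link.get("title")    # .get() returns None when the attribute is absent

print(css_classes, link_id, missing)
# ['sister'] link1 None
```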

Sometimes we need to scrape many links rather than just one. In that case we use soup.find_all().

# Scrape every link tag
for url in soup.find_all('link'):
    print(url['href'])
# Output
# images/bitbug_favicon.ico
# css/in-css.css
# css/theme-style.css

# Scrape the address of every a tag
for url in soup.find_all('a'):
    print(url['href'])
# Output
# http://example.com/elsie
# http://example.com/lacie
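The loops above raise KeyError if a tag has no href, and relative paths such as css/in-css.css are not directly fetchable. One way to handle both, sketched below, is to filter with href=True and resolve each value with urljoin from the standard library. BASE_URL and the extra anchor without an href are assumptions made for illustration; they are not part of the practice document.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://example.com/"  # assumed page address, for illustration only

html = """
<link href="css/in-css.css" rel="stylesheet"/>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a id="no-href">placeholder</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only tags that actually carry an href attribute.
tags = soup.find_all(["link", "a"], href=True)
urls = [urljoin(BASE_URL, tag["href"]) for tag in tags]

print(urls)
# ['http://example.com/css/in-css.css', 'http://example.com/elsie']
```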

Sometimes we need to pick out one specific link or one category of links.

print(soup.find_all('a', class_='brother'))
# Output
# [<a class="brother" href="http://example.com/lacie" id="link2">Lacie</a>]
# If we need the address of this link
print(soup.find_all('a', class_='brother')[0]['href'])
# Output
# http://example.com/lacie
# If we need the text inside it
print(soup.find_all('a', class_='brother')[0].string)
# Output
# Lacie
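When only one tag is wanted, find() can stand in for find_all(...)[0]: it returns the first match, or None when nothing matches, which avoids an IndexError on an empty list. The sketch below shows a few equivalent lookups on the two a tags from the practice document. The class name "cousin" is made up to demonstrate the no-match case, and "html.parser" is used so that the code runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="brother" id="link2">Lacie</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find("a", class_="brother")   # first match, or None
by_id = soup.find(id="link2")                 # lookup by the id attribute
by_css = soup.select_one("a.brother")         # same tag via a CSS selector
absent = soup.find("a", class_="cousin")      # no such class, so None

if by_class is not None:
    print(by_class["href"], by_class.get_text())
# http://example.com/lacie Lacie
```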

Today's post is fairly simple; I will update it later.
