Python crawler: reading tag attribute values with BeautifulSoup (a worked example)

Original post

2021-05-07 14:22

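The earlier examples all extracted values with regular expressions:

title = re.findall('">(.*?)</a>', i, re.S)[0]  # title
url = re.findall('="(.*?)" target', i, re.S)[0]  # link

This approach tolerates very little variation. The more data you crawl, the more likely it is that some tag fails to match the pattern, and when that happens re.findall returns an empty list and the [0] index raises an IndexError.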
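To make the fragility concrete, here is a minimal sketch that runs the URL pattern from above against two made-up anchor tags (the HTML strings and the helper name `extract_url` are assumptions for illustration, not taken from the target site). The second tag only swaps the attribute order, which is enough to stop the pattern from matching.

```python
import re

# Two hypothetical anchor tags; the second places target before href.
samples = [
    '<a href="/a/1.html" target="_blank">First article</a>',
    '<a target="_blank" href="/a/2.html">Second article</a>',
]

def extract_url(tag_html):
    """Return the URL captured by the pattern, or None when nothing matches."""
    matches = re.findall('="(.*?)" target', tag_html, re.S)
    # Indexing matches[0] directly would raise IndexError on an empty list.
    return matches[0] if matches else None

urls = [extract_url(s) for s in samples]
print(urls)
```

The guard in `extract_url` avoids the crash, but the second link is still lost, which is the real cost of matching on surrounding text.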

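This time the values are read directly from the tag's attributes. Unless the attribute names themselves change, this rarely fails.

The goal is to scrape the titles and links of the articles inside the red box in the screenshot below.

The HTML structure of the target content is shown in the screenshot below.

As you can see, the value of href is the link and the value of title is the article title, so the corresponding values are read like this: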

title = i.get("title")  # title
url = i.get("href")  # link
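The two calls can be tried on a single tag without touching the network. This is a small sketch on a made-up anchor tag (the HTML string and the `data-id` attribute name are assumptions); it uses the built-in "html.parser" so that lxml is not required. It also shows that `get` returns None for an attribute that is absent instead of raising an error.

```python
from bs4 import BeautifulSoup

# A hypothetical anchor tag shaped like the ones in the target list.
html = '<a href="/a/1.html" target="_blank" title="First article">First art...</a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

title = link.get("title")      # value of the title attribute
url = link.get("href")         # value of the href attribute
missing = link.get("data-id")  # attribute not present, so this is None

print(title, url, missing)
```

Note that the title attribute holds the full title even when the visible link text is truncated, which is one reason to prefer it over the text between the tags.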

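Because the target data is collected by matching every "a" tag, part of the result is not needed for this example. To keep the scraped content lean, the matching rule passed to soup.find_all was tightened.

Previously it was written simply as "results = soup.find_all('a')". It then turned out that the target entries all share target='_blank', which the other "a" tags do not have, so the call can be written as "results = soup.find_all('a', target='_blank')".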
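The effect of the extra keyword argument can be seen on a small inline page. The snippet below is a sketch with invented links (the HTML is an assumption, not the real page): only the tags whose target attribute equals "_blank" survive the filtered call.

```python
from bs4 import BeautifulSoup

# A hypothetical page fragment mixing navigation links and article links.
html = """
<a href="/index.html">Home</a>
<a href="/a/1.html" target="_blank" title="First article">First article</a>
<a href="/about.html" target="_self">About</a>
<a href="/a/2.html" target="_blank" title="Second article">Second article</a>
"""
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")                       # every anchor tag
article_links = soup.find_all("a", target="_blank")  # only target="_blank"

hrefs = [a.get("href") for a in article_links]
print(len(all_links), hrefs)
```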

With these two changes the script scrapes more precisely and tolerates variation in the page better.

The full code follows.

from bs4 import BeautifulSoup
import requests

fgwurl = 'http://fgw.hunan.gov.cn/fgw/tslm_77952/hgzh/index.html'

def fgw(fgwurl):
    # Download the listing page and decode it as UTF-8
    response = requests.get(fgwurl)
    response.encoding = 'utf-8'
    soup = BeautifulSoup(response.text, 'lxml')
    # Keep only the <a> tags that carry target="_blank"
    results = soup.find_all('a', target='_blank')
    for i in results:
        # title = i.get_text()  # title
        title = i.get("title")  # title
        url = i.get("href")  # link
        # Skip links that have no title or href attribute
        if title and url:
            print(title + "  " + "for details, click" + "  " + url)

fgw(fgwurl)
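The full script skips links that have no title. As an alternative sketch, that check can be pushed into the find_all call itself: passing True as an attribute value asks BeautifulSoup for tags that have the attribute, whatever its value. The HTML below is invented for illustration, and this variant is a suggestion, not what the original script does.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link opens in a new tab but has no title.
html = """
<a href="/a/1.html" target="_blank" title="First article">First article</a>
<a href="/ad.html" target="_blank">Sponsored</a>
"""
soup = BeautifulSoup(html, "html.parser")

# title=True matches any <a> that has a title attribute at all.
links = soup.find_all("a", target="_blank", title=True)
pairs = [(a["title"], a["href"]) for a in links]
print(pairs)
```

With the filter in place, every returned tag is known to have a title, so the loop body no longer needs its own check for it.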

