Python Web Scraping: Task 2

Original

2019-08-09 01:04

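Task overview (two days)

2.1 Learn BeautifulSoup
1. Learn BeautifulSoup and use it to extract content.
2. Use BeautifulSoup to extract the reply content from the DXY (丁香園) forum.
Note: DXY thread link: http://www.dxy.cn/bbs/thread/626626#626626
2.2 Learn XPath
1. Learn XPath and use lxml + XPath to extract content.
2. Use XPath to extract the reply content from the DXY forum.
Note: DXY thread link: http://www.dxy.cn/bbs/thread/626626#626626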

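2.1.1 Learn BeautifulSoup and use it to extract content

About BeautifulSoup:
BeautifulSoup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolkit that parses a document and hands you the data you want to scrape; because it is simple, a complete application takes very little code. BeautifulSoup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one, in which case you only need to state the original encoding. BeautifulSoup has become as capable a Python HTML parsing library as lxml and html5lib, and it can use either of them as its underlying parser, which gives you a choice between different parsing strategies or raw speed.
Import the BeautifulSoup library: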

from bs4 import BeautifulSoup
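As a first experiment, BeautifulSoup can be tried on a small inline document before touching a real site. The sketch below is only an illustration: the HTML snippet, the `reply` class, and both helper function names are made up for this example, and it uses Python's built-in `html.parser` so that lxml is not required. The second helper shows the case mentioned above where the original encoding has to be stated explicitly, via the `from_encoding` argument.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Forum thread</h1>
  <ul>
    <li class="reply">First reply</li>
    <li class="reply">Second reply</li>
  </ul>
</body></html>
"""


def list_replies(html):
    """Return the text of every <li class="reply"> element."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("li", class_="reply")
    return [item.get_text(strip=True) for item in items]


def decode_with_declared_encoding(raw_bytes, encoding):
    """Parse raw bytes, telling BeautifulSoup which encoding they use."""
    soup = BeautifulSoup(raw_bytes, "html.parser", from_encoding=encoding)
    return soup.get_text(strip=True)


if __name__ == "__main__":
    print(list_replies(SAMPLE_HTML))
    # Bytes in GBK carry no encoding declaration here, so we state it ourselves.
    print(decode_with_declared_encoding("<p>你好</p>".encode("gbk"), "gbk"))
```

Whatever the input encoding, the strings that come back are ordinary Python `str` objects.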

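2.1.2 Use BeautifulSoup to extract the reply content from the DXY forum

Looking at the page source, the pattern is easy to spot: every reply sits inside a <td class="postbody"> element, so we can collect the replies by selecting every element with class="postbody".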

from bs4 import BeautifulSoup
import requests

# Fetch the page with requests
def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    return None

# Select the replies with BeautifulSoup
def get_info(html):
    soup = BeautifulSoup(html, 'lxml')
    return soup.find_all(attrs={'class': 'postbody'})

# Main
if __name__ == '__main__':
    url = "http://www.dxy.cn/bbs/thread/626626#626626"
    html = get_html(url)
    if html is None:    # get_html() returns None on a non-200 response
        raise SystemExit('Request failed: ' + url)
    for info in get_info(html):
        print('-' * 100)
        print(info.text.strip())    # strip() removes leading and trailing whitespace, leaving only the text

Result: the text of each reply is printed, with a line of 100 hyphens before it.
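The selection step can also be checked without a network request. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML` is invented markup that only imitates the `<td class="postbody">` structure described above (it is not copied from the real page), and the two function names are hypothetical. It shows two equivalent ways to select the cells, `find_all` with `class_` and a CSS selector with `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure described above;
# it is not copied from the real DXY page.
SAMPLE_HTML = """
<table>
  <tr><td class="postbody">  First reply  </td></tr>
  <tr><td class="postbody">Second reply<br/>with a line break</td></tr>
  <tr><td class="author">someone</td></tr>
</table>
"""


def extract_replies(html):
    """Select reply cells by tag name and class, one string per reply."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td", class_="postbody")
    # get_text(" ", strip=True) strips each text piece and joins them with a space.
    return [cell.get_text(" ", strip=True) for cell in cells]


def extract_replies_css(html):
    """Same selection, written as a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    return [cell.get_text(" ", strip=True) for cell in soup.select("td.postbody")]


if __name__ == "__main__":
    for reply in extract_replies(SAMPLE_HTML):
        print("-" * 100)
        print(reply)
```

Keeping the parsing in a function that takes an HTML string makes it easy to test against saved pages before running it on live responses.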

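2.2.1 Learn XPath and use lxml + XPath to extract content

XPath (XML Path Language) is a language for locating information in XML documents. It was originally designed for searching XML, but it works equally well on HTML documents, so a scraper can use XPath to extract the information it needs.
Common XPath rules: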

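Expression	Description
nodename	Selects the child elements of the current node that have this name
/	Selects direct children of the current node (at the start of an expression, starts from the document root)
//	Selects descendants of the current node at any depth (at the start of an expression, searches the whole document)
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes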
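Each rule in the table can be tried on a small document with lxml. This is a sketch for illustration only: the HTML snippet and its class names (`postbody`, `sidebar`) are made up, and the variable names are arbitrary.

```python
from lxml import etree

# Hypothetical HTML snippet, used only to exercise the rules in the table.
SAMPLE_HTML = """
<html><body>
  <table>
    <tr><td class="postbody">First <b>reply</b></td></tr>
    <tr><td class="postbody">Second reply</td></tr>
    <tr><td class="sidebar">Not a reply</td></tr>
  </table>
</body></html>
"""

tree = etree.HTML(SAMPLE_HTML)  # returns the root <html> element

# nodename: child elements of the current node (<html>) named "body"
body_children = tree.xpath('body')

# //: every <td> anywhere in the document
all_cells = tree.xpath('//td')

# /: <b> elements that are direct children of a <td>
bold_in_cells = tree.xpath('//td/b')

# @ inside a predicate: only the cells whose class is "postbody"
reply_cells = tree.xpath('//td[@class="postbody"]')

# @ as the last step: the class attribute values themselves
cell_classes = tree.xpath('//td/@class')

# ..: the parent of each <b> element
bold_parent_tags = [element.tag for element in tree.xpath('//b/..')]

# .: a query relative to one element that was already selected
first_cell_bold_text = reply_cells[0].xpath('./b/text()')

if __name__ == '__main__':
    print(len(all_cells), cell_classes, bold_parent_tags, first_cell_bold_text)
```

Note that `xpath()` always returns a list, whether the expression matches elements, attribute values, or text nodes.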

Import the XPath-related library:

from lxml import etree

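2.2.2 Use XPath to extract the reply content from the DXY forum

The principle is the same as above: select every <td class="postbody"> element, this time with an XPath expression.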

from lxml import etree
import requests

# Fetch the page with requests
def get_html(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    response = requests.get(url, headers=headers, timeout=10)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    return None

# Select the reply text with XPath
def get_info(html):
    info_html = etree.HTML(html)
    # //text() returns every text node under each matching <td> as a separate string
    result = info_html.xpath('//td[@class="postbody"]//text()')
    return result

# Main
if __name__ == '__main__':
    url = "http://www.dxy.cn/bbs/thread/626626#626626"
    html = get_html(url)
    if html is None:    # get_html() returns None on a non-200 response
        raise SystemExit('Request failed: ' + url)
    for text in get_info(html):
        text = text.strip()    # strip() removes leading and trailing whitespace, leaving only the text
        if text:               # skip whitespace-only text nodes
            print(text)
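One difference from the BeautifulSoup version is worth noting: `//text()` yields one string per text node, so a reply that contains inline tags or line breaks arrives as several fragments. A possible alternative, sketched here on invented markup (the snippet and the function names are hypothetical, not taken from the real page), is to select the `<td>` elements first and join the text of each one with `itertext()`.

```python
from lxml import etree

# Hypothetical markup that imitates the <td class="postbody"> structure.
SAMPLE_HTML = """
<html><body><table>
  <tr><td class="postbody">First <b>reply</b></td></tr>
  <tr><td class="postbody">Second reply<br/>with a line break</td></tr>
  <tr><td class="author">someone</td></tr>
</table></body></html>
"""


def text_fragments(html):
    """The approach used above: one string per text node."""
    tree = etree.HTML(html)
    return tree.xpath('//td[@class="postbody"]//text()')


def extract_replies(html):
    """One string per reply: join the text nodes inside each matching <td>."""
    tree = etree.HTML(html)
    replies = []
    for cell in tree.xpath('//td[@class="postbody"]'):
        parts = (text.strip() for text in cell.itertext())
        replies.append(" ".join(part for part in parts if part))
    return replies


if __name__ == '__main__':
    print(text_fragments(SAMPLE_HTML))
    print(extract_replies(SAMPLE_HTML))
```

Which form is preferable depends on the goal: fragments are fine for dumping text, while one string per reply is easier to store or count.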
