Scraping Qiushibaike Jokes with a Python Crawler

Original

Waterkong

2020-02-22 00:44

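The code may fail with the following error:

'gbk' codec can't encode character u'\xa0' in position 3621: illegal multiby

This is an encoding problem that I cannot fully solve with what I know so far. Note that the message says "can't encode": it is raised when the already decoded text is written out through a GBK stream (for example, when printing to a console that uses GBK), because the non-breaking space u'\xa0' has no GBK equivalent. In my experiments, decoding the page with "gb2312" avoids the error, but the text is then not displayed correctly.

Decoding with "utf-8" can still trigger the error.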
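One possible workaround, shown here only as a minimal sketch, is to clean the text just before printing it. The helper name console_safe and the default "gbk" target are assumptions for illustration and are not part of the original script; the sketch replaces non-breaking spaces with ordinary spaces and turns any other character that GBK cannot represent into "?".

```python
def console_safe(text, encoding="gbk"):
    """Return text in which every character can be encoded with `encoding`."""
    # Turn non-breaking spaces into ordinary spaces first.
    text = text.replace("\xa0", " ")
    # Round-trip through the target encoding; unencodable characters become '?'.
    return text.encode(encoding, errors="replace").decode(encoding)


# Hypothetical sample containing a non-breaking space and an emoji.
sample = "段子\xa0one \U0001F600"
cleaned = console_safe(sample)
print(cleaned)
```

In the scraper, the same helper could wrap the string returned by get_text before it is passed to print.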

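What the program does: it scrapes the text-only jokes from Qiushibaike (糗事百科). Run the program, press Enter to view one joke at a time, and type Q to quit.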
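The "one joke per Enter, Q to quit" behaviour can be sketched on its own, separately from the scraping. In the sketch below, browse and its read and write parameters are hypothetical names introduced so that the loop can be driven by simulated key presses instead of a real keyboard.

```python
def browse(stories, read=input, write=print):
    """Show one story per Enter press; stop early when the user types Q.

    Returns the number of stories that were shown.
    """
    shown = 0
    for story in stories:
        if read() == "Q":
            break
        write(story)
        shown += 1
    return shown


# Simulated session: two Enter presses, then Q.
keys = iter(["", "", "Q"])
printed = []
shown = browse(["joke 1", "joke 2", "joke 3"],
               read=lambda: next(keys),
               write=printed.append)
```

Called as browse(stories) with the defaults, the same function reads from the keyboard and prints to the console.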

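Overall approach:

1. Write two functions: getonepage fetches every joke on a given page number, and QSBK_main controls the program flow.

2. getonepage first fetches the page with urlopen from urllib, then uses BeautifulSoup's find_all method to extract the authors and the jokes we need.

The authors and the jokes are extracted separately and then joined with the append method into a single item, so each joke is one item. Because these objects are BeautifulSoup tags rather than lists, append here is the tag's own append, which moves the joke element inside the author's tag. All the items on the page are then returned in one list, items.

3. QSBK_main is simple. The only new thing is the get_text method, which extracts the text from a tag.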
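Step 2 can be tried without touching the network. The sketch below runs find_all and get_text on a small hand-written HTML snippet. SAMPLE_HTML and extract_stories are hypothetical names, and the markup is only assumed to resemble the real page. Unlike the script, it pairs authors and jokes with zip and returns plain tuples instead of modifying the tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure the scraper expects;
# the real page layout may differ.
SAMPLE_HTML = """
<div class="article">
  <h2>alice</h2>
  <div class="content"><span>First joke.</span></div>
</div>
<div class="article">
  <h2>bob</h2>
  <div class="content"><span>Second joke.</span></div>
</div>
"""


def extract_stories(html):
    """Return a list of (author, joke) tuples found in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    authors = soup.find_all("h2")
    bodies = soup.find_all("div", {"class": "content"})
    # zip pairs the n-th author with the n-th joke and stops at the shorter list.
    return [(a.get_text(strip=True), b.get_text(strip=True))
            for a, b in zip(authors, bodies)]


stories = extract_stories(SAMPLE_HTML)
for author, joke in stories:
    print(author, ":", joke)
```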

Related learning material: http://cuiqingcai.com/990.html

Environment: Python 3 + urllib + BeautifulSoup (bs4)

The code is as follows:

# -*- coding: utf-8 -*-
from urllib import request

from bs4 import BeautifulSoup


def getonepage(page):
    """Fetch one listing page and return its jokes as a list of tags."""
    print("Fetching, please wait a moment...")
    url = 'http://www.qiushibaike.com/text/page/' + str(page)
    req = request.Request(url)
    # Send a browser-like User-Agent header with the request.
    req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
    response = request.urlopen(req).read()
    # Drop bytes that are not valid UTF-8 instead of raising an error.
    response = response.decode('utf-8', errors='ignore')
    soup = BeautifulSoup(response, 'html.parser')
    name_all = soup.find_all('h2')
    item_all = soup.find_all('div', {'class': 'content'})
    items = []  # local list, so jokes from earlier pages are not shown again
    # zip stops at the shorter list, so a length mismatch cannot raise IndexError.
    for name, content in zip(name_all, item_all):
        name.append(content)  # Tag.append: move the joke under the author tag
        items.append(name)
    print("Done. Press Enter to read!")
    return items


def QSBK_main():
    print("Press Enter to read a joke, type Q to leave.")
    page = 1
    while True:
        items = getonepage(page)
        if not items:  # nothing found on this page, so stop instead of looping
            break
        for count, story in enumerate(items, start=1):
            if input() == "Q":
                return
            print("Page", page, "joke", count)
            print(story.get_text())
        page += 1


if __name__ == "__main__":
    QSBK_main()
