Crawling a novel website and writing each chapter to its own txt file

Original

2020-02-23 17:11

1. Choose the site link

The link used in the code is a serialized novel picked from the home page of https://www.biqukan.com.

from bs4 import BeautifulSoup
import requests

link = 'https://www.biqukan.com/1_1094'

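2. Inspect the page source

Findings:
1. The site is GBK-encoded.
2. Every chapter entry is an a tag, so those tags need to be filtered out of the page.
3. Only the chapters from the main-text volume (正文卷) onward are wanted, which suggests slicing the list of tags.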
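To see why the first finding matters, here is a minimal sketch using only the standard library. The sample string is a made-up stand-in for page content, not text taken from the site. It shows that GBK bytes only decode cleanly with the GBK codec.

```python
# Hypothetical sample text standing in for content served by the site.
sample = "第一章 正文卷"
raw = sample.encode("gbk")  # the bytes a GBK-encoded page would contain

# Decoding with the matching codec recovers the original text.
decoded_ok = raw.decode("gbk")

# Decoding with the wrong codec garbles it; errors="replace" substitutes
# U+FFFD for byte sequences that are not valid UTF-8 instead of raising.
decoded_bad = raw.decode("utf-8", errors="replace")

print(decoded_ok)
print(decoded_ok == sample, decoded_bad == sample)
```

This is the reason the response encoding has to be set explicitly before reading the response text.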

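# Fetch the page and decode it as GBK (the encoding this site uses)
res = requests.get(link)
res.encoding = 'gbk'

# Parse the HTML with BeautifulSoup and collect every <a> tag
soup = BeautifulSoup(res.text, 'html.parser')
lis = soup.find_all('a')
lis = lis[42:-13]           # drop the links that are not chapter entries (indices found by inspecting this page)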

# urldict stores every {chapter title: link} pair
urldict = {}

# Chapter hrefs are appended to the site root, so keep only split_link = 'https://www.biqukan.com/'
tmp = link.split("/")
split_link = "{0}//{1}/".format(tmp[0], tmp[2])

# Build each {chapter title: link} pair and add it to urldict
for a in lis:
    # lstrip avoids a doubled slash when the href is root-relative
    chapter_url = split_link + a.attrs['href'].lstrip('/')
    print({a.string: chapter_url})
    urldict[a.string] = chapter_url
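Joining the site root and the href by hand depends on how the hrefs happen to be written. As a hedged alternative, the standard library's urljoin resolves relative, root-relative, and absolute hrefs against a base URL. The href values below are hypothetical examples, not values read from the site.

```python
from urllib.parse import urljoin

link = "https://www.biqukan.com/1_1094"

# Hypothetical hrefs covering the three shapes a page might use.
hrefs = [
    "/1_1094/123.html",                          # root-relative
    "456.html",                                  # relative to the book page
    "https://www.biqukan.com/1_1094/789.html",   # already absolute
]

# The trailing slash makes urljoin treat the book page as a directory,
# so a bare "456.html" resolves underneath it.
absolute = [urljoin(link + "/", h) for h in hrefs]

for url in absolute:
    print(url)
```

All three inputs resolve to a URL under https://www.biqukan.com/1_1094/, so the loop no longer needs to know which form the site uses.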

from tqdm import tqdm

# Download every chapter and write it to its own text file
for key in tqdm(urldict.keys()):
    tmplink = urldict[key]          # chapter link
    res = requests.get(tmplink)     # HTML of that chapter page
    res.encoding = 'gbk'

    soup = BeautifulSoup(res.text, 'html.parser')    # parse the chapter page
    content = soup.find_all('div', id='content')[0]  # the div that holds the chapter text

    with open('text{}.txt'.format(key), 'a+', encoding='utf8') as f:
        f.write(content.text.replace('\xa0', ''))
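The chapter title goes straight into the file name, so a title containing a character such as / or ?, or a link whose .string is None, could make open() fail or produce an odd name. A minimal sketch of a guard follows. The helper safe_filename and its fallback value are hypothetical names introduced here and are not part of the original script.

```python
import re


def safe_filename(title, fallback="untitled"):
    """Replace characters that common filesystems reject in file names."""
    if title is None:
        # a.string is None when the tag has no single text child
        return fallback
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", str(title)).strip()
    return cleaned or fallback


print(safe_filename("第1章 a/b?"))
print(safe_filename(None))
```

In the loop above it could be used as 'text{}.txt'.format(safe_filename(key)).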
