Scraping the novel 誅仙 (Zhu Xian) with Python

Original

2020-07-02 13:30

Well, a small glitch while writing my first blog post means I have to write it all over again...

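This was my first time scraping a novel, and I was a bit worried because I did not know which part of Python to use. After searching online I found that BeautifulSoup (bs) was the tool for the job, which secretly pleased me. Next I had to decide which novel to scrape. There are Sina novels and web novels online; after some thought I settled on 誅仙 (Zhu Xian), because it fits my idea of a novel and a ready-made URL was available. Right, back to the point; let's get to work.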

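1. First, have a clear plan. The end result should be a single file containing every chapter of the novel, with its title and text, so the code should define two empty lists up front: one for chapter titles and one for chapter links.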

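2. With the skeleton code written, open the page source of the novel's index page. The chapter entries follow a very regular pattern: each one is a link inside the div whose id is list.

So the lookup (a BeautifulSoup query rather than a regex) is easy to write: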

contents = soup.find("div", id="list").find_all("a")

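3. Matching the chapter titles was the hard part: I did not know how to write a regex for "第...章" (chapter N). After some digging I found a wonderfully handy tool: a converter from Chinese characters to Unicode escapes.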

Here is the link if you want to try it: http://www.bangnishouji.com/tools/chtounicode.html

So the regex this code needs is (\u7b2c is 第 and \u7ae0 is 章):

re.findall(re.compile(ur'\u7b2c.*\u7ae0'),item.text)
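The line above is Python 2 syntax. As a minimal sketch of the same filter on Python 3, where every string is already Unicode and the ur'' prefix is a syntax error, something like the following should work. The helper name is_chapter_title and the sample link texts are made up for illustration; they are not from the site.

```python
import re

# Python 3 strings are Unicode, so no ur'' prefix is needed.
# "\u7b2c" is the character 第 and "\u7ae0" is the character 章.
CHAPTER_RE = re.compile("\u7b2c.*\u7ae0")


def is_chapter_title(text):
    """Return True if the link text contains 第 ... 章 (hypothetical helper)."""
    return CHAPTER_RE.search(text) is not None


# Made-up link texts for illustration only.
samples = ["第一章 示例", "第二章 示例", "最新章節", "作者的話"]
chapter_titles = [s for s in samples if is_chapter_title(s)]
print(chapter_titles)
```

Compiling the pattern once and reusing it avoids recompiling it for every link on the index page.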

4. Next, match the chapter text:

content = soup.find("div", id="content")

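Then strip out everything that is not needed with re.sub. Simply copy and paste the text to be removed into the pattern, even when it is long, for example: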

cont = re.sub(r'<script src="/js/chaptererror.js" type="text/javascript"></script>', '', cont)
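One caveat with pasting markup straight into a pattern is that characters such as . or ? are regex syntax. A small sketch of the same cleanup that sidesteps this with re.escape is shown below, written for Python 3. The function name clean_chapter_html and the sample string are assumptions for illustration, not the site's real markup.

```python
import re

# Literal snippets to delete; copied markup goes here unchanged.
JUNK = (
    '<script src="/js/chaptererror.js" type="text/javascript"></script>',
    '<div id="content">',
    '</div>',
)


def clean_chapter_html(raw):
    """Remove wrapper markup and turn <br/> tags into newlines (hypothetical helper)."""
    for junk in JUNK:
        # re.escape makes the pasted text match literally, never as regex syntax.
        raw = re.sub(re.escape(junk), '', raw)
    # Accept <br>, <br/> and <br /> alike.
    return re.sub(r'<br\s*/?>', '\n', raw).strip()


# Made-up sample resembling a chapter's content div.
sample = ('<div id="content">Line one<br/>Line two<br/>'
          '<script src="/js/chaptererror.js" type="text/javascript"></script></div>')
cleaned = clean_chapter_html(sample)
print(cleaned)
```

For purely literal removals, str.replace would do the same job without any regex at all; re.sub is kept here to stay close to the approach in the post.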

5. Putting it all together, the complete code is:

# -*- coding:utf-8 -*-
# Python 2 script: urllib2, print statements and ur'' literals do not exist in Python 3.

from bs4 import BeautifulSoup
import urllib2
import re

base_url = 'http://www.biquge.tw'
url = base_url + '/26_26491/'
user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0"
headers = {'User-Agent': user_agent}


def fetch_soup(page_url):
    # Download a page with a browser-like User-Agent and parse it.
    request = urllib2.Request(page_url, headers=headers)
    html = urllib2.urlopen(request).read()
    return BeautifulSoup(html, 'html.parser', from_encoding='utf-8')


# Every chapter link sits inside <div id="list">.
contents = fetch_soup(url).find("div", id="list").find_all("a")

# Step 1: collect titles and links first, keeping only entries that contain 第...章.
# The lists are created once, outside the loop, so they are not reset on every link.
chapter_re = re.compile(ur'\u7b2c.*\u7ae0')
title = []
href = []
for item in contents:
    if chapter_re.search(item.text):
        title.append(item.text)
        href.append(item['href'])

# Download each chapter in order and append it to ZX.txt.
for i in range(len(href)):
    try:
        soup = fetch_soup(base_url + href[i])
        content = soup.find("div", id="content")
        cont = str(content)
        # Strip the wrapper markup and turn <br/> into real line breaks.
        cont = re.sub(r'<script src="/js/chaptererror.js" type="text/javascript"></script>', '', cont)
        cont = re.sub(r'</div>', '', cont)
        cont = re.sub(r'<div id="content">', '', cont)
        cont = re.sub(r'<br/>', '\n', cont)
        with open('ZX.txt', 'a') as f:
            f.write(title[i].encode('utf-8') + '\n' + cont + '\n')
        print "OK"
    except Exception:
        # Report the chapter that failed and carry on with the next one.
        print "NO", href[i]
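For readers on Python 3, where urllib2 no longer exists, the request-building part could look roughly like the sketch below. It uses urllib.request and urllib.parse from the standard library and only constructs the request without sending it. The function names chapter_url and build_request and the href value are hypothetical; real hrefs come from the index page.

```python
from urllib.parse import urljoin
from urllib.request import Request

INDEX_URL = "http://www.biquge.tw/26_26491/"
USER_AGENT = ("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) "
              "Gecko/20100101 Firefox/49.0")


def chapter_url(href):
    """Resolve a chapter href against the index page.

    urljoin handles both site-absolute ('/26_26491/1.html') and
    relative ('1.html') hrefs, unlike plain string concatenation.
    """
    return urljoin(INDEX_URL, href)


def build_request(url):
    """Create, but do not send, a request with a browser-like User-Agent."""
    return Request(url, headers={"User-Agent": USER_AGENT})


# Hypothetical href, made up for illustration.
req = build_request(chapter_url("/26_26491/1.html"))
print(req.full_url)
# To actually download the page:
#   from urllib.request import urlopen
#   html = urlopen(req, timeout=10).read()
```

Passing a timeout to urlopen is worth considering for a loop over many chapters, so that one stalled connection does not hang the whole run.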
