Scraping Qidian's free completed novels with bs4 and re

Original

2020-07-02 13:30

http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1&orderId=&update=-1&page=1&month=-1&style=1&action=-1

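This is the first page of the listing. Looking at the URL, I noticed that page is the only parameter that changes, so I used a for loop to generate the URL of each page.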
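As a minimal sketch of that idea, the loop below fills the page number into the URL shown above. It is written for Python 3, and the helper name list_page_url and the three-page range are assumptions for illustration, not part of the original script.

```python
# Template for the listing URL; only the page number changes between pages.
BASE_URL = ('http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1'
            '&subCateId=-1&orderId=&update=-1&page=%d&month=-1&style=1&action=-1')


def list_page_url(page):
    """Return the listing URL for a 1-based page number (hypothetical helper)."""
    return BASE_URL % page


# Build the URLs for pages 1 to 3 (the range is an arbitrary example).
page_urls = [list_page_url(p) for p in range(1, 4)]
for u in page_urls:
    print(u)
```

Keeping the template in one place means the loop bounds are the only thing to change when crawling more pages.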

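Since the goal is to cover every novel in the listing, I decided to scrape each novel's title and link first, then follow each link to scrape the chapter titles, chapter links, and chapter text. The biggest problem I ran into was working out how to get the chapter titles and chapter links from a novel's page, that is, the text of its chapter list.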

After a good deal of searching, I found that all it takes is a regular expression that matches the chapter links, applied with re.finditer:

reg = re.finditer(r'<a data-cid="(.*?)" data-eid="qd_G55" href=".*?" target="_blank" title="首發時間：.*?章節字數：.*?">(.*?)</a>',tent)
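To see what finditer returns, here is a small sketch that runs a simplified pattern over a hand-written HTML fragment. The fragment, its URLs, and the chapter titles are made up for illustration, and the pattern leaves out the title attribute, so it is not the exact expression used on the real page. It is written for Python 3.

```python
import re

# Hand-written sample markup (an assumption, not copied from the real site).
sample_html = (
    '<ul class="cf">'
    '<li><a data-cid="//read.example.com/chapter/1" data-eid="qd_G55" '
    'href="//read.example.com/chapter/1" target="_blank">Chapter 1</a></li>'
    '<li><a data-cid="//read.example.com/chapter/2" data-eid="qd_G55" '
    'href="//read.example.com/chapter/2" target="_blank">Chapter 2</a></li>'
    '</ul>'
)

# Simplified pattern: group 1 captures the link, group 2 the chapter title.
pattern = r'<a data-cid="(.*?)" data-eid="qd_G55" href=".*?" target="_blank">(.*?)</a>'

# finditer yields one match object per chapter link, in document order.
chapters = [(m.group(1), m.group(2)) for m in re.finditer(pattern, sample_html)]
for link, title in chapters:
    print(title, 'http:' + link)
```

Unlike findall, finditer hands back match objects, so each group can be read by number with group(1) and group(2).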

Of course, a few details also need attention:

1. Prepend "http:" to each book link extracted from the listing page:

        for i in url2:
            url2 = 'http:' + i
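Prepending "http:" works when the extracted link has no scheme, which is presumably the case here. A slightly more general option, sketched below for Python 3, is urljoin from the standard library, which takes the scheme from the page the link was found on (in Python 2 the same function lives in the urlparse module). The book URL is a made-up placeholder.

```python
from urllib.parse import urljoin

# Page the link was found on; its scheme is reused for scheme-less links.
list_page = 'http://f.qidian.com/all'

# Placeholder link without a scheme (an assumption for illustration).
href = '//book.example.com/info/12345'

book_url = urljoin(list_page, href)
print(book_url)
```

Links that already carry a scheme pass through urljoin unchanged, so the same call handles both forms.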

2. In each match, group(1) is the chapter link and group(2) is the chapter title:

            for i in reg:
                # print i.group(1)
                page_url = "http:"+ i.group(1)
                page_name = i.group(2)

3. Extract the chapter text with get_text('\n') and encode it as UTF-8 before writing it out:

content7 = soup.find("div", class_="read-content j_readContent").get_text('\n').encode('utf-8')

4. Open the file in append mode ('a') so that every chapter of a book ends up in the same file:

                with open(name + '.txt','a') as f:
                    f.write(page_name + '\n' + content7 + '\n')
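Append mode matters because the file is reopened once per chapter. Below is a small sketch of the same pattern in Python 3, where the text stays a str and the encoding is passed to open instead of calling encode. The helper name append_chapter and the sample strings are assumptions for illustration.

```python
import os
import tempfile


def append_chapter(path, chapter_title, chapter_text):
    """Append one chapter (title line, then body) to the book file (hypothetical helper)."""
    # Mode 'a' creates the file if needed and always writes at the end.
    with open(path, 'a', encoding='utf-8') as f:
        f.write(chapter_title + '\n' + chapter_text + '\n')


# Write two made-up chapters to a temporary file and read the result back.
book_path = os.path.join(tempfile.mkdtemp(), 'demo_book.txt')
append_chapter(book_path, 'Chapter 1', 'First chapter text.')
append_chapter(book_path, 'Chapter 2', 'Second chapter text.')

with open(book_path, encoding='utf-8') as f:
    saved = f.read()
print(saved)
```

With mode 'w' instead of 'a', each call would overwrite the file and only the last chapter would survive.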

Finally, the complete code:

# -*- coding:utf-8 -*-
# Python 2 script (urllib2, print statements).

from bs4 import BeautifulSoup
import urllib2
import re
import time

# Send a browser-like User-Agent with every request.
user_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:49.0) Gecko/20100101 Firefox/49.0"
headers = {'User-Agent': user_agent}

# range(1,2) covers page 1 only; widen the range to crawl more listing pages.
for p in range(1,2):
    url = 'http://f.qidian.com/all?size=-1&sign=-1&tag=-1&chanId=-1&subCateId=-1&orderId=&update=-1&page=%d&month=-1&style=1&action=-1' % p
    response = urllib2.Request(url, headers=headers)
    html = urllib2.urlopen(response).read()
    soup = BeautifulSoup(html, 'html.parser', from_encoding='utf-8')
    # Each <h4> in the book list holds one book's title and link.
    content1 = soup.find("div",class_="all-book-list").find_all('h4')

    for k in content1:
        name = k.find('a').get_text(strip=True)
        print name
        k = str(k)
        url2 = re.findall(r'.*?<a.*?href="(.*?)".*?>', k)

        for i in url2:
            # Prepend the scheme to the extracted book link.
            url2 = 'http:' + i
            #print url2
            response = urllib2.Request(url2, headers=headers)
            html3 = urllib2.urlopen(response).read()
            soup = BeautifulSoup(html3, 'html.parser')
            # The chapter links are searched for inside the <ul class="cf"> elements.
            tent = soup.find_all("ul",class_="cf")
            tent = str(tent)
            reg = re.finditer(r'<a data-cid="(.*?)" data-eid="qd_G55" href=".*?" target="_blank" title="首發時間：.*?章節字數：.*?">(.*?)</a>',tent)

            for i in reg:
                # group(1) is the chapter link, group(2) the chapter title.
                #print i.group(1)
                page_url = "http:"+ i.group(1)
                page_name = i.group(2)
                print page_name
                response = urllib2.Request(page_url, headers=headers)
                html4 = urllib2.urlopen(response).read()
                soup = BeautifulSoup(html4, 'html.parser', from_encoding='utf-8')
                content7 = soup.find("div", class_="read-content j_readContent").get_text('\n').encode('utf-8')

                # Append each chapter to a file named after the book.
                with open(name + '.txt','a') as f:
                    f.write(page_name + '\n' + content7 + '\n')

                print 'OK1'

                # Pause between chapter requests.
                time.sleep(1)

Note that this only scrapes the first listing page.
