Python Web Scraping Study Notes (Part 1)

Original post

巡音luka

2020-02-22 06:38

Problem

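Visit the National University of Defense Technology admissions site (http://www.gotonudt.cn/site/gfkdbkzsxxw/lqfs/index.html) and scrape the admission score lines for each province and municipality over the years.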

Steps

1. Imports

import urllib.request as req
import re

2. Fetch the page source

url = 'http://www.gotonudt.cn/site/gfkdbkzsxxw/lqfs/info/2017/717.html'
webpage = req.urlopen(url)      # open the page the URL points to
data = webpage.read()           # read the raw page data (bytes)
data = data.decode('utf-8')     # decode the bytes into a str

Printing data at this point outputs the HTML source of the page.
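The fetch above assumes the request succeeds and that the page is UTF-8. A slightly more defensive version might look like the sketch below. The helper names fetch_html and decode_html, the 10-second timeout, and the fallback behaviour are my own assumptions, not part of the original exercise.

```python
import urllib.request as req


def decode_html(raw, declared_charset=None, fallback='utf-8'):
    """Decode raw page bytes, preferring the charset the server declared."""
    charset = declared_charset or fallback
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown charset name or undecodable bytes: decode with the
        # fallback and substitute U+FFFD for anything that does not fit.
        return raw.decode(fallback, errors='replace')


def fetch_html(url, timeout=10):
    """Return the page at url as a str, or None if the request fails."""
    try:
        with req.urlopen(url, timeout=timeout) as webpage:
            raw = webpage.read()
            # Charset from the Content-Type header, or None if absent.
            declared = webpage.headers.get_content_charset()
    except OSError as exc:
        # urllib.error.URLError and socket timeouts are OSError subclasses.
        print('request failed:', exc)
        return None
    return decode_html(raw, declared)
```

With these helpers, the three fetch lines could be replaced by data = fetch_html(url), followed by a check that data is not None before continuing.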

3. Extract all tables

table = re.findall(r'<table(.*?)</table>', data, re.S)

Each match corresponds to one table on the page.
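To see why the pattern uses both the non-greedy .*? and the re.S flag, it may help to try it on a small string. The HTML below is a made-up sample, not the real page.

```python
import re

# Made-up sample: two multi-line tables separated by a paragraph.
sample = (
    '<table class="a">\n<tr><td>1</td></tr>\n</table>'
    '<p>text</p>'
    '<table class="b">\n<tr><td>2</td></tr>\n</table>'
)

# Non-greedy with re.S: one match per table, newlines included.
tables = re.findall(r'<table(.*?)</table>', sample, re.S)

# Without re.S, '.' does not match newlines, so multi-line tables are missed.
no_dotall = re.findall(r'<table(.*?)</table>', sample)

# Greedy '.*' runs from the first '<table' to the last '</table>'.
greedy = re.findall(r'<table(.*)</table>', sample, re.S)

print(len(tables), len(no_dotall), len(greedy))   # → 2 0 1
```

Note that the captured group starts right after '<table', so each match still begins with the tag's attributes and its closing '>'. That is harmless here because the later steps only search for tr and td tags inside the match.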

4. Preprocessing

firsttable = table[0]           # take the first table on the page
# Data cleaning: strip &nbsp; entities, \u3000 (ideographic space) and ordinary spaces from the table
firsttable = firsttable.replace('&nbsp;', '')
firsttable = firsttable.replace('\u3000', '')
firsttable = firsttable.replace(' ', '')

The cleaned string corresponds to the score table shown on the page.
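The effect of the three replace calls is easier to see on a single cell. The sketch below wraps them in a hypothetical helper, clean_table_html, and runs it on a made-up fragment. One side effect worth knowing: removing every space also removes the spaces inside tags, yet the td pattern used in the next step still matches.

```python
import re


def clean_table_html(fragment):
    """Strip &nbsp; entities, ideographic spaces (U+3000) and ASCII spaces."""
    for junk in ('&nbsp;', '\u3000', ' '):
        fragment = fragment.replace(junk, '')
    return fragment


# Made-up cell with all three kinds of padding.
sample = '<td width="80"><span>&nbsp;北 京\u3000</span></td>'
cleaned = clean_table_html(sample)
print(cleaned)   # → <tdwidth="80"><span>北京</span></td>

# '<td.*?>' still matches the squashed opening tag '<tdwidth="80">'.
cells = re.findall(r'<td.*?>(.*?)</td>', cleaned, re.S)
print(cells)     # → ['<span>北京</span>']
```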

5. Parsing the data

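def step3():
    score = []
    # Follow the comment prompts below to complete each step. To inspect the
    # HTML in detail, open url in a browser and view the page source.
    #********** Begin *********#
    # Step 1: collect every table row (each <tr>...</tr> pair) in the list rows.
    rows = re.findall(r'<tr(.*?)</tr>', firsttable, re.S)
    # Step 2: for each row, extract the text of every <td> cell into items,
    # then append items to scorelist.
    scorelist = []
    for row in rows:
        items = []
        tds = re.findall(r'<td.*?>(.*?)</td>', row, re.S)
        for td in tds:
            rightindex = td.find('</span>')        # -1 means no </span> in this cell
            if rightindex == -1:
                # No <span> wrapper: strip any remaining tags and keep the text
                items.append(re.sub(r'<.*?>', '', td, flags=re.S))
                continue
            leftindex = td[:rightindex].rfind('>')
            items.append(td[leftindex+1:rightindex])
        scorelist.append(items)
    # Step 3: store each 7-element list of province plus scores (missing scores
    # are marked with \) in the new list score, without any extra information.
    for record in scorelist[3:]:    # skip the three header rows
        record.pop()                # drop the trailing extra column
        score.append(record)
    #********** End **********#
    return score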

print(step3())

Running this prints the list of per-province score records.
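Regular expressions work for this page but depend on its exact markup. As an alternative, and not the method used in this note, the standard library's html.parser can collect cell text without string slicing. The sketch below is a minimal version under that assumption: the class name TableRows and the sample rows are made up, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None      # cells of the row being read, or None
        self._cell = None     # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        # Text outside a <td> cell is ignored.
        if self._cell is not None:
            self._cell.append(data)


# Made-up sample rows; the values are placeholders, not real score lines.
sample = (
    '<table>'
    '<tr><td><p><span>Province</span></p></td><td><span>Score</span></td></tr>'
    '<tr><td><span>A</span></td><td><span>600</span></td></tr>'
    '<tr><td><span>B</span></td><td>610</td></tr>'
    '</table>'
)
parser = TableRows()
parser.feed(sample)
parser.close()
print(parser.rows)
```

Because the parser gathers all text inside a cell, it does not matter whether the value is wrapped in span, p, or nothing at all. Header rows and the trailing column would still need to be dropped afterwards, as step3 does with scorelist[3:] and pop().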
