Collecting Web Page Links in Different Ways

Original

2020-02-23 17:34

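Requirement: download the www.pku.edu.cn page programmatically and extract all of its links using several different methods. [Note: keep only links that have an href whose value does not start with # and does not contain JavaScript/vbscript.]
1. Extract all links with plain string handling, in the form of a name and its corresponding link. Note: BS4 may not be used.
2. Extract all links with regular expressions, in the form of a name and its corresponding link. Note: BS4 may not be used.
3. Extract all links with BS (BeautifulSoup), in the form of a name and its corresponding link.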
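
The filter in the note above can be written as a small predicate that all three methods could share. This is only a sketch: the function name `is_wanted_href` and the sample values are made up for illustration, and the case-insensitive match is an assumption (the requirement writes "JavaScript" with capitals, so matching only the lowercase form might miss some links).

```python
def is_wanted_href(href):
    """Return True if href passes the filter stated in the requirement."""
    if not href:
        return False                 # missing or empty href
    if href.startswith('#'):
        return False                 # in-page anchor
    lowered = href.lower()           # assumption: match case-insensitively
    return 'javascript' not in lowered and 'vbscript' not in lowered


# Made-up sample values, not taken from the real page
samples = ['index.htm', '#top', 'javascript:void(0)', 'JavaScript:;', '',
           'http://www.pku.edu.cn/about.htm']
kept = [h for h in samples if is_wanted_href(h)]
print(kept)
```

Keeping the rule in one place means the three scripts cannot drift apart in what they accept.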


Code 1:

import requests
# Fetch the page content
url = 'http://www.pku.edu.cn'  # target URL
res = requests.get(url)
res.encoding = "utf-8"  # set the page encoding
# Collapse the page into a single line
strTxt = res.text.replace('\n', '')

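# ---- String processing ----
# Split on the closing tag so that each piece holds at most one link
allFind = strTxt.split("</a>")
# Dictionary mapping link name -> href
dictA = {}
# Process each piece that may contain a link
for include_a in allFind:
    # Test for '<a ' (with the space) so that index() below cannot raise
    # ValueError on pieces that only contain tags such as <abbr>
    if '<a ' in include_a:
        # Drop everything before the '<a '
        index_a = include_a.index('<a ')
        tem = include_a[index_a:]
        if 'href' in tem and '>' in tem:
            # After the split, nameList[0] is the opening tag and
            # nameList[1] is the link name
            nameList = tem.split('>')
            name = nameList[1]
            # Skip pieces where 'href' appears only outside the opening tag
            if 'href' not in nameList[0]:
                continue
            index_href = nameList[0].index('href')
            # +6 skips past href=" ; the value runs up to the next double quote
            hrefList = nameList[0][index_href + 6:].split('"')
            href = hrefList[0]
            # Apply the requirement: the href must not start with '#' and must not
            # contain javascript/vbscript; names that are really images are dropped
            hrefLower = href.lower()
            if (href != '') and (href[0] != '#') and ('javascript' not in hrefLower) and (
                    'vbscript' not in hrefLower) and ('img' not in name) and (name != ""):
                # If the link is relative, prepend the site prefix
                if href[0:4] != 'http':
                    href = 'http://www.pku.edu.cn/' + href.lstrip('/')
                dictA[name] = href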
# ---- Print and save ----
numA = 0
for i in dictA:
    print(i, dictA[i], sep="    ")
    numA += 1
print(numA)

# Note: this assumes the ./txt directory already exists
with open('./txt/1.txt', 'w', encoding='utf-8') as fd:
    for i in dictA:
        print(i, dictA[i], file=fd, sep="\t")
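
Code 1 completes a relative link by putting the site prefix in front of it. As a possible alternative (a sketch, not part of the original code), the standard library's `urllib.parse.urljoin` resolves relative, root-relative, and protocol-relative hrefs against a base URL. The helper name `absolutize` and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.pku.edu.cn/'  # same site as in the code above


def absolutize(href, base=BASE_URL):
    """Resolve href against the base URL; absolute URLs pass through unchanged."""
    return urljoin(base, href)


# Made-up sample hrefs covering the common shapes
for href in ['about.htm', '/about.htm', '//example.com/page',
             'https://www.pku.edu.cn/x.htm']:
    print(href, '->', absolutize(href))
```

This handles hrefs that start with `/` or `//` without any special-casing, which plain string concatenation does not.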

Code 2:

import requests
import re
# Fetch the page content
url = 'http://www.pku.edu.cn'  # target URL
res = requests.get(url)
res.encoding = "utf-8"  # set the page encoding
# Collapse the page into a single line
strTxt = res.text.replace('\n', '')

# ---- Regex processing ----
# Find every <a ...>...</a> element
allFind = re.findall(r"<a .*?</a>", strTxt)
# Dictionary mapping link name -> href
dictA = {}
# Process each matched link
for include_a in allFind:
    # Split on angle brackets: split_a[1] is the opening tag (which holds
    # the href) and split_a[2] is the link name
    split_a = re.split('>|<', include_a)
    name = split_a[2]
    # Only handle links with a non-empty name and an href
    if len(name) > 0 and 'href' in split_a[1]:
        # Capture the double-quoted href value
        match_href = re.search(r'href="(.*?)"', split_a[1])
        # Skip hrefs that are not written with double quotes
        if match_href is None:
            continue
        strHref = match_href.group(1)
        # Apply the requirement: the href must not start with '#' and must not
        # contain javascript/vbscript
        hrefLower = strHref.lower()
        if (strHref != '') and (strHref[0] != '#') and ('javascript' not in hrefLower) and ('vbscript' not in hrefLower):
            # If the link is relative, prepend the site prefix
            if strHref[0:4] != 'http':
                strHref = 'http://www.pku.edu.cn/' + strHref.lstrip('/')
            dictA[name] = strHref
# ---- Print and save ----
numA = 0

for i in dictA:
    print(i, dictA[i], sep="    ")
    numA += 1

print(numA)

# Note: this assumes the ./txt directory already exists
with open('./txt/2.txt', 'w', encoding='utf-8') as fd:
    for i in dictA:
        print(i, dictA[i], file=fd, sep="\t")
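
Code 2 finds each link with one regex and then picks it apart with further splits and searches. A more compact variant, sketched below, uses two capture groups to pull the href and the name out in a single pass. The pattern `LINK_RE` and the sample HTML are made up for illustration; the sketch assumes double-quoted href values and link text without nested tags, so it is not a general HTML parser.

```python
import re

# Assumptions: href is double-quoted and the link text contains no nested tags
LINK_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"[^>]*>([^<]*)</a>', re.IGNORECASE)

# Made-up sample markup, not the real page
sample_html = (
    '<ul>'
    '<li><a class="nav" href="about.htm">About</a></li>'
    '<li><a href="#top">Top</a></li>'
    '<li><a href="javascript:void(0)">Menu</a></li>'
    '<li><a href="http://www.pku.edu.cn/news.htm" target="_blank">News</a></li>'
    '</ul>'
)

# findall returns (href, name) tuples, one per matched link
pairs = [(name.strip(), href) for href, name in LINK_RE.findall(sample_html)]

# Apply the same filter as the requirement
links = {
    name: href
    for name, href in pairs
    if name and href and not href.startswith('#')
    and 'javascript' not in href.lower()
    and 'vbscript' not in href.lower()
}
print(links)
```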

Code 3:

import requests
from bs4 import BeautifulSoup
# Fetch the page content
url = 'http://www.pku.edu.cn'  # target URL
res = requests.get(url)
res.encoding = "utf-8"  # set the page encoding

# ---- Parse the page ----
soup = BeautifulSoup(res.text, "html.parser")

# Select every <a> element that has an href attribute
a_include_href = soup.select('a[href]')
# Dictionary mapping link name -> href
dictA = {}
# Process each link
for include_a in a_include_href:
    # The name is the element's text; the href comes from its attributes
    name = include_a.get_text()
    href = str(include_a.attrs['href'])
    # Apply the requirement: the href must not start with '#' and must not
    # contain javascript/vbscript; empty names are dropped
    hrefLower = href.lower()
    if (href != '') and (href[0] != '#') and ('javascript' not in hrefLower) and ('vbscript' not in hrefLower) and (name != "") and ('img' not in name):
        # If the link is relative, prepend the site prefix
        if href[0:4] != 'http':
            href = 'http://www.pku.edu.cn/' + href.lstrip('/')
        dictA[name] = href
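# ---- Print and save ----
num_A = 0
for i in dictA:
    # Skip names that consist only of whitespace
    if i.strip() != '':
        print(i, dictA[i], sep="   ")
        num_A += 1
print("Number of links:", num_A)

# Note: this assumes the ./txt directory already exists.
# The same whitespace check is applied here so that the file matches the count above.
with open('./txt/3.txt', 'w', encoding='utf-8') as fd:
    for i in dictA:
        if i.strip() != '':
            print(i, dictA[i], file=fd, sep="\t")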
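
One thing to keep in mind with all three scripts: they store results in a dictionary keyed by link name, so if two links share the same name, the later one overwrites the earlier one. The sketch below shows the effect and one possible way around it, using made-up (name, href) pairs rather than data from the real page.

```python
from collections import defaultdict

# Made-up (name, href) pairs standing in for what the scripts collect
pairs = [
    ('More', 'http://www.pku.edu.cn/news.htm'),
    ('More', 'http://www.pku.edu.cn/events.htm'),
    ('About', 'http://www.pku.edu.cn/about.htm'),
    ('About', 'http://www.pku.edu.cn/about.htm'),  # exact duplicate
]

# Keyed by name, as in the scripts above: the second 'More' replaces the first
by_name = {}
for name, href in pairs:
    by_name[name] = href

# Alternative: keep every distinct href seen under each name
all_links = defaultdict(list)
for name, href in pairs:
    if href not in all_links[name]:
        all_links[name].append(href)

print(len(by_name))
print(dict(all_links))
```

Whether to keep one link per name or all of them depends on how "all links" in the requirement is read, so treat this as an option rather than a correction.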
