Python XPath in practice: Today in History

Original

旋凯凯旋

2020-06-25 08:36

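Rambling

At the end of my earlier post on the anniversary app, I said I wanted to add a "today in history" feature. I had not scraped anything with Python for a long while, so I gave it a quick try and got the data into the app with a few lines of code. That part is not today's main topic, but I am posting it anyway.

Approach

Scrape the data
Save the data to a local database
Read the records from the database and show a random one on the page
Clicking the link opens the browser to view the details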
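
The store-and-pick steps above can be sketched with the standard library's sqlite3 module. This is only a rough illustration: the table name, the column names and the helper functions (init_db, save_things, random_thing) are assumptions made for the example, not the app's real schema. The column order (id, date, event, URL) is chosen so that indexes 1, 2 and 3 line up with how the records are read in the added source code.

```python
import random
import sqlite3


def init_db(conn):
    # Hypothetical schema: one row per event, looked up by "MM-DD".
    conn.execute(
        "CREATE TABLE IF NOT EXISTS things ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "year_date TEXT, info TEXT, url TEXT, month_day TEXT)"
    )


def save_things(conn, month_day, all_thing):
    # all_thing maps a date string to [event, detail_url].
    rows = [(date, info, url, month_day) for date, (info, url) in all_thing.items()]
    conn.executemany(
        "INSERT INTO things (year_date, info, url, month_day) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()


def random_thing(conn, month_day):
    # Return one (id, year_date, info, url) row, or None if nothing is stored.
    rows = conn.execute(
        "SELECT id, year_date, info, url FROM things WHERE month_day = ?",
        (month_day,),
    ).fetchall()
    return random.choice(rows) if rows else None


conn = sqlite3.connect(":memory:")  # in-memory database, for the demo only
init_db(conn)
save_things(conn, "03-07", {
    "1979年3月7日": ["旅行者1號發現木星",
                  "http://www.todayonhistory.com/3/7/LvHangZhe1gHuanDeHangXing.html"],
})
row = random_thing(conn, "03-07")
print(row[1], row[2])
```

Returning None for a day with no rows gives the caller a clear signal to scrape first and query again.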

Added source code

"""
            Label(self.page,font=("微軟雅黑", 25),text=title).pack()

            Label(self.page,font=("微軟雅黑", 10),text=start_day+"---"+now_day,fg = "red").pack()
            Label(self.page,font=("微軟雅黑", 20),text=date,fg = "red").pack()

            Label(self.page,font=("微軟雅黑", 15),text=infos).pack()
            
            Button(self.page,text='刪除該紀念日', bd =5,width=10,command=lambda :del_ann(title)).pack(anchor=N)
            #Add the new code after this snippet, inside the Infopage class

"""            

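            # Check the database first; scrape only if nothing is stored for this date.
            a=thing_showdb(now_day[5:])
            if a==[]:
                year_today_main(now_day[5:])
                time.sleep(1)
                a=thing_showdb(now_day[5:])

            # Pick a single record so the date, event and URL belong to the same entry.
            # randint is inclusive at both ends, so the upper bound must be len(a)-1.
            record=a[randint(0,len(a)-1)]
            year_date=record[1]
            info=record[2]
            url=record[3]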
            j=0
            urls=''
            infos=''
            for i in info:
                if j<=10:
                    infos+=i
                    j+=1
                else:
                    infos+=i+'\n'
                    j=0
            j=0
            for i in url:
                if j<=50:
                    urls+=i
                    j+=1
                else:
                    urls+=i+'\n'
                    j=0
            def open_url(event):
                webbrowser.open(url, new=0)
            Label(self.page,font=("微軟雅黑", 10),text='歷史上的今天\n'+year_date).pack()
            Label(self.page,font=("微軟雅黑", 15),text=infos,fg = "red").pack()
            link=Label(self.page,font=("微軟雅黑", 10),text=urls,fg = "blue")
            link.pack()
            link.bind("<Button-1>", open_url)
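
The two character-counting loops above break the event text after every 12 characters and the URL after every 52. A small slicing helper can express the same idea more compactly. The name wrap_every is made up for this sketch, and unlike the loops it does not leave a trailing newline when the text length is an exact multiple of the width.

```python
def wrap_every(text, width):
    """Insert a newline after every `width` characters of `text`."""
    return "\n".join(text[i:i + width] for i in range(0, len(text), width))


infos = wrap_every("abcdefghijklmnopqrstuvwxyz", 12)
print(infos)
```

With such a helper the two loops would reduce to something like infos = wrap_every(info, 12) and urls = wrap_every(url, 52).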

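Scraping source code

In this part I did not write the data into the database; the scraped data is simply turned into a dictionary of the form {'date': [event, detail URL]}, for example: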

{'1979年3月7日': ['旅行者1號發現木星','http://www.todayonhistory.com/3/7/LvHangZhe1gHuanDeHangXing.html']}

# Today in history
import datetime

import lxml.html
import requests

# Today's date as zero-padded [month, day] strings, e.g. ['03', '07']
today=str(datetime.datetime.now().date())[5:].split('-')
html_url='http://www.todayonhistory.com/%s/%s/'%(today[0],today[1])
# print(html_url)


def get_url(html_url,encode):
    """
    Fetch the full content of a web page.
    :param html_url: address of the page to request
    :param encode: character encoding used to decode the response
    :return: the page content as text
    """
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
    }
    response = requests.get(html_url, headers=headers)
    response.encoding = encode
    html_content = response.text
    return html_content

# print(get_url(html_url,'utf-8'))


def get_data(html_content):
    metree = lxml.html.etree
    parser = metree.HTML(html_content)
    # One <div class="t"> node per event
    thing_list= parser.xpath("//ul[@class='oh']/li/div[@class='pic']/div[@class='t']")
    # print(thing_list)
    all_thing={}
    for i in thing_list:
        temp=[]
        temp.append(i.xpath("./a/text()")[0])   # event title
        temp.append(i.xpath("./a/@href")[0])    # detail page URL
        all_thing[i.xpath("./span/text()")[0]]=temp   # the date is the key
    print(all_thing)

get_data(get_url(html_url,'utf-8'))


{'1979年3月7日': ['旅行者1號發現木星是帶有光環的行星','http://www.todayonhistory.com/3/7/LvHangZhe-1HaoFaXianMuXingShiDaiYouGuangHuanDeHangXing.html']}

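Detailed steps

Build the URL. The site serves its data by date: http://www.todayonhistory.com/ is the home page, and appending a date returns that day's data. For example, http://www.todayonhistory.com/03/07/ is the data for March 7.
So we take today's date and slice it into a list of the form ['03', '07'].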

today=str(datetime.datetime.now().date())[5:].split('-')  # get today's date and slice it into [month, day]
html_url='http://www.todayonhistory.com/%s/%s/'%(today[0],today[1])  # build the URL from the date
# print(html_url)
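
As an alternative to slicing the string form of the date, the same URL can be built with strftime, which also makes it easy to ask for a day other than today. The sketch below assumes the zero-padded /MM/DD/ form described above; build_day_url and BASE_URL are names invented for the example.

```python
import datetime

BASE_URL = "http://www.todayonhistory.com/%s/%s/"


def build_day_url(day=None):
    """Return the listing URL for `day` (defaults to today)."""
    if day is None:
        day = datetime.date.today()
    # %m and %d are the zero-padded month and day of the month.
    return BASE_URL % (day.strftime("%m"), day.strftime("%d"))


print(build_day_url(datetime.date(2020, 3, 7)))
# prints http://www.todayonhistory.com/03/07/
```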

Request the URL to get the page content. To find the encoding, right-click the page and view its source; the first line is <meta charset="utf-8" />, so we call get_url(html_url,'utf-8').

def get_url(html_url,encode):
    """
    Fetch the full content of a web page.
    :param html_url: address of the page to request
    :param encode: character encoding used to decode the response
    :return: the page content as text
    """
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
    }
    response = requests.get(html_url, headers=headers)
    response.encoding = encode
    html_content = response.text
    return html_content
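
Reading the charset out of the page source by hand can also be automated. The sketch below looks for a charset declaration in the first bytes of the raw page and falls back to a default when none is found. sniff_charset is a hypothetical helper written for this example, and the regular expression only covers the common meta tag forms.

```python
import re

# Matches both <meta charset="utf-8"> and
# <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
META_CHARSET = re.compile(rb"<meta[^>]*charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)


def sniff_charset(raw_html, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default` if none is found."""
    match = META_CHARSET.search(raw_html[:2048])
    return match.group(1).decode("ascii").lower() if match else default


sample = b'<html><head><meta charset="utf-8" /></head></html>'
print(sniff_charset(sample))
```

Inside get_url this could be used as response.encoding = sniff_charset(response.content), since response.content holds the undecoded bytes. requests also offers response.apparent_encoding, which guesses the encoding from the content itself.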

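Next we look for the information we want in the returned page content; see the figure for the details.
Once the layout is known, a single XPath expression retrieves what we need. We start matching from the second level of the structure:
thing_list= parser.xpath("//ul[@class='oh']/li/div[@class='pic']/div[@class='t']")
It returns a list, which we then iterate over to get the circled content. The text inside an element is read with text() on the current node, and the link is read with @ followed by the attribute name. For example, in <a href='####'> one two three </a>, "one two three" is the element's text and href is the attribute.
Finally we store what we collected in whatever format we like. Here I create a dictionary, use the date as the key, and put the event and its detail URL into a list as the value.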

def get_data(html_content):
    metree = lxml.html.etree
    parser = metree.HTML(html_content)
    # One <div class="t"> node per event
    thing_list= parser.xpath("//ul[@class='oh']/li/div[@class='pic']/div[@class='t']")
    all_thing={}
    for i in thing_list:
        temp=[]
        temp.append(i.xpath("./a/text()")[0])   # event title
        temp.append(i.xpath("./a/@href")[0])    # detail page URL
        all_thing[i.xpath("./span/text()")[0]]=temp   # the date is the key
    print(all_thing)
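
To see how the path expression walks the structure without hitting the site, it can be tried on a small hand-written snippet. The sketch below deliberately swaps lxml for the standard library's xml.etree.ElementTree so that it runs with no extra install. ElementTree supports only a subset of XPath, so text() and @href are replaced by .text and .get("href"). The snippet, its dates and its example.com URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for the page structure (not real site data).
SNIPPET = """
<html><body>
<ul class="oh">
  <li><div class="pic"><div class="t">
    <span>1979-03-07</span>
    <a href="http://example.com/a.html">Event A</a>
  </div></div></li>
  <li><div class="pic"><div class="t">
    <span>1876-03-07</span>
    <a href="http://example.com/b.html">Event B</a>
  </div></div></li>
</ul>
</body></html>
"""


def extract_events(markup):
    """Return a list of (date, title, url) tuples, one per <div class="t">."""
    root = ET.fromstring(markup.strip())
    events = []
    path = ".//ul[@class='oh']/li/div[@class='pic']/div[@class='t']"
    for node in root.findall(path):
        date = node.find("./span").text   # lxml: node.xpath("./span/text()")[0]
        link = node.find("./a")
        events.append((date, link.text, link.get("href")))  # lxml: ./a/text(), ./a/@href
    return events


events = extract_events(SNIPPET)
for date, title, url in events:
    print(date, title, url)
```

A list of tuples is used here instead of a dictionary so that two entries sharing the same date string would both be kept; whether that matters depends on the page.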

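More rambling

This is a very simple way to scrape, but a practical one. Personally I find XPath easier to use than BeautifulSoup4, although for dynamic pages you sometimes need selenium to drive a browser like a user would and get the content that way.

If we wanted to scrape a whole year of data from the site by building each URL and fetching it, the IP would be banned almost immediately without some throttling. Back when I did not know better, I was banned by one major site after another, a day at a time. We have little real use for this data anyway, so a bit of practice is enough. I tried fetching twice per second and hit no limit, but I do not know what happens above that rate. A delay with time.sleep(t), where t is the number of seconds, works every time and is much easier than building an IP pool; it just takes longer.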
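
A throttled crawl over a whole year could look roughly like the sketch below. It generates one URL per calendar day and sleeps between requests. year_urls and crawl are hypothetical names, the /MM/DD/ URL form is assumed from the steps above, and the fetch function is passed in so the loop can be tried without touching the network.

```python
import calendar
import time

BASE_URL = "http://www.todayonhistory.com/%02d/%02d/"


def year_urls(year=2020):
    """Yield one listing URL per calendar day of `year`."""
    for month in range(1, 13):
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            yield BASE_URL % (month, day)


def crawl(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL, sleeping `delay` seconds between requests."""
    pages = {}
    for index, url in enumerate(urls):
        if index:               # no need to wait before the first request
            time.sleep(delay)
        pages[url] = fetch(url)
    return pages


urls = list(year_urls(2020))
print(len(urls))  # 2020 is a leap year, so this prints 366
```

With the article's function this would be called as crawl(year_urls(), lambda u: get_url(u, 'utf-8')). At one request per second a full year takes about six minutes, which matches the trade-off described above: slow but simple.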
