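Python + Selenium: scraping DXY (丁香園) daily COVID-19 data on a schedule and drawing a similar map (deployed to a cloud server)

Disclaimer: this is for technical exchange only. Do not use it for illegal purposes; this blog is not responsible for any loss caused by illegal use.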

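Preface

If I have to say where this article came from, the story starts with those few people who ate wild game...
Waking up the day before yesterday: a few days of vacation left. Yesterday: a dozen days left. Today: a month left...
Every day is much the same stay-at-home life as any other vacation. The only difference is that my phone is no longer for shows, movies and games; I now spend every day following the news on the COVID-19 outbreak. I truly hope this fight against the epidemic ends soon so that we can go back to the life we had before.
Stay strong, Wuhan! Stay strong, China!

The site scraped here is DXY (丁香園), which I believe is what most of us have been checking lately.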

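Part 1: Preparation

Python 3.7
selenium: a browser automation testing framework; install it with pip install selenium
pyecharts: a Python wrapper around the ECharts JavaScript charting library, known for making everything configurable; its official documentation is very thorough. Install it with pip install pyecharts, and also install the following map packages:

World map: pip install echarts-countries-pypkg
China map: pip install echarts-china-provinces-pypkg
China city maps: pip install echarts-china-cities-pypkg

A cloud server

Part 2: Scraping the data and drawing the map

Step 1: Analyzing the page

First, send a request with the requests module to see whether the data comes back: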

import requests
url='https://ncov.dxy.cn/ncovh5/view/pneumonia_peopleapp?from=timeline&isappinstalled=0'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
r=requests.get(url,headers=headers)
print(r.text)

The output is garbled, and near the end it contains the following:

<noscript>You need to enable JavaScript to run this app.</noscript>

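This means JavaScript has to be executed. A quick search on Baidu suggested that the page is built with React. Given the limits of my own skills, my only option at this point was selenium, which drives a real browser and can therefore run the JavaScript.

Since I only need the data from a single page, the extra time this takes is acceptable.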
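
On the garbled output itself: one likely cause (an assumption, not something checked against the live page) is decoding rather than JavaScript. When a text response carries no charset in its headers, requests falls back to ISO-8859-1, so UTF-8 Chinese text turns into mojibake. The sketch below reproduces that effect with plain bytes; the sample string is made up.

```python
# Simulate a UTF-8 response body that gets decoded with the wrong codec.
raw = '全國確診'.encode('utf-8')      # bytes as they would arrive over the wire

garbled = raw.decode('iso-8859-1')    # wrong codec: every byte becomes one character
fixed = raw.decode('utf-8')           # right codec: the original text comes back

print(repr(garbled))
print(fixed)

# With requests, the equivalent fix would be to set the codec before reading r.text:
#     r.encoding = 'utf-8'   # or r.apparent_encoding
```

Fixing the encoding only makes the raw HTML readable. Content that the page builds with JavaScript still needs a browser, which is why selenium is used here.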

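So which data do I want? The following:

The nationwide statistics as of the current time
The descriptive information about the virus
All the data for every province and its cities
The data for every region of the world

Looking at the page, several spots have to be clicked before the rest of the data is shown.

Step 2: Writing the code

Import the packages: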

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import parsel
import time
import json
import os
import datetime
import pyecharts
from pyecharts import options as opts

Define the function that scrapes and saves the data:

def get_save_data():
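    '''
    When deploying to a cloud server, install the pyvirtualdisplay module and
    uncomment the first five commented-out lines of code below before running;
    otherwise an error is raised.
    '''
    #from pyvirtualdisplay import Display
    #display = Display(visible=0, size=(800, 600))
    #display.start()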
    options=webdriver.ChromeOptions()
    #options.add_argument('--disable-gpu')
    #options.add_argument("--no-sandbox")
    options.add_argument('--headless') # scrape in headless mode
    d=webdriver.Chrome(options=options)
    d.get('https://ncov.dxy.cn/ncovh5/view/pneumonia_peopleapp?from=timeline&isappinstalled=0')
    time.sleep(2)
    ActionChains(d).move_to_element(d.find_element_by_xpath('//p[@class="mapTap___1k3MH"]')).perform()
    time.sleep(2)
    d.find_element_by_xpath('//span[@class="openIconView___3hcbn"]').click()
    time.sleep(2)
    for i in range(3):
        mores=d.find_elements_by_xpath('//div[@class="areaBox___3jZkr"]')[1].find_elements_by_xpath('./div')[3:-1]
        ActionChains(d).move_to_element(d.find_element_by_xpath('//div[@class="rumorTabWrap___2kiW4"]/p')).perform()
        mores[i].click()
        time.sleep(2)
    response=parsel.Selector(d.page_source)
    china=response.xpath('//div[@class="areaBox___3jZkr"]')[0]
    world=response.xpath('//div[@class="areaBox___3jZkr"]')[1]

    # Get and process the descriptive information about the virus
    content=response.xpath('//div[@class="mapTop___2VZCl"]/div[1]//text()').getall()
    s=''
    for i,j in enumerate(content):
        s=s+j
        if (i+1)%2 == 0:
            s=s+'\n'
        if j in ['確診','疑似','重症','死亡','治癒']:
            s=s+'\n'
    now=s.strip()
    msg=response.xpath('//div[@class="mapTop___2VZCl"]/div//text()').getall()
    s=''
    for i in msg:
        if i not in now:
            s=s+i+'\n'
    msg=s.strip()
    content=msg+'\n\n'+now

    # Get the nationwide data
    china_data=[]
    for div_list in china.xpath('./div')[2:-1]:
        flag=0
        city_list=[]
        for div in div_list.xpath('./div'):
            if flag == 0:
                if div.xpath('./p[1]/text()').get() is not None:
                    item={}
                    item['省份']=div.xpath('./p[1]/text()').get()
                    item['確診']=div.xpath('./p[2]/text()').get() if div.xpath('./p[2]/text()').get() is not None else '0'
                    item['死亡']=div.xpath('./p[3]/text()').get() if div.xpath('./p[3]/text()').get() is not None else '0'
                    item['治癒']=div.xpath('./p[4]/text()').get() if div.xpath('./p[4]/text()').get() is not None else '0'
                    flag=1
            else:
                if div.xpath('./p[1]/span/text()').get() is not None:
                    temp={}
                    temp['城市']=div.xpath('./p[1]/span/text()').get()
                    temp['確診']=div.xpath('./p[2]/text()').get() if div.xpath('./p[2]/text()').get() is not None else '0'
                    temp['死亡']=div.xpath('./p[3]/text()').get() if div.xpath('./p[3]/text()').get() is not None else '0'
                    temp['治癒']=div.xpath('./p[4]/text()').get() if div.xpath('./p[4]/text()').get() is not None else '0'
                    city_list.append(temp)
        item.update({'city_list':city_list})
        china_data.append(item)

    # Get the worldwide data
    world_data=[]
    for div_list in world.xpath('./div')[2:-1]:
        flag=0
        country_list=[]
        for div in div_list.xpath('./div'):
            if flag == 0:
                if div.xpath('./p[1]/text()').get() is not None:
                    item={}
                    item['地區']=div.xpath('./p[1]/text()').get()
                    item['確診']=div.xpath('./p[2]/text()').get() if div.xpath('./p[2]/text()').get() is not None else '0'
                    item['死亡']=div.xpath('./p[3]/text()').get() if div.xpath('./p[3]/text()').get() is not None else '0'
                    item['治癒']=div.xpath('./p[4]/text()').get() if div.xpath('./p[4]/text()').get() is not None else '0'
                    flag=1
            else:
                if div.xpath('./p[1]/span/text()').get() is not None:
                    temp={}
                    temp['國家']=div.xpath('./p[1]/span/text()').get()
                    temp['確診']=div.xpath('./p[2]/text()').get() if div.xpath('./p[2]/text()').get() is not None else '0'
                    temp['死亡']=div.xpath('./p[3]/text()').get() if div.xpath('./p[3]/text()').get() is not None else '0'
                    temp['治癒']=div.xpath('./p[4]/text()').get() if div.xpath('./p[4]/text()').get() is not None else '0'
                    country_list.append(temp)
        item.update({'country_list':country_list})
        world_data.append(item)
    d.quit()

    # Save the data
    if not os.path.exists('./json'):
        os.makedirs('./json')
    if not os.path.exists('./txt'):
        os.makedirs('./txt')
    now_time=datetime.datetime.now().strftime("%Y-%m-%d") # today's date
    index=list(range(len(china_data)))
    data=dict(zip(index,china_data))
    json_str = json.dumps(data, indent=4,ensure_ascii=False)
    with open(f'./json/{now_time}.json', 'w', encoding='utf-8') as f:
        f.write(json_str)
    index=list(range(len(world_data)))
    data=dict(zip(index,world_data))
    json_str = json.dumps(data, indent=4,ensure_ascii=False)
    # The world data goes to the current directory: get_html() reads every file in ./json as China data
    with open(f'{now_time}.json', 'w', encoding='utf-8') as f:
        f.write(json_str)
    with open(f'./txt/{now_time}.txt', 'w', encoding='utf-8') as f:
        f.write(content)
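
The save step wraps the list of records in a dict keyed by index, and the map-drawing code later turns that dict back into a list. A minimal round-trip sketch of this file layout follows. The helper names save_records and load_records, the file name and all the numbers are placeholders invented for illustration.

```python
import json
import os
import tempfile

# Made-up sample records in the same shape as china_data (numbers are placeholders).
china_data = [
    {'省份': '湖北', '確診': '100', '死亡': '2', '治癒': '5',
     'city_list': [{'城市': '武漢', '確診': '60', '死亡': '1', '治癒': '3'}]},
    {'省份': '廣東', '確診': '20', '死亡': '0', '治癒': '1', 'city_list': []},
]

def save_records(records, path):
    """Write the records as {index: record}, mirroring get_save_data()."""
    data = dict(enumerate(records))
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(data, f, indent=4, ensure_ascii=False)

def load_records(path):
    """Read the file back into a list, mirroring get_html()."""
    with open(path, 'r', encoding='utf-8') as f:
        return list(json.load(f).values())

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, '2020-02-01.json')  # hypothetical file name
    save_records(china_data, path)
    loaded = load_records(path)

print(loaded == china_data)
```

JSON object keys are always strings, so the integer indexes come back as '0', '1' and so on. That does no harm here because only the values are used when loading.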

Define the function that draws the map; its output is an HTML file:

def get_html():
    # First, load the scraped data
    json_files=sorted(os.listdir('./json')) # sort by date; os.listdir() gives no guaranteed order
    json_data=[]
    date=[]
    for i in json_files:
        with open(f'./json/{i}','r',encoding='utf-8') as f:
            date.append(i.split('.')[0])
            temp=json.load(f)
            json_data.append(list(temp.values()))
    txt_files=sorted(os.listdir('./txt')) # same order as json_files, so the subtitles line up
    content_list=[]
    for i in txt_files:
        with open(f'./txt/{i}','r',encoding='utf-8') as f:
            content_list.append(f.read())

    # Now draw the chart
    t=pyecharts.charts.Timeline(init_opts=opts.InitOpts(width='1400px',height='1400px',page_title='武漢加油！中國加油！！'))
    for s,(i,data) in enumerate(zip(date,json_data)):
        value=[] # confirmed counts
        attr=[] # province names
        for each in data:
            attr.append(each['省份'])
            value.append(int(each['確診']))
        map0 = (
            pyecharts.charts.Map()
            .add(
                series_name='該省份確診數',data_pair=list(zip(attr,value)),maptype='china',is_map_symbol_show=True,zoom=1.1
            )
            .set_global_opts(title_opts=opts.TitleOpts(title="武漢加油！中國加油！！", # title
                                                subtitle=content_list[s], # subtitle
                                                title_textstyle_opts=opts.TextStyleOpts(color='red',font_size=30), # title text style
                                                subtitle_textstyle_opts=opts.TextStyleOpts(color='black',font_size=20),item_gap=20), # subtitle text style
                      visualmap_opts=opts.VisualMapOpts(pieces=[{"max": 9, "min": 1,'label':'1-9','color':'#FFEBCD'},
                                                                {"max": 99, "min": 10,'label':'10-99','color':'#F5DEB3'},
                                                                {"max": 499, "min": 100,'label':'100-499','color':'#F4A460'},
                                                                {"max": 999, "min": 500,'label':'500-999','color':'#FA8072'},
                                                                {"max": 9999,"min": 1000,'label':'1000-9999','color':'#ee2c0f'},
                                                                {"min": 10000,'label':'≥10000','color':'#5B5B5B'}],
                                                        is_piecewise=True,item_width=45,item_height=30,textstyle_opts=opts.TextStyleOpts(font_size=20))
            )
        )
        t.add(map0, "{}".format(i))
    # Save the chart as an HTML file
    t.render('武漢加油！中國加油！！.html')
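
To make the piecewise legend easier to reason about, here is a small pure-Python sketch that looks up which piece a confirmed count falls into. It reuses the ranges and colours from the pieces list above; find_piece is a hypothetical helper, and treating both min and max as inclusive is an assumption about how the chart handles boundaries.

```python
# Same ranges and colours as the `pieces` list passed to VisualMapOpts.
PIECES = [
    {'min': 1, 'max': 9, 'label': '1-9', 'color': '#FFEBCD'},
    {'min': 10, 'max': 99, 'label': '10-99', 'color': '#F5DEB3'},
    {'min': 100, 'max': 499, 'label': '100-499', 'color': '#F4A460'},
    {'min': 500, 'max': 999, 'label': '500-999', 'color': '#FA8072'},
    {'min': 1000, 'max': 9999, 'label': '1000-9999', 'color': '#ee2c0f'},
    {'min': 10000, 'label': '≥10000', 'color': '#5B5B5B'},
]

def find_piece(value):
    """Return the piece whose [min, max] range contains value, or None."""
    for piece in PIECES:
        lower = piece.get('min', float('-inf'))
        upper = piece.get('max', float('inf'))
        if lower <= value <= upper:
            return piece
    return None

for confirmed in (0, 7, 350, 12000):
    piece = find_piece(confirmed)
    print(confirmed, piece['label'] if piece else 'outside every piece')
```

A province with 0 confirmed cases matches none of the pieces, which is worth keeping in mind when reading the legend.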

Entry point:

if __name__ == '__main__':
    get_save_data()
    get_html()

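Step 3: Results

Running the program creates a file named 武漢加油！中國加油！！.html in the current directory. Opened in a browser, it looks like this:

PS: Because only images can be uploaded here, I converted the HTML to an image. The HTML itself is dynamic, with a timeline that can be dragged; since I only started scraping yesterday, there are just two days of data.

PS: The Timeline chart (which cycles through several charts along a time axis) offers no way to set a background colour, and the generated image turned out to have a black background when zoomed in. So I read the source, modified the JavaScript part myself, and can now produce an image with a configurable background colour. The code for the image conversion is below: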

from selenium import webdriver
import base64
import os

def decode_base64(data: str) -> bytes:
    """Decode base64, padding being optional.

    :param data: Base64 data as an ASCII byte string
    :returns: The decoded byte string.
    """
    missing_padding = len(data) % 4
    if missing_padding != 0:
        data += "=" * (4 - missing_padding)
    return base64.decodebytes(data.encode("utf-8"))
def save_as_png(image_data: bytes, output_name: str):
    with open(output_name, "wb") as f:
        f.write(image_data)

if __name__ == '__main__':
	options=webdriver.ChromeOptions()
	options.add_argument('--headless')
	d=webdriver.Chrome(options=options)
	url='file://'+os.path.abspath('武漢加油！中國加油！！.html')
	d.get(url)
	js = """
	    var ele = document.querySelector('div[_echarts_instance_]');
	    var mychart = echarts.getInstanceByDom(ele);
	    return mychart.getDataURL({
	        type: 'png',
	        pixelRatio: 2,
	        backgroundColor:'#FFFFFF',
	        excludeComponents: ['toolbox']
	    });
	"""
	content=d.execute_script(js)
	content_array = content.split(",")
	image_data = decode_base64(content_array[1])
	save_as_png(image_data, '武漢加油！中國加油！！.png')
	d.quit()
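
The script above relies on getDataURL() returning a string of the form data:image/png;base64,<payload>, which it splits at the comma before decoding. The sketch below shows that step in isolation with a round trip on stand-in bytes. data_url_to_bytes is a hypothetical helper, and the sample payload is just the 8-byte PNG signature, not a real image.

```python
import base64

def data_url_to_bytes(data_url: str) -> bytes:
    """Split a 'data:<mime>;base64,<payload>' string and decode the payload."""
    header, _, payload = data_url.partition(',')
    if not header.endswith(';base64'):
        raise ValueError('expected a base64 data URL')
    # Restore any '=' padding that was stripped, as decode_base64() does.
    payload += '=' * (-len(payload) % 4)
    return base64.b64decode(payload)

# Round trip with stand-in bytes, with the padding stripped on purpose.
original = b'\x89PNG\r\n\x1a\n'
encoded = base64.b64encode(original).decode('ascii').rstrip('=')
data_url = 'data:image/png;base64,' + encoded

print(data_url_to_bytes(data_url) == original)
```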

PS: You can also make the change directly in the library source. The file is D:\XXX\python3.7\Lib\site-packages\snapshot_selenium\snapshot.py; change the JavaScript it executes to the following:

SNAPSHOT_JS = """
    var ele = document.querySelector('div[_echarts_instance_]');
    var mychart = echarts.getInstanceByDom(ele);
    return mychart.getDataURL({
        type: '%s',
        pixelRatio: %s,
        backgroundColor:'#FFFFFF',
        excludeComponents: ['toolbox']
    });
"""

Note: restart the Python environment after modifying the source. After that, the call becomes much simpler:

from pyecharts.render import make_snapshot
from snapshot_selenium import snapshot
make_snapshot(snapshot,'武漢加油！中國加油！！.html','武漢加油！中國加油！！.png')

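Part 3: Deploying to a cloud server

1. Running the scraper on a schedule

First, put the scraping function get_save_data() into a .py file of its own (I named mine 2019-nCoV.py). Then edit the scheduled-task file /etc/crontab so that the script runs on a timer.

2. Getting the map (HTML file) through WeChat

Add the map-drawing function get_html() to a personal WeChat bot and give it a trigger condition. Sending the agreed command to File Transfer (文件傳輸助手) from WeChat on my phone runs get_html(), and the bot then sends the generated HTML file back to File Transfer, which gives me the current outbreak map.

I won't show the WeChat bot code again here; see my earlier article on implementing a WeChat auto-reply bot in Python and viewing messages that others have recalled (deployed to a cloud server).

The trigger condition is: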

if '2019' == msg['Text']:
    get_html()
    itchat.send('@fil@%s'%'武漢加油！中國加油！！.html',toUserName='filehelper')

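The scraping function could be added to the bot in the same way and triggered by its own command. I didn't do that because a scheduled task already fetches the data for me, and I can also run the .py file by sending a shell command to File Transfer.

Just add the following code to the WeChat bot's .py file.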

import subprocess
def cmd(command):
    output=subprocess.getoutput(command)
    return output

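And here is my trigger condition for it:

if msg['Text'].startswith('cmd'): # the command must start with 'cmd', since the first three characters are cut off
    output=cmd(msg['Text'][3:])
    if output != '':
        itchat.send(output, toUserName='filehelper')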

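Part 4: Running it

As the screenshot above shows, I first ran the scraping function, that is, I invoked the scheduled scraping .py file on the cloud server, and then sent the command for the current outbreak map. Opening it gives the same kind of map as shown earlier.

Final notes

I didn't draw the world map because the regions in pyecharts' world map are named in English and don't match the scraped names. Adding a Chinese-to-English mapping would fix that, but I couldn't be bothered; anyone interested is welcome to try. I did scrape the data, though, so if I want to draw it later I at least won't be missing the data, haha.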
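
For anyone who wants to try, the Chinese-to-English step could look roughly like the sketch below. Everything in it is illustrative: ZH_TO_EN covers only three names, the English names are assumed to match what the world map expects, world_pairs is a hypothetical helper, and the sample numbers are made up. The record shape follows world_data as built in get_save_data().

```python
# Illustrative mapping only: a real one must cover every scraped name, and the
# English names must match whatever the world map actually uses (assumed here).
ZH_TO_EN = {
    '日本': 'Japan',
    '泰國': 'Thailand',
    '新加坡': 'Singapore',
}

def world_pairs(world_data, name_map):
    """Flatten world_data into (english_name, confirmed) pairs; collect unmapped names."""
    pairs, unmapped = [], []
    for region in world_data:
        for country in region['country_list']:
            english = name_map.get(country['國家'])
            if english is None:
                unmapped.append(country['國家'])
            else:
                pairs.append((english, int(country['確診'])))
    return pairs, unmapped

# Made-up sample in the shape produced by get_save_data() (numbers are placeholders).
sample = [
    {'地區': '亞洲', '確診': '30', '死亡': '0', '治癒': '1',
     'country_list': [
         {'國家': '日本', '確診': '20', '死亡': '0', '治癒': '1'},
         {'國家': '韓國', '確診': '10', '死亡': '0', '治癒': '0'},
     ]},
]

pairs, unmapped = world_pairs(sample, ZH_TO_EN)
print(pairs)
print(unmapped)
```

Returning the unmapped names separately makes it easy to see which entries still need a translation before the pairs are handed to a map as data_pair.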

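At first I just kept seeing it in the web-scraping WeChat groups: today someone is scraping DXY's data, a few days later someone else is scraping DXY's data, and all kinds of questions come up for discussion. I couldn't stand watching any longer, hence this article (I'm stuck at home with nothing to do anyway).

Then today the university announced that off-campus fourth-year students can apply for a VPN, so we can browse and download papers from CNKI at home. About to graduate, I was suddenly alarmed: I haven't started my thesis yet! Looks like it's time...

I had planned to write it back on campus, but this outbreak has hit hard. I really hope things get better soon.

Stay strong, Wuhan! Stay strong, China!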
