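Personal blog: http://www.chenjianqu.com/
Original post: http://www.chenjianqu.com/show-93.html
This post uses a crawler to scrape Weibo hot searches and the weather forecast, then emails them to myself on a schedule. Contents:
1. Scraping Weibo hot searches
2. Sending email
3. Scraping the weather forecast
4. The combined program
Scraping Weibo hot searches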
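Here I scrape with Python regular expressions. It is a primitive approach, but it works well for simple crawling tasks. First open the Weibo hot-search page: https://d.weibo.com/231650_ctg1_-_all#. Press F12 to open the developer tools, then locate the page element that holds the content you want to scrape. For the hot searches, that is <ul class="pt_ul clearfix">, whose contents include everything we need.
Next, switch to the Network tab, refresh the page, and find the file that carries the page content. After some searching, it turns out to be the Doc entry named 231650_ctg1_-_all. Look at that entry's request headers, because the code needs to send the same ones. The Python code is below: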
import re
import requests
date_url='https://d.weibo.com/231650_ctg1_-_all'
user_agent = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
header = {
'Content-Type':'application/x-www-form-urlencoded',
'User-Agent':user_agent,
'Connection': 'keep-alive',
'Host':'d.weibo.com',
'Referer':r'https://weibo.com/?category=1760',
'Sec-Fetch-Mode':'navigate',
'Sec-Fetch-Site':'same-origin',
'Sec-Fetch-User':'?1',
'Upgrade-Insecure-Requests':'1',
'Cookie':r'SINAGLOBAL=3157249405177.425.1576929340602; SCF=Al6xXQQ55-6jcuFXUVP0A6SEVlMaKwwCLiZUNjT9niWFZphUNGW7iw5NY4L42KvBQbIpbHZIIsILhHH8bZ5OnbM.; SUHB=0WGdKi-XaWA8Uj; ALF=1611383135; SUB=_2AkMpZMs-f8NxqwJRmPoVxW3rb4VwzAHEieKfODrlJRMxHRl-yT9kqn0vtRB6AuTl0ValAGtvAToNrCinxEZouvLjQMeG; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5va2LfoCEFCfOQu6BQpoCk; login_sid_t=8bc131d1a2f8aa8965871b50f73c6c2d; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; UOR=www.pythontip.com,widget.weibo.com,www.baidu.com; Apache=5219702142720.495.1580745741891; ULV=1580745741899:5:1:1:5219702142720.495.1580745741891:1579837074450; YF-Page-G0=46fe8b26d816d699836422a078175e33|1580745781|1580745767'
}
r = requests.post(url=date_url, headers=header)
raw_text = r.text

# Quotes in the response text are escaped as \", so the patterns match that form
re_s1 = r"<ul class=\\\"pt(.*?)/ul>"                # the list that holds all items
re_s2 = r"<li(.*?)/li>"                             # one item
re_pic = r"<img(.*?)>"                              # image tag
re_pic_src = r"src=(.*?)jpg"                        # image URL without the trailing "jpg"
re_sub = r"<div class=\\\"subtitle\\\">(.*?)div>"   # subtitle
re_link = r"<a target=\\\"_blank\\\"(.*?)/a>"       # anchor tag
re_link_src = r"href=(.*?) class="                  # link URL
re_key = r"#(.*?)#"                                 # keyword between two # signs

s1 = re.findall(re_s1, raw_text, re.S | re.M)
for line_s1 in s1:
    s2 = re.findall(re_s2, line_s1, re.S | re.M)
    # Each item in the list
    for line_s2 in s2:
        # Keyword
        key = re.findall(re_key, line_s2, re.S | re.M)
        print(key[0])
        # Image URL
        pic_s = re.findall(re_pic, line_s2, re.S | re.M)
        src = re.findall(re_pic_src, pic_s[0], re.S | re.M)
        print(src[0].replace('\\', '') + 'jpg')
        # Subtitle
        subtitle = re.findall(re_sub, line_s2, re.S | re.M)
        print(subtitle[0].replace('\\t', '').replace('\\n', '').replace('<\\/', ''))
        # Link to the item
        link = re.findall(re_link, line_s2, re.S | re.M)
        link_src = re.findall(re_link_src, link[0], re.S | re.M)
        print(link_src[0].replace('\\', ''))
        print('\n')
#################################################################################
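To make the escaped-quote patterns easier to follow, here is a minimal sketch that runs the same kind of non-greedy patterns on a small made-up fragment. The `sample` string and the `extract_items` helper are my own assumptions for illustration; the real page source is much longer and may be structured differently. Unlike the script, the sketch skips an item when a pattern finds nothing instead of indexing into an empty list.

```python
import re

# Hypothetical fragment written in the style the patterns expect:
# quotes appear as \" and closing tags as <\/tag>. Not real page source.
sample = r'<ul class=\"pt_ul clearfix\"><li class=\"pt_li\"><a target=\"_blank\" href=\"https:\/\/s.weibo.com\/weibo?q=%23demo%23\" class=\"pic\">#demo topic#<\/a><\/li><\/ul>'


def extract_items(text):
    """Return (keyword, link) pairs found in text."""
    items = []
    # In the regex, \\ matches one literal backslash and \" a literal quote
    for block in re.findall(r"<ul class=\\\"pt(.*?)/ul>", text, re.S):
        for li in re.findall(r"<li(.*?)/li>", block, re.S):
            keys = re.findall(r"#(.*?)#", li, re.S)
            links = re.findall(r"href=(.*?) class=", li, re.S)
            if keys and links:  # skip items where a pattern found nothing
                # Drop the escaping backslashes and the surrounding quotes
                link = links[0].replace("\\", "").strip('"')
                items.append((keys[0], link))
    return items


print(extract_items(sample))
```

Each `(.*?)` is non-greedy, so it stops at the first following delimiter rather than the last one, which is what keeps one item from swallowing the next.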
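Output of the Weibo scraping script (shown verbatim; each item lists the keyword, the image URL, the subtitle when there is one, and the link):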
遠程辦公
"https://wx4.sinaimg.cn/large/59853be1ly1gbjb12koefj206o06ogoi.jpg
"https://s.weibo.com/weibo?q=%23%E8%BF%9C%E7%A8%8B%E5%8A%9E%E5%85%AC%23"
下一站是幸福
"https://wx1.sinaimg.cn/large/0079PGXzly1gb409yn3poj30dw0dwq3r.jpg
@微博電視劇 推薦:《下一站是幸福》(原《資深少女的初戀》),講述...
"https://s.weibo.com/weibo?q=%23%E4%B8%8B%E4%B8%80%E7%AB%99%E6%98%AF%E5%B9%B8%E7%A6%8F%23"
過多睡眠不利於當前健康調整
"https://wx3.sinaimg.cn/large/6a5ce645ly1gbj96c9fgrj205q05qglj.jpg
3日,國家衛生健康委召開新聞發佈會,北京回龍觀醫院黨委書記楊甫德表...
"https://s.weibo.com/weibo?q=%23%E8%BF%87%E5%A4%9A%E7%9D%A1%E7%9C%A0%E4%B8%8D%E5%88%A9%E4%BA%8E%E5%BD%93%E5%89%8D%E5%81%A5%E5%BA%B7%E8%B0%83%E6%95%B4%23"
李蘭娟迴應疫苗進展
"https://wx4.sinaimg.cn/large/9e5389bbly1gbjaa14qwsj20c80c8t9g.jpg
2月2日凌晨,中國工程院院士、國家衛健委高級別專家組成員李蘭娟帶領...
"https://s.weibo.com/weibo?q=%23%E6%9D%8E%E5%85%B0%E5%A8%9F%E5%9B%9E%E5%BA%94%E7%96%AB%E8%8B%97%E8%BF%9B%E5%B1%95%23"
抗疫行動
"https://wx2.sinaimg.cn/large/005C79Jbly1gbjozauqc6j30dw0dw0ti.jpg
疫情讓人恐懼,也讓我們團結一心!@好友一起#手寫加油接力# 爲身邊的...
"https://s.weibo.com/weibo?q=%23%E6%8A%97%E7%96%AB%E8%A1%8C%E5%8A%A8%23"
2020最大心願
"https://wx2.sinaimg.cn/large/a716fd45ly1gbiy5n6qqrj20dw0dwmzd.jpg
2020最大心願:國泰民安! 轉發海報,一起許下2020年的願望!
"https://s.weibo.com/weibo?q=%232020%E6%9C%80%E5%A4%A7%E5%BF%83%E6%84%BF%23"
武漢最新城市宣傳片
"https://wx2.sinaimg.cn/large/7a273328ly1g7sxt0udwnj20ba0baabb.jpg
"https://s.weibo.com/weibo?q=%23%E6%AD%A6%E6%B1%89%E6%9C%80%E6%96%B0%E5%9F%8E%E5%B8%82%E5%AE%A3%E4%BC%A0%E7%89%87%23"
兒童和孕產婦是新型肺炎易感人羣
"https://wx2.sinaimg.cn/large/60718250ly1gbj8qp16a8j20bl0bl0t1.jpg
"https://s.weibo.com/weibo?q=%23%E5%84%BF%E7%AB%A5%E5%92%8C%E5%AD%95%E4%BA%A7%E5%A6%87%E6%98%AF%E6%96%B0%E5%9E%8B%E8%82%BA%E7%82%8E%E6%98%93%E6%84%9F%E4%BA%BA%E7%BE%A4%23"
福爾摩斯式破解病毒傳染迷局
"https://wx3.sinaimg.cn/large/9e5389bbly1gbjkfi69hvj20dw0dw3yx.jpg
日前,天津某百貨大樓內部相繼出現5例確診病例,從起初的3個病例來看...
"https://s.weibo.com/weibo?q=%23%E7%A6%8F%E5%B0%94%E6%91%A9%E6%96%AF%E5%BC%8F%E7%A0%B4%E8%A7%A3%E7%97%85%E6%AF%92%E4%BC%A0%E6%9F%93%E8%BF%B7%E5%B1%80%23"
寶石gem經紀人迴應
"https://wx2.sinaimg.cn/large/4b79be8bly1gbjcd3ja43j208o08o74u.jpg
"https://s.weibo.com/weibo?q=%23%E5%AE%9D%E7%9F%B3gem%E7%BB%8F%E7%BA%AA%E4%BA%BA%E5%9B%9E%E5%BA%94%23"
手寫加油接力
"https://wx1.sinaimg.cn/large/005C79Jbly1gbig4h9v7dj30dw0dwgm0.jpg
@好友 接力,手寫祝福,爲奮戰在所有一線的工作者們加油打氣,武漢加...
"https://s.weibo.com/weibo?q=%23%E6%89%8B%E5%86%99%E5%8A%A0%E6%B2%B9%E6%8E%A5%E5%8A%9B%23"
寧波一次聚餐祈福25人確診
"https://wx3.sinaimg.cn/large/6a5ce645ly1gbje76xyhkj20dw0dwwfd.jpg
2月3日,據寧波市政府新聞辦召開新聞發佈會通報:患者胡某,無湖北(...
"https://s.weibo.com/weibo?q=%23%E5%AE%81%E6%B3%A2%E4%B8%80%E6%AC%A1%E8%81%9A%E9%A4%90%E7%A5%88%E7%A6%8F25%E4%BA%BA%E7%A1%AE%E8%AF%8A%23"
北京發現41起聚集性病例
"https://wx2.sinaimg.cn/large/9e5389bbly1gbjbvb0z0wj20dw0dwgmx.jpg
今日,北京市新型冠狀病毒感染的肺炎疫情防控工作新聞發佈會介紹,截...
"https://s.weibo.com/weibo?q=%23%E5%8C%97%E4%BA%AC%E5%8F%91%E7%8E%B041%E8%B5%B7%E8%81%9A%E9%9B%86%E6%80%A7%E7%97%85%E4%BE%8B%23"
錦衣之下
"https://wx2.sinaimg.cn/large/006WpiUTly1g8pdxpnafnj30dw0dwdib.jpg
由藝能傳媒、歡瑞世紀、芒果超媒、快樂陽光出品,總導演尹濤、導演劉...
"https://s.weibo.com/weibo?q=%23%E9%94%A6%E8%A1%A3%E4%B9%8B%E4%B8%8B%23"
確診病例門把手測出病毒核酸
"https://wx4.sinaimg.cn/large/a716fd45ly1gbj1jn8ogfj206n06n3yq.jpg
日前,廣州市疾控中心在疫情監測中,在一名確診患者家中門把手上發現...
"https://s.weibo.com/weibo?q=%23%E7%A1%AE%E8%AF%8A%E7%97%85%E4%BE%8B%E9%97%A8%E6%8A%8A%E6%89%8B%E6%B5%8B%E5%87%BA%E7%97%85%E6%AF%92%E6%A0%B8%E9%85%B8%23"
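Sending email
This part follows the Python email tutorial on Runoob (菜鳥教程) directly and uses QQ Mail's SMTP service as the outgoing mail server. SMTP (Simple Mail Transfer Protocol) is a set of rules for transferring mail from a source address to a destination address, and it controls how messages are relayed. Python's smtplib is a thin wrapper around the SMTP protocol and gives a convenient way to send email. In QQ Mail, go to "Settings -> Account management -> Enable POP3/SMTP service -> Get authorization code", then use the authorization code as the login password. The resulting code is: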
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr

my_sender='[email protected]'  # sender's email account
my_pass = 'xxx'              # sender's password (the QQ Mail authorization code)
my_user='[email protected]'    # recipient's email account

def mail():
    ret=True
    try:
        msg=MIMEText('郵件內容:測試','plain','utf-8')  # body: "Email content: test"
        msg['From']=formataddr(["AlexChen",my_sender])  # sender's display name and email account
        msg['To']=formataddr(["JianquChen",my_user])    # recipient's display name and email account
        msg['Subject']="郵件測試"                        # subject line: "Email test"
        server=smtplib.SMTP_SSL("smtp.qq.com", 465)     # sender's SMTP server over SSL, port 465
        server.login(my_sender, my_pass)                # sender's account and authorization code
        server.sendmail(my_sender,[my_user,],msg.as_string())
        server.quit()  # close the connection
    except Exception:
        ret=False
    return ret

ret=mail()
if ret:
    print("Email sent successfully")
else:
    print("Failed to send email")
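As a variation on the function above, here is a rough sketch that separates building the message from sending it, so the message can be inspected without contacting a server. The helper names `build_message` and `send`, the example addresses, and the environment variable `MAIL_AUTH_CODE` are all my own assumptions, not part of the tutorial; only the smtplib and email calls are standard library.

```python
import os
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr


def build_message(sender, recipient, subject, body):
    """Build a UTF-8 plain-text message (hypothetical helper)."""
    msg = MIMEText(body, "plain", "utf-8")
    msg["From"] = formataddr(("Sender", sender))
    msg["To"] = formataddr(("Recipient", recipient))
    msg["Subject"] = subject
    return msg


def send(msg, sender, recipient, host="smtp.qq.com", port=465):
    """Send msg over SSL. Raises on failure instead of returning False."""
    # Assumption: the authorization code is stored in an environment
    # variable named MAIL_AUTH_CODE rather than in the source file.
    password = os.environ["MAIL_AUTH_CODE"]
    with smtplib.SMTP_SSL(host, port) as server:
        server.login(sender, password)
        server.sendmail(sender, [recipient], msg.as_string())


msg = build_message("a@example.com", "b@example.com", "Test", "hello")
print(msg["From"])
```

Letting `send` raise keeps the actual error message visible, which helps when the login fails because the authorization code is wrong or the SMTP service has not been enabled.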
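Correction: what the scraping code fetches does not actually seem to be the hot-search list, but that is not the point here.
Scraping the weather forecast
I reuse the code from my earlier post "Raspberry Pi Smart Home: Weather Forecast and Real-Time Temperature and Humidity Monitoring" to get the weather forecast: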
import requests
import json

def getWeather(city,date=0):
    # date: 0 = today, 1 to 4 = days ahead, -1 = yesterday
    s=''
    rb=requests.get('http://wthrcdn.etouch.cn/weather_mini?city='+city)
    #print(rb.text)
    data=json.loads(rb.text)
    if(data['status']==1000):
        d=data['data']
        if(date==0):
            s+=d['city']+'今天'+d['forecast'][0]['type']+','  # '今天' = today
            # [2:] drops the two-character prefix of the low/high fields; '到' = to
            s+=d['forecast'][0]['low'][2:]+'到'+d['forecast'][0]['high'][2:]+','
            # wind direction, then wind force with its first 8 characters dropped
            s+=d['forecast'][0]['fengxiang']+d['forecast'][0]['fengli'][8:]+','
            s+='當前室外溫度:'+d['wendu']+'度,'  # current outdoor temperature, in degrees
            s+=d['ganmao']                        # the 'ganmao' text field of the response
        elif(date>0 and date<5):
            s+=d['city']
            if(date==1):
                s+='明天'  # tomorrow
            elif(date==2):
                s+='後天'  # the day after tomorrow
            else:
                s+=d['forecast'][date]['date']
            s+=d['forecast'][date]['type']+','
            s+=d['forecast'][date]['low'][2:]+'到'+d['forecast'][date]['high'][2:]+','
            s+=d['forecast'][date]['fengxiang']+d['forecast'][date]['fengli'][8:]
        elif(date==-1):
            s+=d['city']+'昨天'+d['yesterday']['type']+','  # '昨天' = yesterday
            s+=d['yesterday']['low'][2:]+'到'+d['yesterday']['high'][2:]+','
            s+=d['yesterday']['fx']+d['yesterday']['fl'][8:]
    else:
        s='天氣請求失敗'  # weather request failed
    return s

print(getWeather("欽州市",date=0))
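The function above relies on fixed slices such as `[2:]` and `[8:]`, which break if the service changes the wording of a field. As a rough sketch of an alternative, the helper below pulls the number out of a field with a regular expression. The `sample` payload is invented for illustration: it only mirrors the keys that `getWeather` reads (`status`, `data`, `city`, `wendu`, `forecast`, `type`, `low`, `high`), and its values are not real responses from the service.

```python
import re


def temperature(text):
    """Return the first integer (possibly negative) in text, or None."""
    m = re.search(r"-?\d+", text)
    return int(m.group()) if m else None


def summarize_today(payload):
    """Format today's forecast from a payload shaped like the one getWeather reads."""
    if payload.get("status") != 1000:
        return "weather request failed"
    d = payload["data"]
    today = d["forecast"][0]
    return "{}: {}, {} to {} degrees, currently {} degrees".format(
        d["city"],
        today["type"],
        temperature(today["low"]),
        temperature(today["high"]),
        d["wendu"],
    )


# Made-up payload: same keys as in getWeather, invented values
sample = {
    "status": 1000,
    "data": {
        "city": "Qinzhou",
        "wendu": "18",
        "forecast": [{"type": "cloudy", "low": "low 12", "high": "high 20"}],
    },
}

print(summarize_today(sample))
```

In a real run, the payload would come from `json.loads(rb.text)` as in the function above, and it is worth passing a `timeout` to `requests.get` so a stalled request cannot hang the program.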
The combined program
The complete program is as follows:
# -*- coding: UTF-8 -*-
import datetime
import time
import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
import json
import re
import requests

my_sender='[email protected]'  # sender's email account
my_pass = 'xxx'              # sender's password (the QQ Mail authorization code)
my_user='[email protected]'    # recipient's email account

# Scheduled send times, each as [hour, minute]
my_times=[
    [13,57],
    [13,54]
]
def getWeather(city,date=0):
    # date: 0 = today, 1 to 4 = days ahead, -1 = yesterday
    s=''
    rb=requests.get('http://wthrcdn.etouch.cn/weather_mini?city='+city)
    #print(rb.text)
    data=json.loads(rb.text)
    if(data['status']==1000):
        d=data['data']
        if(date==0):
            s+=d['city']+'今天'+d['forecast'][0]['type']+','  # '今天' = today
            s+=d['forecast'][0]['low'][2:]+'到'+d['forecast'][0]['high'][2:]+','  # '到' = to
            s+=d['forecast'][0]['fengxiang']+d['forecast'][0]['fengli'][8:]+','
            s+='當前室外溫度:'+d['wendu']+'度,'  # current outdoor temperature, in degrees
            s+=d['ganmao']
        elif(date>0 and date<5):
            s+=d['city']
            if(date==1):
                s+='明天'  # tomorrow
            elif(date==2):
                s+='後天'  # the day after tomorrow
            else:
                s+=d['forecast'][date]['date']
            s+=d['forecast'][date]['type']+','
            s+=d['forecast'][date]['low'][2:]+'到'+d['forecast'][date]['high'][2:]+','
            s+=d['forecast'][date]['fengxiang']+d['forecast'][date]['fengli'][8:]
        elif(date==-1):
            s+=d['city']+'昨天'+d['yesterday']['type']+','  # '昨天' = yesterday
            s+=d['yesterday']['low'][2:]+'到'+d['yesterday']['high'][2:]+','
            s+=d['yesterday']['fx']+d['yesterday']['fl'][8:]
    else:
        s='天氣請求失敗'  # weather request failed
    return s+'\n'
def getWeibo():
    date_url='https://d.weibo.com/231650_ctg1_-_all'
    user_agent = r'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    header = {
        'Content-Type':'application/x-www-form-urlencoded',
        'User-Agent':user_agent,
        'Connection': 'keep-alive',
        'Host':'d.weibo.com',
        'Referer':r'https://weibo.com/?category=1760',
        'Sec-Fetch-Mode':'navigate',
        'Sec-Fetch-Site':'same-origin',
        'Sec-Fetch-User':'?1',
        'Upgrade-Insecure-Requests':'1',
        'Cookie':r'SINAGLOBAL=3157249405177.425.1576929340602; SCF=Al6xXQQ55-6jcuFXUVP0A6SEVlMaKwwCLiZUNjT9niWFZphUNGW7iw5NY4L42KvBQbIpbHZIIsILhHH8bZ5OnbM.; SUHB=0WGdKi-XaWA8Uj; ALF=1611383135; SUB=_2AkMpZMs-f8NxqwJRmPoVxW3rb4VwzAHEieKfODrlJRMxHRl-yT9kqn0vtRB6AuTl0ValAGtvAToNrCinxEZouvLjQMeG; SUBP=0033WrSXqPxfM72-Ws9jqgMF55529P9D9W5va2LfoCEFCfOQu6BQpoCk; login_sid_t=8bc131d1a2f8aa8965871b50f73c6c2d; cross_origin_proto=SSL; _s_tentry=passport.weibo.com; UOR=www.pythontip.com,widget.weibo.com,www.baidu.com; Apache=5219702142720.495.1580745741891; ULV=1580745741899:5:1:1:5219702142720.495.1580745741891:1579837074450; YF-Page-G0=46fe8b26d816d699836422a078175e33|1580745781|1580745767'
    }
    r = requests.post(url=date_url, headers=header)
    raw_text=r.text
    # Quotes in the response text are escaped as \", so the patterns match that form
    re_s1 = r"<ul class=\\\"pt(.*?)/ul>"
    re_s2=r"<li(.*?)/li>"
    re_pic=r"<img(.*?)>"
    re_pic_src=r"src=(.*?)jpg"
    re_sub=r"<div class=\\\"subtitle\\\">(.*?)div>"
    re_link=r"<a target=\\\"_blank\\\"(.*?)/a>"
    re_link_src=r"href=(.*?) class="
    re_key=r"#(.*?)#"
    s1 = re.findall(re_s1,raw_text,re.S|re.M)
    texts=''
    for line_s1 in s1:
        s2=re.findall(re_s2,line_s1,re.S|re.M)
        # Each item in the list
        for line_s2 in s2:
            # Keyword
            key=re.findall(re_key,line_s2,re.S|re.M)
            texts+='\n'+key[0]
            # Image URL (extracted, but left out of the email text)
            pic_s=re.findall(re_pic,line_s2,re.S|re.M)
            src=re.findall(re_pic_src,pic_s[0],re.S|re.M)
            #texts+='\n'+src[0]
            # Subtitle
            subtitle=re.findall(re_sub,line_s2,re.S|re.M)
            texts+='\n'+subtitle[0].replace('\\t','').replace('\\n','').replace('<\\/','')
            # Link to the item
            link=re.findall(re_link,line_s2,re.S|re.M)
            link_src=re.findall(re_link_src,link[0],re.S|re.M)
            texts+='\n'+link_src[0].replace('\\','')+'\n'
    return texts
def SendEmail():
    text='今天的天氣情況:'+getWeather('欽州市')  # "Today's weather:" for Qinzhou
    try:
        text+='\n當前的微博熱搜:'+getWeibo()  # "Current Weibo hot searches:"
    except Exception:
        text+='\n獲取微博熱搜失敗'  # "Failed to fetch Weibo hot searches"
    ret=True
    try:
        msg=MIMEText(text,'plain','utf-8')              # email body
        msg['From']=formataddr(["AlexChen",my_sender])  # sender's display name and email account
        msg['To']=formataddr(["JianquChen",my_user])    # recipient's display name and email account
        msg['Subject']="您的微博熱搜到了,請查收!"        # subject: "Your Weibo hot searches have arrived, please check!"
        server=smtplib.SMTP_SSL("smtp.qq.com", 465)     # sender's SMTP server over SSL, port 465
        server.login(my_sender, my_pass)                # sender's account and authorization code
        server.sendmail(my_sender,[my_user,],msg.as_string())  # sender, list of recipients, message text
        server.quit()  # close the connection
    except Exception:  # any failure inside the try block ends up here
        ret=False
    return ret
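if __name__=="__main__":
    while True:
        # Check whether a scheduled time has been reached
        now = datetime.datetime.now()
        for t in my_times:
            if now.hour==t[0] and now.minute==t[1]:
                ret=SendEmail()
                if(ret):
                    print('Email sent successfully')
                else:
                    print('Failed to send email')
                # Wait out the minute so the same time slot is not sent twice
                time.sleep(60)
        # Poll the clock every 20 seconds
        time.sleep(20)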
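[Figure: screenshot of the received email]
Finally, deploy the program to a server, and it will send you the Weibo hot searches and the weather at the scheduled times every day.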
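The main loop of the combined program polls the clock every 20 seconds. A possible alternative, sketched below, is to compute how long to sleep until the next scheduled time and sleep once. The function name `seconds_until_next` is my own; the `[hour, minute]` pairs follow the format of `my_times`.

```python
import datetime


def seconds_until_next(now, times):
    """Seconds from now until the nearest upcoming [hour, minute] in times."""
    candidates = []
    for hour, minute in times:
        target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if target <= now:
            # This time has already passed today, so use tomorrow's slot
            target += datetime.timedelta(days=1)
        candidates.append(target)
    return (min(candidates) - now).total_seconds()


# Example with a fixed clock value and the two times used in my_times
now = datetime.datetime(2020, 2, 3, 13, 50, 0)
print(seconds_until_next(now, [[13, 57], [13, 54]]))  # 240.0
```

The loop would then call `time.sleep(seconds_until_next(datetime.datetime.now(), my_times))` before each `SendEmail()`. On a server, a cron job that runs the script at the chosen times is another common option.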