Prometheus + Alertmanager + Grafana: A Powerful Combination

1. Introduction to Prometheus

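Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted Prometheus, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as its second hosted project, after Kubernetes.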

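Prometheus has the following features:

a multi-dimensional data model, with time series identified by a metric name and key/value pairs (labels);
PromQL, a flexible query language that takes advantage of this dimensionality;
no reliance on distributed storage; each server node works autonomously;
time series are collected with a pull model over HTTP;
time series can also be pushed through an intermediary gateway (the Pushgateway);
scrape targets are found through service discovery or static configuration;
multiple kinds of graphs and dashboards are supported.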
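To make the data model and the HTTP pull model more concrete, here is a minimal sketch of what a scrape returns and how one line maps to a metric name, a set of labels, and a value. The sample text is hypothetical (shaped like node_exporter output), and the parser is a simplification: it ignores optional timestamps and does not unescape label values.

```python
import re

# Matches lines such as: metric_name{label="value",...} 123.4
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one exposition-format line into (metric name, labels, value)."""
    match = _SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError("not a sample line: %r" % line)
    name, raw_labels, raw_value = match.groups()
    labels = dict(_LABEL_RE.findall(raw_labels or ""))
    return name, labels, float(raw_value)


# Hypothetical scrape output, as a target might serve it on /metrics.
SCRAPE = """\
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 1200.5
node_cpu_seconds_total{cpu="0",mode="user"} 300.25
up 1
"""

# Comment lines (# HELP, # TYPE) carry metadata and are skipped here.
samples = [
    parse_sample(line)
    for line in SCRAPE.splitlines()
    if line and not line.startswith("#")
]
for name, labels, value in samples:
    print(name, labels, value)
```

Each distinct combination of metric name and labels is its own time series, which is what lets PromQL filter and aggregate by label.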

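Prometheus has the following advantages:

Easy to manage: the core is a single binary with no third-party dependencies (no database, cache, and so on);
Powerful data model: all collected monitoring data is stored as metrics in the built-in time series database (TSDB);
Efficient: a monitoring system inevitably produces a large volume of data, and Prometheus handles it efficiently; a single Prometheus server instance can handle millions of metrics and process hundreds of thousands of data points per second;
Rich client libraries: with the Prometheus client libraries, users can easily add Prometheus support to their applications and expose the real internal state of their services;
Scalable: every data center and every team can run its own Prometheus server, and Prometheus supports federation, which lets several Prometheus instances form one logical cluster; when a single Prometheus server has too much work, it can be scaled out with functional sharding plus federation;
Easy to integrate: a monitoring service can be set up quickly and integrated into applications with little effort. Client SDKs currently exist for Java, JMX, Python, Go, Ruby, .NET, Node.js and other languages; with them an application can quickly be brought under Prometheus monitoring, or you can write your own collectors. The data gathered by these clients is not limited to Prometheus and can also be consumed by other monitoring tools such as Graphite.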

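2. Prometheus Architecture

The architecture diagram in the official documentation shows the following components:

(1) Prometheus Server: the core of Prometheus; based on its configuration it performs data collection, service discovery, and data storage.

(2) Service discovery: with file_sd, Prometheus watches local files for the list of targets (another tool has to update those files); it can also watch the Kubernetes API to discover services dynamically.

(3) Prometheus targets: exporters that expose a scrape endpoint, or applications that expose an endpoint in the Prometheus data format themselves.

(4) Pushgateway: a component for scenarios that require push. Monitoring data is first pushed to the Pushgateway and then pulled by the server (if the data on the Pushgateway does not change between two scrapes, the server collects the same values twice, differing only in timestamp).

(5) Alertmanager: the alerting component; it can send alerts by e-mail and to PagerDuty, HipChat, WeChat, and others.

(6) Prometheus web UI: a graphical interface that displays the collected data.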

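3. Environment Preparation

The following records in detail how Prometheus is deployed in the production environment I work with.

Hostname	Spec	OS	IP address	Role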
prometheus	8C16G	ubuntu16.04	10.13.103.151	prometheus server,grafana server
prometheus-alertmanager	8C16G	ubuntu16.04	10.13.103.152	alertmanager server

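3.1 Deploying the Prometheus server

The Prometheus server is the core of Prometheus; it collects and stores the data.

# Download the binary release and unpack it

root@prometheus:~# wget https://github.com/prometheus/prometheus/releases/download/v2.4.3/prometheus-2.4.3.linux-amd64.tar.gz

root@prometheus:~# tar -xf prometheus-2.4.3.linux-amd64.tar.gz -C /data/

# The archive unpacks to prometheus-2.4.3.linux-amd64; rename it to match the paths used below

root@prometheus:~# mv /data/prometheus-2.4.3.linux-amd64 /data/prometheus-2.4.3

root@prometheus:~# cd /data/prometheus-2.4.3/

root@prometheus:/data/prometheus-2.4.3# mkdir log

# Edit the Prometheus configuration file

root@prometheus:/data/prometheus-2.4.3# vim prometheus.yml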
# my global config
global:
  scrape_interval: 30s # Scrape targets every 30 seconds. The default is every 1 minute.
  evaluation_interval: 25s # Evaluate rules every 25 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.13.103.152:9093 # address of the Alertmanager host

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/data/prometheus-2.4.3/rules/node_down.yml" # instance liveness alert rules
  - "/data/prometheus-2.4.3/rules/memory_over.yml" # memory alert rules
  - "/data/prometheus-2.4.3/rules/disk_over.yml" # disk alert rules
  - "/data/prometheus-2.4.3/rules/cpu_over.yml" # CPU alert rules

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'GICHOST'
    file_sd_configs:
    - files: ['./host.json'] # monitored hosts; they could all be listed under static_configs, but here they are loaded from a file through file_sd_configs

# The list of monitored hosts can be written in JSON or YAML; JSON is used here. "targets" holds the ip:port of each monitored machine; "labels" is optional and can be defined however you like

root@prometheus:/data/prometheus-2.4.3# vim host.json
[
  {
    "targets": [
      "10.13.101.131:9100",
      "10.13.101.132:9100",
      "10.13.103.251:9100"
    ],
    "labels": {
      "host": "GIC_node"
    }
  },
  {
    "targets": [
      "10.13.101.10:9100",
      "10.13.101.11:9100",
      "10.13.103.22:9100"
    ],
    "labels": {
      "service": "web"
    }
  }
]
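Since file_sd expects some other tool to maintain the target file, a small generator can keep host.json in sync with an inventory. The following is only a sketch: the `INVENTORY` structure and the `build_file_sd` helper are made up for illustration, and port 9100 is assumed as the node_exporter default.

```python
import json


def build_file_sd(groups):
    """Turn {group: {"hosts": [...], "labels": {...}}} into file_sd entries."""
    entries = []
    for group in groups.values():
        port = group.get("port", 9100)  # assumed node_exporter default port
        targets = ["%s:%d" % (host, port) for host in group["hosts"]]
        entries.append({"targets": targets, "labels": group.get("labels", {})})
    return entries


# Hypothetical inventory mirroring the two groups in host.json.
INVENTORY = {
    "gic": {
        "hosts": ["10.13.101.131", "10.13.101.132", "10.13.103.251"],
        "labels": {"host": "GIC_node"},
    },
    "web": {
        "hosts": ["10.13.101.10", "10.13.101.11", "10.13.103.22"],
        "labels": {"service": "web"},
    },
}

document = json.dumps(build_file_sd(INVENTORY), indent=2)
print(document)  # write this text to host.json to update the targets
```

Prometheus watches the files named in file_sd_configs and picks up changes without a restart, so rewriting host.json is enough to add or remove hosts.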

# Configure the alert rules: here an alert is raised when CPU usage exceeds 90%, memory usage exceeds 80%, or disk usage exceeds 80%

root@prometheus:/data/prometheus-2.4.3# mkdir rules

root@prometheus:/data/prometheus-2.4.3# cd rules

root@prometheus:/data/prometheus-2.4.3/rules# touch cpu_over.yml disk_over.yml memory_over.yml node_down.yml

root@prometheus:/data/prometheus-2.4.3/rules# ls
cpu_over.yml disk_over.yml memory_over.yml node_down.yml

# CPU alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim cpu_over.yml
groups:
- name: CPU alert rules
  rules:
  - alert: NodeCPUUsage
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 90
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }}: CPU usage is above 90%! (current value: {{ $value }}%)"
      summary: "Host {{ $labels.instance }}: CPU check"
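The CPU expression is easier to follow with numbers. The sketch below reproduces its arithmetic for one host under simplifying assumptions: `irate` is modelled as the per-second increase between the last two samples (counter resets are ignored), and the sample values are invented.

```python
def irate(prev, last):
    """Per-second rate from the last two (timestamp, counter value) samples."""
    (t0, v0), (t1, v1) = prev, last
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_by_cpu):
    """Mirror 100 - avg(irate(idle)) * 100 for a single instance."""
    idle_rates = [irate(prev, last) for prev, last in idle_samples_by_cpu.values()]
    avg_idle = sum(idle_rates) / len(idle_rates)
    return 100 - avg_idle * 100


# Hypothetical idle counters for two cores, sampled 30 seconds apart.
SAMPLES = {
    "0": ((0, 1000.0), (30, 1003.0)),  # idle for 3 s of 30 s
    "1": ((0, 2000.0), (30, 2001.5)),  # idle for 1.5 s of 30 s
}

usage = cpu_usage_percent(SAMPLES)
print(round(usage, 2), usage > 90)  # the rule's condition is usage > 90
```

Because node_cpu_seconds_total counts seconds, its per-second rate in idle mode is the idle fraction of each core, so averaging over cores and subtracting from 100 gives the busy percentage.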

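# Disk alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim disk_over.yml
groups:
- name: disk alert rules
  rules:
  - alert: NodeDiskUsage
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }}: disk device {{ $labels.device }} is more than 80% full! (mount point: {{ $labels.mountpoint }}, current value: {{ $value }}%)"
      summary: "Host {{ $labels.instance }}: disk check"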

# Memory alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim memory_over.yml
groups:
- name: memory alert rules
  rules:
  - alert: NodeMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }}: memory usage is above 80%! (current value: {{ $value }}%)"
      summary: "Host {{ $labels.instance }}: memory check"
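As a quick check of the memory expression, here is the same formula with invented numbers for a 16 GiB host. Buffers and page cache are subtracted together with free memory because the kernel can reclaim them, so the result approximates memory actually in use by applications.

```python
def memory_usage_percent(total, free, buffers, cached):
    """Mirror (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100."""
    return (total - (free + buffers + cached)) / total * 100


GIB = 1024 ** 3

# Hypothetical values: 16 GiB total, 1 GiB free, 0.5 GiB buffers, 1.5 GiB cached.
usage = memory_usage_percent(
    total=16 * GIB, free=1 * GIB, buffers=0.5 * GIB, cached=1.5 * GIB
)
print(usage, usage > 80)  # the rule's condition is usage > 80
```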

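# Instance liveness alert rule
root@prometheus:/data/prometheus-2.4.3/rules# vim node_down.yml
groups:
- name: instance liveness alert rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    annotations:
      description: "Host {{ $labels.instance }} (job {{ $labels.job }}) has been down for more than 1 minute, please check!"
      summary: "Host {{ $labels.instance }}: instance liveness check"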
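The `for: 1m` clause interacts with the 25-second evaluation_interval set in prometheus.yml: an alert is first pending and only becomes firing once the condition has held for the whole duration. The simulation below is a simplified model of that behaviour, not Prometheus code; the function name and inputs are hypothetical.

```python
def first_firing_time(eval_results, eval_interval, hold):
    """Return the evaluation time (seconds) at which the alert fires, or None.

    eval_results[i] says whether the expression returned a result at
    evaluation i, which takes place at time i * eval_interval.
    """
    active_since = None
    for i, is_true in enumerate(eval_results):
        now = i * eval_interval
        if not is_true:
            active_since = None  # condition cleared: pending state is dropped
            continue
        if active_since is None:
            active_since = now  # alert becomes pending
        if now - active_since >= hold:
            return now  # held for the full duration: alert is firing
    return None


# Host stays down: evaluations at 0, 25, 50, 75, 100 seconds.
print(first_firing_time([True] * 5, eval_interval=25, hold=60))

# Host recovers briefly at t=50, so the pending timer starts over at t=75.
print(first_firing_time([True, True, False, True, True, True, True], 25, 60))
```

With these settings the alert fires at the first evaluation that is at least 60 seconds after the condition was first seen, so the effective delay is somewhat longer than one minute.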

# Run Prometheus under supervisor so that it is restarted automatically if it stops unexpectedly; a systemd unit would work as well

root@prometheus:/data/prometheus-2.4.3# apt-get install -y supervisor

root@prometheus:/data/prometheus-2.4.3# cd /etc/supervisor/conf.d/

# Configure how Prometheus is started: --config.file is the configuration file loaded at startup, --storage.tsdb.path is where the collected data is stored, and --storage.tsdb.retention is how long the data is kept

root@prometheus:/etc/supervisor/conf.d# vim prometheus.conf
[program:prometheus]
# Command used to start the program
command = /data/prometheus-2.4.3/prometheus --config.file=/data/prometheus-2.4.3/prometheus.yml --storage.tsdb.path=/data/prometheus-2.4.3/data --storage.tsdb.retention=60d
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an unexpected exit
autorestart = true
# Treat the start as successful if the process stays up for 5 seconds
startsecs = 5
# Number of start retries on failure; the default is 3
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout; the default is false
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/prometheus-2.4.3/log/out-prometheus.log
# Error log file (not used while redirect_stderr = true)
stderr_logfile=/data/prometheus-2.4.3/log/err-prometheus.log
# Maximum size of the stdout log file; the default is 50MB
stdout_logfile_maxbytes = 20MB
# Number of rotated stdout log files to keep
stdout_logfile_backups = 20

# Load the new program definition; because autostart = true, this also starts Prometheus
root@prometheus:/etc/supervisor/conf.d# supervisorctl update

root@prometheus:/etc/supervisor/conf.d# supervisorctl status

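3.2 Deploying node_exporter

The CPU, memory, and disk data that Prometheus collects above comes from node_exporter, which has to be deployed on every monitored machine.

# Download node_exporter and unpack it

root@prometheus:~# wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz

root@prometheus:~# tar -xf node_exporter-0.16.0.linux-amd64.tar.gz -C /data/

# The archive unpacks to node_exporter-0.16.0.linux-amd64; rename it to match the paths used below and create the log directory

root@prometheus:~# mv /data/node_exporter-0.16.0.linux-amd64 /data/node_exporter-0.16.0

root@prometheus:~# mkdir /data/node_exporter-0.16.0/log

# Configure supervisor to start node_exporter

root@prometheus:~# cd /etc/supervisor/conf.d/

root@prometheus:/etc/supervisor/conf.d# vim node_exporter.conf
[program:node_exporter]
# Command used to start the program
command = /data/node_exporter-0.16.0/node_exporter
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an unexpected exit
autorestart = true
# Treat the start as successful if the process stays up for 5 seconds
startsecs = 5
# Number of start retries on failure; the default is 3
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout; the default is false
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/node_exporter-0.16.0/log/out-node_exporter.log
# Error log file (not used while redirect_stderr = true)
stderr_logfile=/data/node_exporter-0.16.0/log/err-node_exporter.log
# Maximum size of the stdout log file; the default is 50MB
stdout_logfile_maxbytes = 20MB
# Number of rotated stdout log files to keep
stdout_logfile_backups = 20

# Load the new program definition; because autostart = true, this also starts node_exporter
root@prometheus:/etc/supervisor/conf.d# supervisorctl update

root@prometheus:/etc/supervisor/conf.d# supervisorctl status

At this point the monitoring data can be viewed in the built-in Prometheus web UI at http://10.13.103.151:9090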
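Besides the web UI, the same data is available from the Prometheus HTTP API at `/api/v1/query`. The sketch below builds a query URL for `up` and extracts the targets that are down from a response; the response shown is a hand-written sample in the API's instant-vector format, not real output, and `down_instances` is a hypothetical helper.

```python
from urllib.parse import urlencode


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return "%s/api/v1/query?%s" % (base_url.rstrip("/"), urlencode({"query": promql}))


def down_instances(response):
    """Return the instance label of every series in an `up` result whose value is 0."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    down = []
    for series in response["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            down.append(series["metric"].get("instance"))
    return down


# Hand-written sample response for the query `up`.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.13.101.131:9100", "job": "GICHOST"},
             "value": [1541000000.0, "1"]},
            {"metric": {"__name__": "up", "instance": "10.13.101.132:9100", "job": "GICHOST"},
             "value": [1541000000.0, "0"]},
        ],
    },
}

print(query_url("http://10.13.103.151:9090", "up"))
print(down_instances(SAMPLE_RESPONSE))
```

Fetching the URL with any HTTP client and passing the decoded JSON to `down_instances` gives a quick way to confirm that every node_exporter is being scraped.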

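3.3 Deploying the Alertmanager server

When a configured threshold is exceeded, Prometheus raises an alert and passes it to Alertmanager, which sends out the notification.

# Download Alertmanager and unpack it

root@prometheus-alertmanager:~# wget https://github.com/prometheus/alertmanager/releases/download/v0.15.1/alertmanager-0.15.1.linux-amd64.tar.gz

root@prometheus-alertmanager:~# tar -xf alertmanager-0.15.1.linux-amd64.tar.gz -C /data

# The archive unpacks to alertmanager-0.15.1.linux-amd64; rename it to match the paths used below

root@prometheus-alertmanager:~# mv /data/alertmanager-0.15.1.linux-amd64 /data/alertmanager-0.15.1

root@prometheus-alertmanager:~# cd /data/alertmanager-0.15.1/

root@prometheus-alertmanager:/data/alertmanager-0.15.1# mkdir log

# Edit the Alertmanager configuration file

root@prometheus-alertmanager:/data/alertmanager-0.15.1# vim alertmanager.yml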
global:
  # The smarthost and SMTP sender used for mail notifications.
  # Set these according to your actual mail account and password.
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_from: 'XXXXXX'
  smtp_auth_username: 'XXXXXX'
  smtp_auth_password: 'XXXXXX'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeChat API endpoint

# The directory from which notification templates are read.
templates:
- '/data/alertmanager-0.15.1/template/*.tmpl' # templates for the messages we receive

# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']

  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s

  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m

  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 12h

  # A default receiver
  receiver: default

receivers:
- name: 'default'
  email_configs:
  - to: 'appops.capitalonline.net'
    # headers: { Subject: "Alertmanager alert e-mail" }
  wechat_configs: # WeChat account details for receiving alerts
  - corp_id: 'XXXXXX'
    send_resolved: true
    to_user: '@all'
    # to_party: '2'
    agent_id: '1000003'
    api_secret: 'XXXXXX'
# The default WeChat message format is rather messy, so a template is defined for WeChat here; e-mail keeps the default format

root@prometheus-alertmanager:/data/alertmanager-0.15.1# mkdir template

root@prometheus-alertmanager:/data/alertmanager-0.15.1# cd template/

root@prometheus-alertmanager:/data/alertmanager-0.15.1/template# vim wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
**********start**********
[Alert source]: alertmanager
[Alert type]: {{ .Labels.alertname }}
[Affected host]: {{ .Labels.instance }}
[Summary]: {{ .Annotations.summary }}
[Details]: {{ .Annotations.description }}
[Triggered at]: {{ .StartsAt }}
**********end**********
{{ end }}
{{ end }}

# Configure supervisor to start Alertmanager (install it on this host first with apt-get install -y supervisor, as in 3.1)

root@prometheus-alertmanager:/data/alertmanager-0.15.1/template# cd /etc/supervisor/conf.d/

root@prometheus-alertmanager:/etc/supervisor/conf.d# vim alertmanager.conf
[program:alertmanager]
# Command used to start the program
command = /data/alertmanager-0.15.1/alertmanager --config.file=/data/alertmanager-0.15.1/alertmanager.yml --storage.path=/data/alertmanager-0.15.1/data/
# Start automatically when supervisord starts
autostart = true
# Restart automatically after an unexpected exit
autorestart = true
# Treat the start as successful if the process stays up for 5 seconds
startsecs = 5
# Number of start retries on failure; the default is 3
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout; the default is false
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/alertmanager-0.15.1/log/out-alertmanager.log
# Error log file (not used while redirect_stderr = true)
stderr_logfile=/data/alertmanager-0.15.1/log/err-alertmanager.log
# Maximum size of the stdout log file; the default is 50MB
stdout_logfile_maxbytes = 20MB
# Number of rotated stdout log files to keep
stdout_logfile_backups = 20

# Load the new program definition; because autostart = true, this also starts Alertmanager
root@prometheus-alertmanager:/etc/supervisor/conf.d# supervisorctl update

root@prometheus-alertmanager:/etc/supervisor/conf.d# supervisorctl status

3.4 Deploying the Grafana server

The built-in Prometheus web UI is fairly basic, so Grafana is used together with Prometheus to display the collected data.

root@prometheus:~# curl https://packagecloud.io/gpg.key | sudo apt-key add -

root@prometheus:~# wget https://packagecloud.io/grafana/stable/debian/pool/stretch/main/g/grafana/grafana_5.3.4_amd64.deb

root@prometheus:~# apt-get install -y adduser libfontconfig

root@prometheus:~# dpkg -i grafana_5.3.4_amd64.deb

root@prometheus:~# systemctl start grafana-server.service

root@prometheus:~# systemctl enable grafana-server.service

root@prometheus:~# grafana-server -v

Log in to the Grafana web interface at http://10.13.103.151:3000 and add a data source and dashboards. Grafana provides many ready-made dashboard templates that you can download and import as needed, and you can also write your own dashboards to fit your requirements.
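Adding the data source can also be scripted through Grafana's HTTP API (`POST /api/datasources`) instead of clicking through the UI. The sketch below only builds the request and does not send it; the credentials `admin`/`admin` are Grafana's initial defaults and are an assumption here, as is pointing the data source at `http://localhost:9090` because Grafana and Prometheus share a host in this setup.

```python
import base64
import json
from urllib.request import Request


def datasource_request(grafana_url, prometheus_url, user, password):
    """Build (but do not send) the request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend queries Prometheus on behalf of the browser
        "isDefault": True,
    }
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    return Request(
        grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )


# Assumed values: default admin credentials and a local Prometheus.
req = datasource_request(
    "http://10.13.103.151:3000", "http://localhost:9090", "admin", "admin"
)
print(req.get_method(), req.full_url)
print(req.data.decode())
# Passing req to urllib.request.urlopen would actually send it.
```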

References:

https://prometheus.io/docs/introduction/overview/

https://github.com/prometheus
