Prometheus + Alertmanager + Grafana + node-exporter: Installation and Usage

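Overview

Prometheus is an open-source systems monitoring and alerting toolkit with a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second hosted project after Kubernetes. On August 9, 2018, the CNCF announced that Prometheus had moved from incubation to graduated status.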

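Key features of Prometheus:

A multi-dimensional data model in which time series are identified by a metric name and key/value pairs (labels)
PromQL, a flexible query language
No reliance on distributed storage; single server nodes are autonomous
Time series are collected with a pull model over HTTP
Pushing time series is supported through an intermediary gateway
Targets are discovered through service discovery or static configuration
Multiple modes of graphing and dashboarding support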
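
As a rough illustration of the data model in the first feature, the sketch below renders a series identity from a metric name and its labels. The helper name series_id is an assumption made for this example and is not part of Prometheus; the label values are borrowed from the jobs configured later in this post.

```python
from typing import Mapping


def series_id(metric: str, labels: Mapping[str, str]) -> str:
    """Render a series identity as metric{key="value",...} with sorted keys."""
    if not labels:
        return metric
    pairs = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{pairs}}}"


# Same metric name, different label values: two distinct time series.
first = series_id("up", {"job": "node1", "instance": "node1"})
second = series_id("up", {"job": "node2", "instance": "node2"})
print(first)   # up{instance="node1",job="node1"}
print(second)  # up{instance="node2",job="node2"}
```

Each distinct combination of metric name and label values is stored as its own time series, which is what lets PromQL filter and aggregate along any label.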

The figure below shows the architecture of Prometheus and its ecosystem.

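Prometheus Server: scrapes and stores time series data.
Client Library: instruments the service to be monitored, generating its metrics and exposing them to the Prometheus Server. When the Prometheus Server pulls, the library returns the current state of the metrics.
Push Gateway: mainly for short-lived jobs. Such jobs may exit before Prometheus gets a chance to pull from them, so they push their metrics to the Push Gateway instead, and the Prometheus Server then scrapes the gateway. This approach is intended for service-level metrics; for machine-level metrics, use node exporter.
Exporters: expose metrics from existing third-party services to Prometheus.
Alertmanager: receives alerts from the Prometheus Server, deduplicates and groups them, routes them to the configured receiver, and sends the notification. Common receivers include email, PagerDuty, OpsGenie, and webhooks.
Web UI: Prometheus ships with a simple built-in web console for querying metrics and viewing configuration, service discovery, and so on. In practice, metrics are usually viewed and dashboards built in Grafana, with Prometheus as a Grafana data source.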
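
To make the pull model a little more concrete, here is a minimal sketch of what a client library does at its core: keep a value in memory and return it in the Prometheus text exposition format whenever /metrics is requested. It uses only the Python standard library rather than the official client library; the metric name demo_requests_total, the port 8000, and the function names are assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# In-memory counter; a real client library manages many metrics like this.
STATE = {"requests_total": 0}


def render_metrics(requests_total: int) -> str:
    """Return one counter in the Prometheus text exposition format."""
    lines = [
        "# HELP demo_requests_total Scrapes served by this demo process.",
        "# TYPE demo_requests_total counter",
        f"demo_requests_total {requests_total}",
    ]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        STATE["requests_total"] += 1
        body = render_metrics(STATE["requests_total"]).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block and answer scrapes on the given port (hypothetical port choice)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling serve() and adding the host and port as a static target would let Prometheus scrape this process like any other endpoint; in real services the official client libraries are the usual choice.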

Installation environment

IP	Hostname	Installed software
192.168.1.69	prometheus-node1	prometheus, node-exporter, grafana
192.168.1.70	prometheus-node2	node-exporter, alertmanager

Installing Prometheus

Download https://github.com/prometheus/prometheus/releases/download/v2.16.0/prometheus-2.16.0.linux-amd64.tar.gz, upload prometheus-2.16.0.linux-amd64.tar.gz to the prometheus-node1 node, and install Prometheus.

tar xzvf prometheus-2.16.0.linux-amd64.tar.gz
mkdir /usr/local/prometheus
mv prometheus-2.16.0.linux-amd64 /usr/local/prometheus/prometheus
cd /usr/local/prometheus/prometheus

Check the version.

./prometheus --version
prometheus, version 2.16.0 (branch: HEAD, revision: b90be6f32a33c03163d700e1452b54454ddce0ec)
  build user:       root@7ea0ae865f12
  build date:       20200213-23:50:02
  go version:       go1.13.8

Start the Prometheus process.

./prometheus
level=info ts=2020-03-23T15:23:27.799Z caller=main.go:295 msg="no time or size retention was set so using the default time retention" duration=15d
level=info ts=2020-03-23T15:23:27.799Z caller=main.go:331 msg="Starting Prometheus" version="(version=2.16.0, branch=HEAD, revision=b90be6f32a33c03163d700e1452b54454ddce0ec)"
level=info ts=2020-03-23T15:23:27.799Z caller=main.go:332 build_context="(go=go1.13.8, user=root@7ea0ae865f12, date=20200213-23:50:02)"
level=info ts=2020-03-23T15:23:27.799Z caller=main.go:333 host_details="(Linux 3.10.0-1062.18.1.el7.x86_64 #1 SMP Tue Mar 17 23:49:17 UTC 2020 x86_64 prometheus (none))"
level=info ts=2020-03-23T15:23:27.799Z caller=main.go:334 fd_limits="(soft=1024, hard=4096)"
level=info ts=2020-03-23T15:23:27.800Z caller=main.go:335 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2020-03-23T15:23:27.809Z caller=main.go:661 msg="Starting TSDB ..."
level=info ts=2020-03-23T15:23:27.811Z caller=web.go:508 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2020-03-23T15:23:27.817Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1584947587486 maxt=1584950400000 ulid=01E43E3D9ECWQY6T5HZ04R15PX
level=info ts=2020-03-23T15:23:27.818Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1584950400000 maxt=1584957600000 ulid=01E43GSF6T3PQRD2BN4XFHN52Y
level=info ts=2020-03-23T15:23:27.819Z caller=repair.go:59 component=tsdb msg="found healthy block" mint=1584957600000 maxt=1584964800000 ulid=01E43QN6EDJH3J5CP4F00MY8DN
level=info ts=2020-03-23T15:23:27.852Z caller=head.go:577 component=tsdb msg="replaying WAL, this may take awhile"
level=info ts=2020-03-23T15:23:27.966Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=0 maxSegment=7
level=info ts=2020-03-23T15:23:27.987Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=1 maxSegment=7
level=info ts=2020-03-23T15:23:28.100Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=2 maxSegment=7
level=info ts=2020-03-23T15:23:28.192Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=3 maxSegment=7
level=info ts=2020-03-23T15:23:28.193Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=4 maxSegment=7
level=info ts=2020-03-23T15:23:28.195Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=5 maxSegment=7
level=info ts=2020-03-23T15:23:28.195Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=6 maxSegment=7
level=info ts=2020-03-23T15:23:28.195Z caller=head.go:625 component=tsdb msg="WAL segment loaded" segment=7 maxSegment=7
level=info ts=2020-03-23T15:23:28.198Z caller=main.go:676 fs_type=XFS_SUPER_MAGIC
level=info ts=2020-03-23T15:23:28.198Z caller=main.go:677 msg="TSDB started"
level=info ts=2020-03-23T15:23:28.198Z caller=main.go:747 msg="Loading configuration file" filename=prometheus.yml
level=info ts=2020-03-23T15:23:28.204Z caller=main.go:775 msg="Completed loading of configuration file" filename=prometheus.yml
level=info ts=2020-03-23T15:23:28.205Z caller=main.go:630 msg="Server is ready to receive web requests."

Open Prometheus at http://192.168.1.69:9090/.

Prometheus's own self-monitoring metrics are available at http://192.168.1.69:9090/metrics.

Press Ctrl+C to stop the process.

^Clevel=warn ts=2020-03-23T15:23:30.646Z caller=main.go:507 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2020-03-23T15:23:30.646Z caller=main.go:530 msg="Stopping scrape discovery manager..."
level=info ts=2020-03-23T15:23:30.646Z caller=main.go:544 msg="Stopping notify discovery manager..."
level=info ts=2020-03-23T15:23:30.646Z caller=main.go:566 msg="Stopping scrape manager..."
level=info ts=2020-03-23T15:23:30.647Z caller=manager.go:845 component="rule manager" msg="Stopping rule manager..."
level=info ts=2020-03-23T15:23:30.647Z caller=manager.go:851 component="rule manager" msg="Rule manager stopped"
level=info ts=2020-03-23T15:23:30.647Z caller=main.go:526 msg="Scrape discovery manager stopped"
level=info ts=2020-03-23T15:23:30.647Z caller=main.go:540 msg="Notify discovery manager stopped"
level=info ts=2020-03-23T15:23:30.647Z caller=main.go:560 msg="Scrape manager stopped"
level=info ts=2020-03-23T15:23:30.648Z caller=notifier.go:598 component=notifier msg="Stopping notification manager..."
level=info ts=2020-03-23T15:23:30.648Z caller=main.go:731 msg="Notifier manager stopped"
level=info ts=2020-03-23T15:23:30.649Z caller=main.go:743 msg="See you next time!"

Set it up as a service that starts on boot.

vi /etc/systemd/system/prometheus.service
[Unit]
Description=prometheus
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/prometheus/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus/prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl start prometheus
systemctl enable prometheus

Installing node_exporter

Download https://github.com/prometheus/node_exporter/releases/download/v0.18.1/node_exporter-0.18.1.linux-amd64.tar.gz, then upload node_exporter-0.18.1.linux-amd64.tar.gz and install node_exporter. Note that it must be installed on both the prometheus-node1 and prometheus-node2 nodes.

tar zxvf node_exporter-0.18.1.linux-amd64.tar.gz
mkdir -p /usr/local/prometheus
mv node_exporter-0.18.1.linux-amd64 /usr/local/prometheus/node_exporter

Check the node_exporter version.

/usr/local/prometheus/node_exporter/node_exporter --version
node_exporter, version 0.18.1 (branch: HEAD, revision: 3db77732e925c08f675d7404a8c46466b2ece83e)
  build user:       root@b50852a1acba
  build date:       20190604-16:41:18
  go version:       go1.12.5

Set it up as a service that starts on boot.

vi /etc/systemd/system/node_exporter.service
[Unit]
Description=node_exporter
Documentation=https://github.com/prometheus/node_exporter
After=network.target
 
[Service]
Type=simple
User=root
ExecStart=/usr/local/prometheus/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl start node_exporter.service
systemctl status node_exporter.service
systemctl enable node_exporter.service

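Monitoring metrics are available at http://192.168.1.69:9100/metrics.

Configuring Prometheus and adding monitoring targets

The scrape_configs block controls which resources Prometheus monitors. Because Prometheus also exposes its own data as an HTTP endpoint, it can scrape and monitor its own health. The default configuration contains a single job named prometheus, which scrapes the time series data exposed by the Prometheus server. The job has one statically configured target, localhost on port 9090 (changed here to the host's own address, 192.168.1.69), so metrics are scraped from http://192.168.1.69:9090/metrics.

node-exporter is installed on both prometheus-node1 and prometheus-node2, so a job is configured for each.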

vi /usr/local/prometheus/prometheus/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.1.69:9090']
      labels:
        instance: prometheus

  - job_name: 'node1'
    static_configs:
    - targets: ['192.168.1.69:9100']
      labels:
        instance: node1

  - job_name: 'node2'
    static_configs:
    - targets: ['192.168.1.70:9100']
      labels:
        instance: node2

Open http://192.168.1.69:9090/targets in Prometheus to see the configured targets.
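
The same target information can also be read from the Prometheus HTTP API at /api/v1/targets. The following is a minimal sketch using only the Python standard library; the function names are made up for this example, and SAMPLE is a hand-written payload trimmed to the two fields used here, not captured output.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://192.168.1.69:9090"):
    """GET /api/v1/targets and return the decoded JSON document."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


def target_health(payload):
    """Map each active target's scrape URL to its reported health."""
    return {t["scrapeUrl"]: t["health"] for t in payload["data"]["activeTargets"]}


# Hand-written sample shaped like the API response (illustrative values only).
SAMPLE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://192.168.1.69:9090/metrics", "health": "up"},
            {"scrapeUrl": "http://192.168.1.69:9100/metrics", "health": "up"},
            {"scrapeUrl": "http://192.168.1.70:9100/metrics", "health": "up"},
        ]
    },
}

for url, health in target_health(SAMPLE).items():
    print(health, url)
```

Against the running server, target_health(fetch_targets()) would produce the same kind of mapping from live data.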

Installing Alertmanager

Download https://github.com/prometheus/alertmanager/releases/download/v0.20.0/alertmanager-0.20.0.linux-amd64.tar.gz, upload alertmanager-0.20.0.linux-amd64.tar.gz to prometheus-node2, and install Alertmanager.

tar zxvf alertmanager-0.20.0.linux-amd64.tar.gz
mv alertmanager-0.20.0.linux-amd64 /usr/local/prometheus/alertmanager

Check the Alertmanager version.

/usr/local/prometheus/alertmanager/alertmanager --version
alertmanager, version 0.20.0 (branch: HEAD, revision: f74be0400a6243d10bb53812d6fa408ad71ff32d)
  build user:       root@00c3106655f8
  build date:       20191211-14:13:14
  go version:       go1.13.5

Set it up as a service that starts on boot.

vi /etc/systemd/system/alertmanager.service
[Unit]
Description=Alertmanager
After=network.target

[Service]
Type=simple
User=root
ExecStart=/usr/local/prometheus/alertmanager/alertmanager --config.file=/usr/local/prometheus/alertmanager/alertmanager.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

systemctl start alertmanager.service
systemctl status alertmanager.service
systemctl enable alertmanager.service

Open Alertmanager at http://192.168.1.70:9093/.

Edit prometheus.yml so that Prometheus sends its alerts to this Alertmanager.

vi /usr/local/prometheus/prometheus/prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["192.168.1.70:9093"]

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
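  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.1.69:9090']
      labels:
        instance: prometheus

  - job_name: 'node1'
    static_configs:
    - targets: ['192.168.1.69:9100']
      labels:
        instance: node1

  - job_name: 'node2'
    static_configs:
    - targets: ['192.168.1.70:9100']
      labels:
        instance: node2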

systemctl stop prometheus
systemctl start prometheus

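Check http://192.168.1.69:9090/config to confirm the configuration has taken effect.

Configuring email alerts

Add the alert rule files to prometheus.yml through rule_files.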

vi /usr/local/prometheus/prometheus/prometheus.yml
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.

alerting:
  alertmanagers:
  - static_configs:
    - targets: ["192.168.1.70:9093"]

rule_files:
  - /usr/local/prometheus/prometheus/rules/*.rules

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['192.168.1.69:9090']
      labels:
        instance: prometheus

  - job_name: 'node1'
    static_configs:
    - targets: ['192.168.1.69:9100']
      labels:
        instance: node1

  - job_name: 'node2'
    static_configs:
    - targets: ['192.168.1.70:9100']
      labels:
        instance: node2

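Configure the alert rule (create the rules directory first if it does not exist). The expression "up == 0" means the service is down.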

vi /usr/local/prometheus/prometheus/rules/service_down.rules
groups:
- name: ServiceStatus
  rules:
  - alert: ServiceStatusAlert
    expr: up == 0
    for: 1m
    labels:
      project: APP
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."

Configure Alertmanager.

vi /usr/local/prometheus/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.sina.cn:25'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'yyyyyyy'

templates:
  - 'template/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 60s
  repeat_interval: 1h
  receiver: 'email'

receivers:
  - name: 'email'
    email_configs:
    - to: '[email protected]'
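
The route block above groups alerts with group_by: ['alertname'] before notifying the receiver. As a simplified sketch of that idea (not Alertmanager's actual implementation), alerts that share the same values for the group_by labels end up in one group and therefore in one notification. The function name group_alerts and the alert dictionaries are assumptions for illustration.

```python
from collections import defaultdict


def group_alerts(alerts, group_by=("alertname",)):
    """Bucket alerts by the values of the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Two instances down at once: both alerts carry the same alertname.
alerts = [
    {"labels": {"alertname": "ServiceStatusAlert", "instance": "node1"}},
    {"labels": {"alertname": "ServiceStatusAlert", "instance": "node2"}},
]

by_name = group_alerts(alerts)
by_name_and_instance = group_alerts(alerts, group_by=("alertname", "instance"))
print(len(by_name))               # 1 group, so one email covering both instances
print(len(by_name_and_instance))  # 2 groups, so one email per instance
```

Adding instance to group_by would trade fewer, larger notifications for one notification per host.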

Restart the Prometheus service (on prometheus-node1) and the Alertmanager service (on prometheus-node2).

systemctl stop prometheus
systemctl start prometheus

systemctl stop alertmanager
systemctl start alertmanager

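Query the up status of the services: all three monitored services are up (up == 1 means healthy, up == 0 means down).

The alert configuration in Prometheus shows that ServiceStatusAlert is not active.

Now stop the node_exporter service on the prometheus-node2 node.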

systemctl stop node_exporter

Query the up status again: the {instance="node2",job="node2"} service is now down.
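
The up query can also be issued outside the web UI through the instant-query endpoint /api/v1/query. The sketch below, using only the Python standard library, builds the query URL and picks out the series whose value is 0. The function names are hypothetical, and SAMPLE is a hand-written payload shaped like an instant-vector response, not captured output.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(expr, base_url="http://192.168.1.69:9090"):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


def instant_query(expr, base_url="http://192.168.1.69:9090"):
    """Run the query against a live server and return the decoded JSON."""
    with urlopen(query_url(expr, base_url), timeout=5) as resp:
        return json.load(resp)


def down_targets(payload):
    """Return (job, instance) pairs whose up sample equals 0."""
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # the value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels["job"], labels["instance"]))
    return down


# Hand-written sample for the situation described above (illustrative values).
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "job": "prometheus", "instance": "prometheus"}, "value": [1584977000, "1"]},
            {"metric": {"__name__": "up", "job": "node1", "instance": "node1"}, "value": [1584977000, "1"]},
            {"metric": {"__name__": "up", "job": "node2", "instance": "node2"}, "value": [1584977000, "0"]},
        ],
    },
}

print(down_targets(SAMPLE))  # [('node2', 'node2')]
```

On the live setup, down_targets(instant_query("up")) would report the same pair while node_exporter on prometheus-node2 is stopped.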

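After the node_exporter service stops, the next scrape of that target fails (scrape_interval is 15s) and up becomes 0. Prometheus evaluates the alert rules once every evaluation_interval (15s); when it finds that the alert expression up == 0 is true, it first moves ServiceStatusAlert to the Pending state.

The for clause then applies. At each subsequent evaluation, if the alert expression is still true, Prometheus checks how long it has been true against the for duration (1m). If the duration has not yet elapsed, it waits for the next evaluation; once it has, the alert transitions to Firing, and a notification is generated and pushed to Alertmanager.

If the alert expression is no longer true at the next evaluation, Prometheus changes ServiceStatusAlert from Pending back to Inactive.

The Pending-to-Firing transition makes alerts more meaningful and keeps them from flapping. An alert without a for clause goes straight from Inactive to Firing and needs only one evaluation cycle to trigger. An alert with a for clause first moves to Pending and then to Firing, so it needs at least two evaluation cycles to trigger.

An alert can be in one of three states:

Inactive: the alert is not active.
Pending: the alert expression is true, but the duration specified in the for clause has not yet elapsed.
Firing: the alert expression is true and has stayed true for at least the for duration.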
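
The state transitions described above can be sketched as a small simulation. This is a simplified model written for this post, not Prometheus's own code: the function name alert_states is hypothetical, each list element stands for one rule evaluation, and the defaults mirror the 15s evaluation interval and the for: 1m clause used here.

```python
def alert_states(expr_results, interval_s=15, for_s=60):
    """Return the alert state after each evaluation.

    expr_results: one boolean per evaluation, True when the alert
    expression (for example up == 0) returned a result.
    """
    states = []
    active_since = None  # time at which the expression first became true
    for i, is_true in enumerate(expr_results):
        now = i * interval_s
        if not is_true:
            active_since = None
            states.append("Inactive")
            continue
        if active_since is None:
            active_since = now
        if now - active_since >= for_s:
            states.append("Firing")
        else:
            states.append("Pending")
    return states


# Target goes down after the first evaluation and stays down.
print(alert_states([False, True, True, True, True, True]))
# ['Inactive', 'Pending', 'Pending', 'Pending', 'Pending', 'Firing']

# A brief blip resets the timer instead of firing.
print(alert_states([True, False, True]))
# ['Pending', 'Inactive', 'Pending']

# Without a for clause the alert fires on the first true evaluation.
print(alert_states([True], for_s=0))
# ['Firing']
```

In this model, with a 15s interval and for: 1m, the alert stays Pending for four evaluations and fires on the fifth consecutive true result.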

The mailbox shows that the alert email has arrived.

The alert is also visible in Alertmanager at http://192.168.1.70:9093/#/alerts.

Installing Grafana

Download https://dl.grafana.com/oss/release/grafana-6.7.1-1.x86_64.rpm, upload grafana-6.7.1-1.x86_64.rpm to the prometheus-node1 node, and install Grafana.

yum install -y grafana-6.7.1-1.x86_64.rpm
systemctl start grafana-server.service
systemctl status grafana-server.service
systemctl enable grafana-server.service

Check the version (the RPM package installs the binary as /usr/sbin/grafana-server).

grafana-server -v

Open Grafana at http://192.168.1.69:3000/ and log in with admin/admin.

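Importing a prebuilt dashboard

Click "Add data source" and select Prometheus as the data source.

Enter http://192.168.1.69:9090 in the URL field and click the "Save & Test" button. If the green "Data source is working" message shown in the figure below appears, the configuration is valid.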
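
The data source can also be created without the UI through Grafana's HTTP API (POST /api/datasources). The sketch below only builds the request, using the Python standard library and the default admin/admin login mentioned above; the function names and the data source name "Prometheus" are assumptions for illustration, and the exact fields your Grafana version accepts should be checked against its API documentation.

```python
import base64
import json
from urllib.request import Request, urlopen


def datasource_payload(name="Prometheus", url="http://192.168.1.69:9090"):
    """JSON body describing a Prometheus data source proxied by Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": url,
        "access": "proxy",
        "isDefault": True,
    }


def build_request(grafana_url="http://192.168.1.69:3000", user="admin", password="admin"):
    """Build, but do not send, the POST request with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(datasource_payload()).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


request = build_request()
print(request.get_method(), request.full_url)
# To actually create the data source on a reachable Grafana:
#     urlopen(request, timeout=5)
```

Sending the request is left commented out so that the snippet can be run safely on a machine that cannot reach Grafana.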

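Next we need a dashboard. We can pick one from Grafana's official list of prebuilt dashboards at https://grafana.com/grafana/dashboards. Grafana offers prebuilt dashboards for many different data sources, and they can be used directly without building a dashboard from scratch. We choose the first one and download it; the file name is 1-node-exporter-for-prometheus-dashboard-update-1102_rev11.json.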