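Prometheus Monitoring - The Alertmanager Alerting Module

Alertmanager and Prometheus are two separate components. The Prometheus server sends alerts to Alertmanager according to its alerting rules. Alertmanager then manages those alerts, including silencing, inhibition, and aggregation, and sends notifications through channels such as email, PagerDuty, and HipChat.

The main steps for setting up alerting and notifications are:
(1) Install and configure Alertmanager
(2) Configure Prometheus to talk to Alertmanager with the -alertmanager.url flag
(3) Create alerting rules in Prometheus

Alertmanager Overview and Mechanisms
Alertmanager handles alerts sent by client applications such as the Prometheus server. It deduplicates and groups them, then routes them to the correct receiver, such as email or Slack. Alertmanager also supports silencing and inhibition of alerts.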

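Grouping
Grouping combines alerts of a similar nature into a single notification. This is especially useful during larger outages, when many systems fail at once and hundreds or thousands of alerts may fire simultaneously.

For example, suppose dozens or hundreds of service instances are running when a network failure occurs, and half of the instances can no longer reach the database. If the alerting rules are configured to send an alert for every service instance, hundreds of alerts are sent to Alertmanager.

As a user, you only want to receive a single page while still being able to see exactly which instances are affected. Alertmanager can therefore be configured to group the alerts and send one compact notification.

The grouping of alerts, the timing of grouped notifications, and the receivers of those notifications are configured through the routing tree in the configuration file.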
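The grouping idea can be sketched in a few lines of Python. This is only an illustration, not Alertmanager's implementation: the function name group_alerts, the sample labels, and the choice to treat a missing label as an empty string are all assumptions made for the example.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the labels listed in group_by."""
    groups = defaultdict(list)
    for alert in alerts:
        # Assumption of this sketch: a missing label counts as an empty string.
        key = tuple(alert.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts raised by individual service instances.
alerts = [
    {"alertname": "DatabaseUnreachable", "cluster": "eu1", "instance": "web-1"},
    {"alertname": "DatabaseUnreachable", "cluster": "eu1", "instance": "web-2"},
    {"alertname": "DatabaseUnreachable", "cluster": "us1", "instance": "web-9"},
]

grouped = group_alerts(alerts, ["cluster", "alertname"])
for key, members in grouped.items():
    instances = ", ".join(a["instance"] for a in members)
    print(f"{key}: {len(members)} alert(s) -> {instances}")
```

Each key would correspond to one notification that lists all affected instances, instead of one notification per instance.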

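Inhibition
Inhibition is a mechanism that suppresses notifications for certain alerts when certain other alerts are already firing.

For example, when an alert fires to report that an entire cluster is unreachable, Alertmanager can be configured to mute all other alerts concerning that cluster. This prevents notifications for hundreds or thousands of firing alerts that are unrelated to the actual problem.

Inhibition is configured through the Alertmanager configuration file.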

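Silences
Silences are a straightforward way to mute alerts for a given period of time. A silence is configured with matchers, just like the routing tree. Incoming alerts are checked against the equality and regular-expression matchers of each active silence. If an alert matches, no notification is sent for it.

Silences are configured in the Alertmanager web interface.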
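A minimal Python sketch of how a silence's matchers could be checked against an alert's labels. The matcher dictionaries (name, value, is_regex) and the function name are assumptions for illustration only, and the sketch ignores the silence's time window.

```python
import re


def matches_silence(labels, matchers):
    """Return True if the alert labels satisfy every matcher of a silence."""
    for m in matchers:
        value = labels.get(m["name"], "")
        if m.get("is_regex"):
            # The pattern must match the whole label value.
            if re.fullmatch(m["value"], value) is None:
                return False
        elif value != m["value"]:
            return False
    return True


# Hypothetical silence: mute InstanceDown for every web-N instance.
silence = [
    {"name": "alertname", "value": "InstanceDown"},
    {"name": "instance", "value": r"web-\d+", "is_regex": True},
]

muted = matches_silence({"alertname": "InstanceDown", "instance": "web-3"}, silence)
not_muted = matches_silence({"alertname": "InstanceDown", "instance": "db-1"}, silence)
print(muted, not_muted)
```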

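Configuring Alertmanager
Alertmanager is configured through command-line flags and a configuration file. The command-line flags set immutable system parameters, while the configuration file defines inhibition rules, notification routing, and notification receivers.

To see all available command-line flags, run alertmanager -h.

Alertmanager can reload its configuration at runtime. If the new configuration is not well-formed, the changes are not applied and an error is logged.

Configuration File
To specify which configuration file to load, use the -config.file flag. The file is written in YAML and follows the scheme described below. Parameters in brackets are optional. For non-list parameters, the value is set to the specified default.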

global:
  # ResolveTimeout is the time after which an alert is declared resolved
  # if it has not been updated.
  [ resolve_timeout: <duration> | default = 5m ]

  # The default SMTP From header field.
  [ smtp_from: <tmpl_string> ]
  # The default SMTP smarthost used for sending emails.
  [ smtp_smarthost: <string> ]

  # The API URL to use for Slack notifications.
  [ slack_api_url: <string> ]

  [ pagerduty_url: <string> | default = "https://events.pagerduty.com/generic/2010-04-15/create_event.json" ]
  [ opsgenie_api_host: <string> | default = "https://api.opsgenie.com/" ]

# Files from which custom notification template definitions are read.
# The last component may use a wildcard matcher, e.g. 'templates/*.tmpl'.
templates:
  [ - <filepath> ... ]

# The root node of the routing tree.
route: <route>

# A list of notification receivers.
receivers:
  - <receiver> ...

# A list of inhibition rules.
inhibit_rules:
  [ - <inhibit_rule> ... ]

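Route (route)
A route block defines a node in the routing tree and its children. Optional configuration parameters that are not set are inherited from the parent node.

Every alert enters the routing tree at the configured top-level route, which must match all alerts (that is, it has no matchers). The alert then traverses the child nodes. If continue is set to false, traversal stops after the first matching child. If continue is true on a matching child, the alert goes on to be matched against the subsequent siblings. If an alert matches none of a node's children (because none match or none exist), it is handled according to the configuration of the current node.

Route configuration format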

[ receiver: <string> ]
[ group_by: '[' <labelname>, ... ']' ]

# Whether an alert should continue matching subsequent sibling nodes.
[ continue: <boolean> | default = false ]

# A set of equality matchers an alert has to fulfill to match the node.
match:
  [ <labelname>: <labelvalue>, ... ]

# A set of regex-matchers an alert has to fulfill to match the node.
match_re:
  [ <labelname>: <regex>, ... ]

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
[ group_wait: <duration> ]

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification
# has already been sent. (Usually ~5min or more.)
[ group_interval: <duration> ]

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more).
[ repeat_interval: <duration> ]

# Zero or more child routes.
routes:
  [ - <route> ... ]

Example:

# The root route with all parameters, which are inherited by the child
# routes if they are not overwritten.
route:
  receiver: 'default-receiver'
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  group_by: [cluster, alertname]
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
  # All alerts with service=mysql or service=cassandra
  # are dispatched to the database pager.
  - receiver: 'database-pager'
    group_wait: 10s
    match_re:
      service: mysql|cassandra
  # All alerts with the team=frontend label match this sub-route.
  # They are grouped by product and environment rather than cluster
  # and alertname.
  - receiver: 'frontend-pager'
    group_by: [product, environment]
    match:
      team: frontend
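The traversal rules for the routing tree can be sketched in Python as follows. This is a simplified illustration, not Alertmanager's code: routes are plain dictionaries, the helper names are made up for the example, and parameter inheritance (group_by, group_wait, and so on) is left out.

```python
import re


def node_matches(route, labels):
    """Check the equality and regex matchers of a single route node."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def resolve(route, labels):
    """Return the route nodes that end up handling an alert."""
    if not node_matches(route, labels):
        return []
    matched = []
    for child in route.get("routes", []):
        hits = resolve(child, labels)
        matched.extend(hits)
        # Stop at the first matching child unless it sets continue: true.
        if hits and not child.get("continue", False):
            break
    # No child matched: the current node handles the alert itself.
    return matched or [route]


# The example tree, reduced to receivers and matchers.
tree = {
    "receiver": "default-receiver",
    "routes": [
        {"receiver": "database-pager", "match_re": {"service": "mysql|cassandra"}},
        {"receiver": "frontend-pager", "match": {"team": "frontend"}},
    ],
}


def receivers_for(labels):
    return [node["receiver"] for node in resolve(tree, labels)]


print(receivers_for({"service": "mysql"}))
print(receivers_for({"team": "frontend"}))
print(receivers_for({"service": "redis"}))
```

An alert that carries both service=mysql and team=frontend would only reach 'database-pager' here, because that child does not set continue and matching stops there.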

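Inhibition Rule (inhibit_rule)
An inhibition rule mutes alerts that match one set of matchers (the target) while an alert exists that matches another set of matchers (the source). Both alerts must have identical values for the labels listed in equal.

Inhibition rule configuration format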

# Matchers that have to be fulfilled in the alerts to be muted.
target_match:
  [ <labelname>: <labelvalue>, ... ]
target_match_re:
  [ <labelname>: <regex>, ... ]

# Matchers for which one or more alerts have to exist for the
# inhibition to take effect.
source_match:
  [ <labelname>: <labelvalue>, ... ]
source_match_re:
  [ <labelname>: <regex>, ... ]

# Labels that must have an equal value in the source and target
# alert for the inhibition to take effect.
[ equal: '[' <labelname>, ... ']' ]
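To make the interplay of source, target, and equal concrete, here is a small Python sketch. It is an illustration under stated assumptions: only equality matchers are handled, and the rule, label names, and function names are hypothetical.

```python
def matches(labels, matchers):
    """True if every equality matcher is satisfied by the labels."""
    return all(labels.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule):
    """True if some firing source alert mutes the target alert."""
    if not matches(target, rule["target_match"]):
        return False
    for source in firing:
        if not matches(source, rule["source_match"]):
            continue
        # Source and target must agree on every label listed in 'equal'.
        if all(source.get(n, "") == target.get(n, "") for n in rule["equal"]):
            return True
    return False


# Hypothetical rule: a critical alert mutes warnings in the same cluster.
rule = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["cluster"],
}

firing = [{"alertname": "ClusterDown", "severity": "critical", "cluster": "eu1"}]
same_cluster = {"alertname": "HighLatency", "severity": "warning", "cluster": "eu1"}
other_cluster = {"alertname": "HighLatency", "severity": "warning", "cluster": "us1"}

print(is_inhibited(same_cluster, firing, rule))
print(is_inhibited(other_cluster, firing, rule))
```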

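Receiver (receiver)
As the name suggests, a receiver configures where alert notifications are delivered.

General configuration format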

# The unique name of the receiver.
name: <string>

# Configurations for several notification integrations.
email_configs:
  [ - <email_config>, ... ]
pagerduty_configs:
  [ - <pagerduty_config>, ... ]
slack_configs:
  [ - <slack_config>, ... ]
opsgenie_configs:
  [ - <opsgenie_config>, ... ]
webhook_configs:
  [ - <webhook_config>, ... ]

Email Receiver (email_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = false ]

# The email address to send notifications to.
to: <tmpl_string>
# The sender address.
[ from: <tmpl_string> | default = global.smtp_from ]
# The SMTP host through which emails are sent.
[ smarthost: <string> | default = global.smtp_smarthost ]

# The HTML body of the email notification.
[ html: <tmpl_string> | default = '{{ template "email.default.html" . }}' ] 

# Further email header key/value pairs. Overrides any headers
# previously set by the notification implementation.
[ headers: { <string>: <tmpl_string>, ... } ]

Slack Receiver (slack_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The Slack webhook URL.
[ api_url: <string> | default = global.slack_api_url ]

# The channel or user to send notifications to.
channel: <tmpl_string>

# API request data as defined by the Slack webhook API.
[ color: <tmpl_string> | default = '{{ if eq .Status "firing" }}danger{{ else }}good{{ end }}' ]
[ username: <tmpl_string> | default = '{{ template "slack.default.username" . }}' ]
[ title: <tmpl_string> | default = '{{ template "slack.default.title" . }}' ]
[ title_link: <tmpl_string> | default = '{{ template "slack.default.titlelink" . }}' ]
[ pretext: <tmpl_string> | default = '{{ template "slack.default.pretext" . }}' ]
[ text: <tmpl_string> | default = '{{ template "slack.default.text" . }}' ]
[ fallback: <tmpl_string> | default = '{{ template "slack.default.fallback" . }}' ]

Webhook Receiver (webhook_config)

# Whether or not to notify about resolved alerts.
[ send_resolved: <boolean> | default = true ]

# The endpoint to send HTTP POST requests to.
url: <string>

Alertmanager sends HTTP POST requests to the configured endpoint in the following format:

{
  "version": "2",
  "status": "<resolved|firing>",
  "alerts": [
    {
      "labels": <object>,
      "annotations": <object>,
      "startsAt": "<rfc3339>",
      "endsAt": "<rfc3339>"
    },
    ...
  ]
}
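On the receiving side, a webhook endpoint has to decode this JSON body. The Python sketch below shows one way to turn such a payload into readable lines. The function name, the sample payload, and its label values are all made up for illustration. Only the field names come from the format above.

```python
import json


def summarize(body):
    """Turn a webhook request body (JSON text) into one line per alert."""
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        name = labels.get("alertname", "<unnamed>")
        instance = labels.get("instance", "<unknown>")
        lines.append(
            f"[{payload['status']}] {name} on {instance} since {alert['startsAt']}"
        )
    return lines


# Hypothetical request body following the documented structure.
sample = json.dumps({
    "version": "2",
    "status": "firing",
    "alerts": [
        {
            "labels": {"alertname": "InstanceDown", "instance": "web-1:9100"},
            "annotations": {"summary": "Instance web-1:9100 down"},
            "startsAt": "2016-01-01T00:00:00Z",
            "endsAt": "0001-01-01T00:00:00Z",
        }
    ],
})

summary = summarize(sample)
print("\n".join(summary))
```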

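Alerting Rules
Alerting rules let you define alert conditions based on Prometheus expression language expressions and send notifications about firing alerts to an external service.

Defining alerting rules
Alerting rules are defined in the following format: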

ALERT <alert name>
  IF <expression>
  [ FOR <duration> ]
  [ LABELS <label set> ]
  [ ANNOTATIONS <label set> ]

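The optional FOR clause makes Prometheus wait for the given duration between first encountering a new output vector element (for example, an instance with a high HTTP error rate) and counting an alert as firing for that element. Elements that are active but not yet firing are in the pending state.

The LABELS clause specifies a set of additional labels to attach to the alert. Any existing conflicting labels are overwritten. Label values can be templated.

The ANNOTATIONS clause specifies another set of labels that do not identify an alert instance. They are used to store longer additional information, such as alert descriptions or links. Annotation values can be templated.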
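The pending and firing states implied by the FOR clause can be sketched as a tiny Python function. This is a simplified model rather than Prometheus's implementation: the function name and the use of plain numeric timestamps in seconds are assumptions of the example.

```python
def alert_state(active_since, now, for_seconds):
    """Return 'inactive', 'pending', or 'firing' for one vector element.

    active_since is the time the expression first returned this element,
    or None if the element is currently absent from the result.
    """
    if active_since is None:
        return "inactive"
    if now - active_since >= for_seconds:
        return "firing"
    return "pending"


# With FOR 5m (300 seconds): two minutes in, the alert is still pending.
print(alert_state(active_since=0, now=120, for_seconds=300))
print(alert_state(active_since=0, now=300, for_seconds=300))
```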

Alerting rule examples

# Alert for any instance that is unreachable for >5 minutes.
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "page" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

# Alert for any instance that has a median request latency >1s.
ALERT APIHighRequestLatency
  IF api_http_request_latencies_second{quantile="0.5"} > 1
  FOR 1m
  ANNOTATIONS {
    summary = "High request latency on {{ $labels.instance }}",
    description = "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)",
  }

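Sending alert notifications
Prometheus can be configured to periodically send information about alert states to an Alertmanager instance, which then takes care of dispatching the right notifications. The Alertmanager instance is specified with the -alertmanager.url command-line flag.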
