Preface
As the ops engineer I need to watch the main site's 5xx and 4xx rates every day and track which API endpoint is misbehaving. The previous setup used nginx's own health/status module to sample the request counters, analyzed them, and wrote the results into InfluxDB; Grafana read that data to chart the 5xx rate, and every minute the 5xx count was also pushed into Open-Falcon for threshold alerting. But this was still not real time, and whenever something broke I had to go grep the raw nginx logs by hand, which was tedious.
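The per-minute Open-Falcon push described above can be sketched roughly as follows. This is a minimal sketch, not the original script: it assumes the standard falcon-agent push API on port 1988, and the metric name, endpoint name, and the way the 5xx count is obtained are all hypothetical.

```python
# Sketch of the pre-ELK approach: push a per-minute 5xx count into
# Open-Falcon via the local falcon-agent push API.
# The metric/endpoint names here are illustrative, not the originals.
import json
import time
import urllib.request

FALCON_PUSH_URL = "http://127.0.0.1:1988/v1/push"  # default falcon-agent port


def build_payload(endpoint, count_5xx, step=60):
    """Build an Open-Falcon push body for one GAUGE metric."""
    return [{
        "endpoint": endpoint,            # reporting host
        "metric": "nginx.5xx.count",     # hypothetical metric name
        "timestamp": int(time.time()),
        "step": step,                    # reporting interval in seconds
        "value": count_5xx,
        "counterType": "GAUGE",
        "tags": "",
    }]


def push(payload):
    """POST the payload to the local falcon-agent."""
    req = urllib.request.Request(
        FALCON_PUSH_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req, timeout=5)

# usage: push(build_payload("web-1", 42)) once per minute from cron
```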
Version 1 architecture
All API logs are shipped via rsyslog to a log server, where a single Logstash instance reads the log directory, filters out the 5xx entries, enriches them with GeoIP, and writes them into Elasticsearch. This held up fine in normal operation, but any sudden burst of 5xx or 4xx events made the ES write latency balloon, and at exactly that moment the fast-alerting system, which reads the 5xx counts and endpoints from ES, would go blind.
input {
  file {
    path => "/data/logs/nginx/*/One/*.log"
    codec => json
    discover_interval => "10"
    close_older => "5"
    sincedb_path => "/data/logs/.sincedb/.sincedb"
  }
}
filter {
  if [clientip] != "-" {
    geoip {
      source => "clientip"
      target => "geoip"
      database => "/usr/local/logstash-6.5.4/config/GeoIP2-City.mmdb"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}
output {
  if [status] in ["500","502","404","401","403","499","504"] {
    elasticsearch {
      hosts => ["172.16.8.166:9200"]
      index => "logstash-nginx-%{+YYYY.MM.dd}"
    }
    stdout {
      codec => rubydebug
    }
  }
}
Version 2 architecture
The front end was switched to Filebeat reading the full log stream, roughly 25 GB of logs per hour. (The stock Filebeat from the official site could not keep up; I later found an optimized Filebeat build online whose author measured about 25 MB/s.) Behind it sits a Kafka cluster with an 8-partition topic, consumed by two Logstash nodes. The monitoring chart shows about 8.5K QPS off-peak and roughly 13K at peak, which proves Filebeat can comfortably handle the current volume. If latency ever builds up again, adding Logstash and ES nodes is enough, so scaling out is easy; for now the two Logstash consumers keep up with no backlog.
The Filebeat configuration. I had also tried having Filebeat write directly to multiple Logstash instances, but that performed even worse than version 1.
filebeat.registry_file: "/srv/registry"
filebeat.spool_size: 25000
filebeat.publish_async: true
filebeat.queue_size: 10000
filebeat.prospectors:
  - input_type: log
    paths:
      - /data/logs/nginx/*/One/*.log
    scan_frequency: 1s
    tail_files: true
    idle_timeout: 5s
    json.keys_under_root: true
    json.overwrite_keys: true
    harvester_buffer_size: 409600
    enabled: true
output.kafka:
  hosts: ["in-prod-common-uploadmonitor-1:9092"]
  topic: "nginxAccessLog"
  enabled: true
The Logstash configuration:
input {
  kafka {
    bootstrap_servers => "in-prod-common-uploadmonitor-1:9092"
    client_id => "logstash_nginx_log_group1"
    group_id => "logstash_nginx_log_group"
    consumer_threads => 2
    decorate_events => true
    topics => ["nginxAccessLog"]
    codec => "json"
  }
}
filter {
  if [clientip] != "-" {
    geoip {
      source => "clientip"
      target => "geoip"
      database => "/usr/local/logstash/config/GeoIP2-City.mmdb"
      add_field => [ "[geoip][coordinates]", "%{[geoip][longitude]}" ]
      add_field => [ "[geoip][coordinates]", "%{[geoip][latitude]}" ]
    }
    mutate {
      convert => [ "[geoip][coordinates]", "float" ]
    }
  }
}
output {
  if [status] in ["500","502","404","401","403","499"] {
    elasticsearch {
      hosts => ["172.16.8.166:9200"]
      index => "logstash-nginx-%{+YYYY.MM.dd}"
    }
    stdout {
      codec => rubydebug
    }
  }
}
Kibana: real-time 5xx/4xx geographic distribution and per-host endpoint dashboards
An alerting tool continuously pulls the latest 5xx entries and flags the affected endpoints, giving second-level alerting and greatly reducing the ops workload.
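The alert poller can be sketched as a small loop that queries Elasticsearch for 5xx documents written in the last minute, bucketed by endpoint. The ES host and index pattern follow the Logstash config above; the status-field values and the `request.keyword` aggregation field are assumptions about the nginx JSON log schema, and the threshold is hypothetical:

```python
# Sketch of a second-level 5xx alert poller against the ES cluster above.
# Field names (status, request.keyword) are assumptions about the log schema.
import json
import urllib.request

ES_URL = "http://172.16.8.166:9200"


def build_query(statuses=("500", "502", "504"), minutes=1):
    """Count recent 5xx docs and surface the top offending endpoints."""
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"terms": {"status": list(statuses)}},
                    {"range": {"@timestamp": {"gte": "now-%dm" % minutes}}},
                ]
            }
        },
        # top-10 endpoints by 5xx count; field name is an assumption
        "aggs": {"by_uri": {"terms": {"field": "request.keyword", "size": 10}}},
    }


def fetch_5xx(index_pattern="logstash-nginx-*"):
    """Run the query via the standard _search endpoint and return the JSON."""
    req = urllib.request.Request(
        "%s/%s/_search" % (ES_URL, index_pattern),
        data=json.dumps(build_query()).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)

# usage: call fetch_5xx() every few seconds; alert when
# result["hits"]["total"] exceeds a threshold (e.g. 50 in the last minute).
```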