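Anyone who uses open-falcon will probably end up digging into how its alerting works, because alerting is the core function of a monitoring system and the end goal of monitoring. Understanding how a monitoring system raises alarms is therefore something every user is bound to be curious about. Not understanding something can feel like a splinter that has to be pulled out. I suspect this is less a dedication to knowledge than a touch of compulsiveness. Below is how I went about analyzing the way open-falcon handles alarm messages: preparing the environment, the analysis, the processing, and optimizing the processing.
System environment: Ubuntu 15.04 64-bit, the open-falcon source, redis, mysql, golang, gcc, and so on.
1. Set up the development environment
1.1 Install the C build environment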
sudo apt-get install build-essential
1.2 Install the golang environment
Download the free go1.4.2.linux-amd64.tar.gz from CSDN, then change into the download directory:
sudo tar -zxvf go1.4.2.linux-amd64.tar.gz -C /usr/local/
Edit /etc/profile to add the environment variables. Run sudo vi /etc/profile and append the following to the end of the file:
export GOROOT=/usr/local/go
export GOBIN=$GOROOT/bin
export PATH=$PATH:$GOBIN
export GOPATH=$HOME/goproj
Reload the environment variables:
source /etc/profile
Check the golang version:
go version
1.3 Install redis and mysql
sudo apt-get install mysql-server mysql-client libmysqlclient*
wget http://download.redis.io/releases/redis-3.0.5.tar.gz
tar zxvf redis-3.0.5.tar.gz
cd redis-3.0.5/
sudo apt-get install tcl
make
sudo make install
1.4 Build open-falcon from source
mkdir $HOME/goproj
cd $HOME/goproj
mkdir -p src/github.com
cd src/github.com
git clone --recursive https://github.com/XiaoMi/open-falcon.git
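The alarm module is used as the example here. For the other modules, see the official documentation; I will probably cover them on this blog later as well.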
cd open-falcon/alarm/
sudo chmod 777 /usr/local/go/bin/
go get ./...
./control build
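2. Analyzing the alarm messages
To analyze alarm messages, you first have to generate some. In the web UI, add a template and add an alarm rule to it, for example: alarm when free memory is below 100%. A rule like this is guaranteed to fire. Note that when you add the rule you also need to set the alarm receiver group and add the relevant users to that group. The template then has to be bound to a host group, and the monitored machines added to that host group. After that, just wait for the alarm.
2.1 Look in the redis database
Connect to redis with redis-cli and check whether any alarm messages exist:
KEYS *
It prints the following: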
1) "session:obj:fe557589a85711e58528000c29bd7b56"
2) "t:uids:4"
3) "foo"
4) "team:obj:5"
5) "team:id:alarm"
6) "user:obj:6"
7) "team:id:alarm_info"
8) "user:obj:7"
9) "user:obj:8"
10) "user:id:admin"
11) "user:obj:11"
12) "t:uids:6"
13) "t:uids:5"
14) "user:obj:1"
15) "user:obj:10"
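Seeing team:id:alarm and team:id:alarm_info, I assumed these held the generated alarm messages. But querying them with the get command does not return alarm messages, so they are not the keys the alarm messages are stored under. What now? It looks as if the alarm messages cannot be found. The next step is to see what the official documentation says.
2.2 Read the open-falcon documentation
Alarm messages are produced by the judge module. Every alarm it produces is recorded in redis, classified by alarm level, so why are there no alarms in redis? Now look at the alarm module: every time an alarm is produced it is promptly reported to the user, and the full alarm message is visible in the UI, yet none of it can be found by querying redis. The only option left is to read the redis-related source code of these two modules.
2.3 Read the open-falcon source
The judge module writes alarm messages to redis with the LPUSH command (push one or more elements onto the left end of a list), so the messages go into a redis list used as a queue, where they wait for another process to fetch them. Now things start to make sense: if the messages have already been popped off the queue, a redis query finds nothing. The alarm module fetches the messages with BRPOP (remove and return the last element of the list, blocking until one is available), which pops each alarm message off the queue and deletes it from redis.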
2.4 Modify the source to log the alarm messages
In judge, log the alarm message written to redis:
log.Printf("redis key is %v", redisKey)
log.Printf("redis value is %v", string(bs))
In alarm, log the alarm message read from redis:
log.Printf("redis key is %v", redisKey)
log.Printf("redis value is %v", string(bs))
Rebuild and restart both modules, then wait for the alarm to fire again.
2.5 Check the logs and check redis
The alarm message that judge writes to redis:
2015/12/22 14:12:48 judge.go:82: redis key is event:p0
2015/12/22 14:12:48 judge.go:83: redis value is {"id":"s_7_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":7,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c=","rightValue":100,"maxStep":3,"priority":0,"note":"memfree alarm test","tpl":{"id":2,"name":"memery","parentId":0,"actionId":1,"creator":"admin"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":33.335713095833036,"currentStep":3,"eventTime":1450764720,"pushedTags":{}}
The alarm message that alarm reads from redis:
2015/12/22 14:04:13 reader.go:65: the redis key is: [event:p0 event:p1 event:p2 event:p3 event:p4 event:p5 0]
2015/12/22 14:04:13 reader.go:66: the redis value is: [event:p0 {"id":"s_7_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":7,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c=","rightValue":100,"maxStep":3,"priority":0,"note":"memfree alarm test","tpl":{"id":2,"name":"memery","parentId":0,"actionId":1,"creator":"admin"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":38.62473525142191,"currentStep":1,"eventTime":1450764120,"pushedTags":{}}]
At this point the alarm message still cannot be found by querying redis. To be able to query it, stop the alarm module; the messages will then stay in the redis queue.
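3. Processing the alarm messages
3.1 Stop the alarm module
Once the alarm module is stopped, the alarm messages can be read from redis.
Change into the alarm directory and run: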
./control stop
After judge has recorded an alarm message, query it with redis-cli:
127.0.0.1:6379> KEYS *
1) "event:p0"
127.0.0.1:6379> TYPE event:p0
list
127.0.0.1:6379> lpop event:p0
"{\"id\":\"s_7_9e899684e61cce209c14444cfb4e33bc\",\"strategy\":{\"id\":7,\"metric\":\"mem.memfree.percent\",\"tags\":{},\"func\":\"all(#3)\",\"operator\":\"\\u003c=\",\"rightValue\":100,\"maxStep\":3,\"priority\":0,\"note\":\"memfree alarm test\",\"tpl\":{\"id\":2,\"name\":\"memery\",\"parentId\":0,\"actionId\":1,\"creator\":\"admin\"}},\"expression\":null,\"status\":\"PROBLEM\",\"endpoint\":\"bogon\",\"leftValue\":33.335713095833036,\"currentStep\":3,\"eventTime\":1450764720,\"pushedTags\":{}}"
The alarm message can also be fetched from a C program.
3.2 Fetch the alarm message in C
Connect to the redis database:
redisContext* conn = redisConnect("127.0.0.1",6379);
Fetch the alarm message:
redisReply* reply = redisCommand(conn,"BRPOP event:p0 0 ");
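Or:
redisReply* reply = redisCommand(conn,"RPOP event:p0");
BRPOP and RPOP are the redis commands for popping from a list. BRPOP blocks, and a timeout of 0 means block indefinitely; RPOP does not block. To parse the alarm message, note that redisCommand returns a redisReply, a structure defined as follows: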
/* This is the reply object returned by redisCommand() */
typedef struct redisReply {
int type; /* REDIS_REPLY_* */
long long integer; /* The integer when type is REDIS_REPLY_INTEGER */
int len; /* Length of string */
char *str; /* Used for both REDIS_REPLY_ERROR and REDIS_REPLY_STRING */
size_t elements; /* number of elements, for REDIS_REPLY_ARRAY */
struct redisReply **element; /* elements vector for REDIS_REPLY_ARRAY */
} redisReply;
Here type indicates the kind of reply, one of:
#define REDIS_REPLY_STRING 1
#define REDIS_REPLY_ARRAY 2
#define REDIS_REPLY_INTEGER 3
#define REDIS_REPLY_NIL 4
#define REDIS_REPLY_STATUS 5
#define REDIS_REPLY_ERROR 6
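BRPOP returns a REDIS_REPLY_ARRAY (type 2) with two elements: the key and the popped value. RPOP, by contrast, returns a single REDIS_REPLY_STRING, or REDIS_REPLY_NIL when the list is empty, so the loop below applies to the BRPOP reply:
for (size_t i = 0; i < reply->elements; ++i) {
    redisReply* childReply = reply->element[i];
    if (childReply->type == REDIS_REPLY_STRING)
        printf("The value is %s.\n", childReply->str);
}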
Running the program prints:
The value is event:p0.
The value is {"id":"s_3_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":3,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c","rightValue":100,"maxStep":20,"priority":0,"note":"内存使用量太大","tpl":{"id":3,"name":"local","parentId":0,"actionId":2,"creator":"root"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":27.076937721615383,"currentStep":1,"eventTime":1450851060,"pushedTags":{}}.
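Note: hiredis, the C client library for redis, must be installed for a C program to talk to redis.
3.3 Save the alarm message in C
Write the result to a file: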
fd = fopen("./alarm_info_log", "a+");  /* "rw+" is not a valid fopen mode string; "a+" opens for reading and appending */
fwrite(childReply->str, 1, childReply->len, fd);
After a successful write, check the contents of alarm_info_log:
cat alarm_info_log
event:p0{"id":"s_3_9e899684e61cce209c14444cfb4e33bc","strategy":{"id":3,"metric":"mem.memfree.percent","tags":{},"func":"all(#3)","operator":"\u003c","rightValue":100,"maxStep":20,"priority":0,"note":"内存使用量太大","tpl":{"id":3,"name":"local","parentId":0,"actionId":2,"creator":"root"}},"expression":null,"status":"PROBLEM","endpoint":"bogon","leftValue":27.076937721615383,"currentStep":1,"eventTime":1450851060,"pushedTags":{}}liang@bogon:~/redis/proc
Note: make sure the alarm_info_log file has read and write permissions.
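At this point we have fully retrieved the information open-falcon stores in redis when an alarm fires. Knowing how to get at it, you can go on to build whatever comes next.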