Monitoring ZooKeeper with Zabbix Any Way You Want: The Only Guide You Need

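These are notes I put together a while ago. I cleaned them up today and am republishing them to cover all kinds of quirky requirements.
1. Use case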

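Our company's current business does not use ZooKeeper much as a coordination service. However, we are going to deploy Redis clusters with Codis, and Codis depends on ZooKeeper to store its configuration, so monitoring ZooKeeper properly matters.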

2. What to monitor in ZooKeeper

System monitoring

Memory usage: ZooKeeper should run entirely in memory and must never use swap. The Java heap size must not exceed the available physical memory.

Swap usage: swapping degrades ZooKeeper performance; set vm.swappiness = 0.

Network bandwidth: if ZooKeeper performance drops, check bandwidth usage and packet loss. A typical ZooKeeper workload is roughly 20% writes and 80% reads.

Disk usage: keep an eye on how full the ZooKeeper data directory is.

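Disk I/O: ZooKeeper writes its transaction log synchronously (a write is flushed to disk before it is acknowledged), while snapshots are written in the background. Its I/O volume is usually modest, but write latency depends directly on disk latency, so if ZooKeeper shares a disk with other I/O-intensive services, watch disk I/O closely.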
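The swap checks above can be scripted on the ZooKeeper host. Below is a minimal Python 3 sketch. It assumes Linux, where /proc/meminfo reports SwapTotal and SwapFree in kB and /proc/sys/vm/swappiness holds the current setting. The function names are illustrative and do not come from the original scripts.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        key, rest = line.split(':', 1)
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])  # drop the trailing "kB" unit
    return info


def swap_used_kb(meminfo):
    """Swap in use = SwapTotal - SwapFree (both in kB)."""
    return meminfo.get('SwapTotal', 0) - meminfo.get('SwapFree', 0)


def read_host_swap_status(meminfo_path='/proc/meminfo',
                          swappiness_path='/proc/sys/vm/swappiness'):
    """Return (swap_used_kb, vm.swappiness) for the local Linux host."""
    with open(meminfo_path) as f:
        used = swap_used_kb(parse_meminfo(f.read()))
    with open(swappiness_path) as f:
        swappiness = int(f.read().strip())
    return used, swappiness
```

If read_host_swap_status() returns a non-zero first value, the ZooKeeper host is swapping. The Zabbix agent's built-in system.swap.size[,used] item can also report swap usage without a script.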

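ZooKeeper monitoring

zk_avg/min/max_latency    Time taken to respond to a client request, in milliseconds; alert when it exceeds 10 ticks (10 × tickTime)

zk_outstanding_requests        Number of queued requests. It grows when load exceeds what ZooKeeper can process; an alert threshold of 10 is suggested

zk_packets_received      Number of packets received from clients

zk_packets_sent        Number of packets sent to clients, mainly responses and notifications

zk_max_file_descriptor_count   Maximum number of files the process may open, controlled by ulimit

zk_open_file_descriptor_count    Number of open files; alert when it exceeds 85% of the allowed maximum

Mode                The server's role: standalone if it has not joined an ensemble, otherwise follower or leader (shown as Mode by stat/srvr and as zk_server_state by mntr)

zk_followers          Only reported by the leader: the number of followers in the ensemble. The normal value is the ensemble size minus 1

zk_pending_syncs       Only reported by the leader: the number of pending syncs

zk_znode_count         Number of znodes

zk_watch_count         Number of watches

Java Heap Size         Heap size of the ZooKeeper Java process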
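The alert rules above can be expressed as one check over parsed mntr values. This is a minimal Python 3 sketch. It assumes metrics is a dict of mntr keys to values with numeric fields already converted to int. You supply ensemble_size yourself. tick_time_ms should match tickTime in your zoo.cfg, and 2000 is the sample config's value. The function name is hypothetical.

```python
def check_zk_metrics(metrics, ensemble_size, tick_time_ms=2000):
    """Return alert messages for the thresholds listed above.

    metrics       -- dict of mntr keys to values (ints for numeric fields)
    ensemble_size -- number of servers in the ensemble (you supply this)
    tick_time_ms  -- tickTime from zoo.cfg; latency alerts fire above 10 ticks
    """
    alerts = []

    # Latency is reported in milliseconds; alert above 10 ticks
    for key in ('zk_avg_latency', 'zk_max_latency'):
        if metrics.get(key, 0) > 10 * tick_time_ms:
            alerts.append('{} above 10 ticks'.format(key))

    # Requests are piling up faster than the server can process them
    if metrics.get('zk_outstanding_requests', 0) > 10:
        alerts.append('zk_outstanding_requests above 10')

    # Open file descriptors above 85% of the ulimit
    max_fd = metrics.get('zk_max_file_descriptor_count', 0)
    open_fd = metrics.get('zk_open_file_descriptor_count', 0)
    if max_fd and open_fd > 0.85 * max_fd:
        alerts.append('zk_open_file_descriptor_count above 85% of limit')

    # Only the leader reports zk_followers; it should equal ensemble size - 1
    if metrics.get('zk_server_state') == 'leader':
        if metrics.get('zk_followers', 0) != ensemble_size - 1:
            alerts.append('zk_followers != ensemble size - 1')

    return alerts
```

In Zabbix itself, you would normally express these rules as trigger expressions on the items. The sketch only makes the arithmetic explicit.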

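3. First, learn how to query ZooKeeper's status

Show server statistics, including whether this node is a follower or the leader:
echo stat|nc 127.0.0.1 2181
Test whether the server is running; a reply of imok means it is up:
echo ruok|nc 127.0.0.1 2181
List outstanding sessions and ephemeral nodes:
echo dump| nc 127.0.0.1 2181
Shut down the server:
echo kill | nc 127.0.0.1 2181
Print details of the server configuration:
echo conf | nc 127.0.0.1 2181
List full connection/session details for all clients connected to the server:
echo cons | nc 127.0.0.1 2181
Print details about the server environment (as opposed to the conf command):
echo envi |nc 127.0.0.1 2181
List outstanding requests:
echo reqs | nc 127.0.0.1 2181
List information about the server's watches:
echo wchs | nc 127.0.0.1 2181
List watches by session: the output is a list of sessions with their associated watches:
echo wchc | nc 127.0.0.1 2181
List watches by path: the output is a list of paths with their associated sessions:
echo wchp | nc 127.0.0.1 2181
Sample output of ruok, mntr and srvr: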
#echo ruok|nc 127.0.0.1 2181
imok
#echo mntr|nc 127.0.0.1 2181
zk_version  3.4.6-1569965, built on 02/20/2014 09:09 GMT
zk_avg_latency  0
zk_max_latency  0
zk_min_latency  0
zk_packets_received 11
zk_packets_sent 10
zk_num_alive_connections    1
zk_outstanding_requests 0
zk_server_state leader
zk_znode_count  17159
zk_watch_count  0
zk_ephemerals_count 1
zk_approximate_data_size    6666471
zk_open_file_descriptor_count   29
zk_max_file_descriptor_count    102400
zk_followers    2
zk_synced_followers 2
zk_pending_syncs    0
#echo srvr|nc 127.0.0.1 2181
Zookeeper version: 3.4.6-1569965, built on 02/20/2014 09:09 GMT
Latency min/avg/max: 0/0/0
Received: 26
Sent: 25
Connections: 1
Outstanding: 0
Zxid: 0x500000000
Mode: leader
Node count: 17159
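Unlike mntr, srvr prints human-readable "Key: value" lines, and some values contain colons themselves. For example, the version line ends in "09:09 GMT". Splitting such a line on every colon truncates the version, so the parser must split on the first colon only. Below is a minimal Python 3 sketch. The function names and the three latency_* keys are illustrative choices, not part of ZooKeeper.

```python
def _to_number(text):
    """'0' -> 0, '2.5' -> 2.5 (in case a server prints a fractional average)."""
    return int(text) if text.isdigit() else float(text)


def parse_srvr(text):
    """Parse `srvr` output into a dict, splitting each line on its first colon."""
    result = {}
    for line in text.splitlines():
        if ':' not in line:
            continue
        key, value = line.split(':', 1)  # the version value itself contains ':'
        result[key.strip()] = value.strip()

    # "Latency min/avg/max: 0/0/0" -> three separate numeric fields
    latency = result.pop('Latency min/avg/max', None)
    if latency:
        lat_min, lat_avg, lat_max = (_to_number(v) for v in latency.split('/'))
        result.update(latency_min=lat_min, latency_avg=lat_avg, latency_max=lat_max)
    return result
```

With the srvr sample above, parse_srvr(...)['Mode'] is 'leader', and the version string survives intact.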

4. Writing the Zabbix script and configuration for monitoring ZooKeeper

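I based this on another author's article, which monitors with zabbix_sender. I will introduce that approach first. At the end I give my modified script and template, which use the zabbix_agent approach instead.
(1) From the other author's article: there are two ways to get this monitoring data into Zabbix. One is to fetch each item separately through the Zabbix agent, using either active or passive checks. The other is to push all the values at once with zabbix_sender. We choose the second. A script that pushes everything at once with zabbix_sender cannot be written the same way as a script the agent calls once per item.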

First, collect the monitored items into a dictionary. Then iterate over it and send each key:value pair with zabbix_sender's -k and -o options.

echo mntr|nc 127.0.0.1 2181

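This command can be run through Python's subprocess module. Alternatively, you can use the socket module to connect to port 2181 and send the command directly. Either way, the mntr output then has to be converted into a dictionary.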
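The socket route can be sketched in Python 3 as follows. It is an illustrative alternative to nc and is not taken from the original scripts. The function sends the four-letter word and reads until the server closes the connection, because a single recv() may return only part of a long reply such as mntr. The host, port and timeout defaults are assumptions. On ZooKeeper 3.5.3 and later, four-letter words other than srvr must first be enabled with 4lw.commands.whitelist in zoo.cfg.

```python
import socket


def send_4lw(cmd, host='127.0.0.1', port=2181, timeout=2.0):
    """Send a ZooKeeper four-letter word and return the full reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd.encode('ascii'))
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:  # server closed the connection: reply is complete
                break
            chunks.append(chunk)
    return b''.join(chunks).decode('utf-8', errors='replace')
```

For example, send_4lw('ruok') returns imok on a healthy server, and send_4lw('mntr') returns the text shown earlier in this article.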

In other words, data in this format:

zk_version 3.4.6-1569965, built on 02/20/2014 09:09 GMT
zk_avg_latency 0
zk_max_latency 0
zk_min_latency 0
zk_packets_received 91
zk_packets_sent 90
zk_num_alive_connections 1
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count 17159
zk_watch_count 0
zk_ephemerals_count 1
zk_approximate_data_size 6666471
zk_open_file_descriptor_count 27
zk_max_file_descriptor_count 102400

has to be converted into this:

{'zk_followers': 2, 'zk_outstanding_requests': 0, 'zk_approximate_data_size': 6666471, 'zk_packets_sent': 2089, 'zk_pending_syncs': 0, 'zk_avg_latency': 0, 'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 'zk_watch_count': 2, 'zk_packets_received': 2090, 'zk_open_file_descriptor_count': 30, 'zk_server_ruok': 'imok', 'zk_server_state': 'leader', 'zk_synced_followers': 2, 'zk_max_latency': 28, 'zk_num_alive_connections': 2, 'zk_min_latency': 0, 'zk_ephemerals_count': 1, 'zk_znode_count': 17159, 'zk_max_file_descriptor_count': 102400}
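That conversion can be sketched in Python 3 as shown below. The sketch assumes the mntr text has already been fetched by nc, subprocess or a socket. It splits each line on the first run of whitespace, so it accepts both the real tab-separated output and the space-aligned copy shown above. Numeric values become int (or float) to match the dict above, and everything else stays a string. This is an illustrative rewrite, not the original author's code.

```python
def parse_mntr(text):
    """Turn `mntr` output into a dict, converting numeric values."""
    result = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # key, then the rest of the line
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        key, value = parts[0], parts[1].strip()
        for convert in (int, float):
            try:
                result[key] = convert(value)
                break
            except ValueError:
                pass
        else:
            result[key] = value  # non-numeric, e.g. zk_version, zk_server_state
    return result
```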

Finally, the data sent with zabbix_sender looks like the following.

Here zookeeper.status[zk_version] is the item key name, and the value follows the colon:

zookeeper.status[zk_outstanding_requests]:0
zookeeper.status[zk_approximate_data_size]:6666471
zookeeper.status[zk_packets_sent]:48
zookeeper.status[zk_avg_latency]:0
zookeeper.status[zk_version]:3.4.6-1569965, built on 02/20/2014 09:09 GMT
zookeeper.status[zk_watch_count]:0
zookeeper.status[zk_packets_received]:49
zookeeper.status[zk_open_file_descriptor_count]:27
zookeeper.status[zk_server_ruok]:imok
zookeeper.status[zk_server_state]:follower
zookeeper.status[zk_max_latency]:0
zookeeper.status[zk_num_alive_connections]:1
zookeeper.status[zk_min_latency]:0
zookeeper.status[zk_ephemerals_count]:1
zookeeper.status[zk_znode_count]:17159
zookeeper.status[zk_max_file_descriptor_count]:102400
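Calling zabbix_sender once per key starts one process per metric. zabbix_sender can also read many values at once with -i, one "<hostname> <key> <value>" line per value. A hostname of - means "use the Hostname from the config file passed with -c", and -i - reads from standard input. Below is a minimal Python 3 sketch that builds such input from the dict above. Quoting values that contain spaces (such as zk_version) and escaping backslashes and quotes follow my reading of the zabbix_sender manual, so check them against your Zabbix version. The function names are hypothetical. The paths are the ones used by the script in this article.

```python
import subprocess


def format_sender_lines(stats, host='-', key_fmt='zookeeper.status[{}]'):
    """Build zabbix_sender -i input: one '<host> <key> <value>' line per metric."""
    lines = []
    for name in sorted(stats):
        value = str(stats[name])
        if not value or any(c.isspace() for c in value) or '"' in value:
            # Quote empty values and values with whitespace or quotes; escape \ and "
            value = '"' + value.replace('\\', '\\\\').replace('"', '\\"') + '"'
        lines.append('{} {} {}'.format(host, key_fmt.format(name), value))
    return '\n'.join(lines) + '\n'


def send_all(stats, sender='/opt/app/zabbix/sbin/zabbix_sender',
             conf='/opt/app/zabbix/conf/zabbix_agentd.conf'):
    """Push every metric with a single zabbix_sender run, reading from stdin."""
    proc = subprocess.run([sender, '-c', conf, '-i', '-'],
                          input=format_sender_lines(stats),
                          universal_newlines=True,
                          stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return proc.returncode
```

Sending everything in one run also means all values for a poll reach Zabbix together, instead of trickling in one process at a time.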

Stripped-down code:

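#!/usr/bin/python
# Python 2: fetch "mntr" from the local ZooKeeper and build two dicts
import socket

s = socket.socket()
s.connect(('localhost', 2181))
s.send('mntr')
# read until the server closes the connection; a single recv(2048) can truncate long output
chunks = []
while True:
    chunk = s.recv(4096)
    if not chunk:
        break
    chunks.append(chunk)
s.close()
data_mntr = ''.join(chunks)

result = {}
zresult = {}
for line in data_mntr.splitlines():
    if '\t' not in line:
        continue  # skip blank or malformed lines
    key, value = map(str.strip, line.split('\t', 1))
    zkey = 'zookeeper.status[' + key + ']'  # Zabbix item key
    result[key] = value
    zresult[zkey] = value
print result
print '\n\n'
print zresult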

#python test.py
{'zk_outstanding_requests': '0', 'zk_approximate_data_size': '6666471', 'zk_max_latency': '0', 'zk_avg_latency': '0', 'zk_version': '3.4.6-1569965, built on 02/20/2014 09:09 GMT', 'zk_watch_count': '0', 'zk_num_alive_connections': '1', 'zk_open_file_descriptor_count': '27', 'zk_server_state': 'follower', 'zk_packets_sent': '542', 'zk_packets_received': '543', 'zk_min_latency': '0', 'zk_ephemerals_count': '1', 'zk_znode_count': '17159', 'zk_max_file_descriptor_count': '102400'}

{'zookeeper.status[zk_watch_count]': '0', 'zookeeper.status[zk_avg_latency]': '0', 'zookeeper.status[zk_max_latency]': '0', 'zookeeper.status[zk_approximate_data_size]': '6666471', 'zookeeper.status[zk_server_state]': 'follower', 'zookeeper.status[zk_num_alive_connections]': '1', 'zookeeper.status[zk_min_latency]': '0', 'zookeeper.status[zk_outstanding_requests]': '0', 'zookeeper.status[zk_packets_received]': '543', 'zookeeper.status[zk_ephemerals_count]': '1', 'zookeeper.status[zk_znode_count]': '17159', 'zookeeper.status[zk_packets_sent]': '542', 'zookeeper.status[zk_open_file_descriptor_count]': '27', 'zookeeper.status[zk_max_file_descriptor_count]': '102400', 'zookeeper.status[zk_version]': '3.4.6-1569965, built on 02/20/2014 09:09 GMT'}

Full code:

#!/usr/bin/python

""" Check Zookeeper Cluster

zookeeper version should be newer than 3.4.x

#echo mntr|nc 127.0.0.1 2181
zk_version  3.4.6-1569965, built on 02/20/2014 09:09 GMT
zk_avg_latency  0
zk_max_latency  4
zk_min_latency  0
zk_packets_received 84467
zk_packets_sent 84466
zk_num_alive_connections    3
zk_outstanding_requests 0
zk_server_state follower
zk_znode_count  17159
zk_watch_count  2
zk_ephemerals_count 1
zk_approximate_data_size    6666471
zk_open_file_descriptor_count   29
zk_max_file_descriptor_count    102400

#echo ruok|nc 127.0.0.1 2181
imok

"""

import sys
import socket
import re
import subprocess
from StringIO import StringIO
import os

zabbix_sender = '/opt/app/zabbix/sbin/zabbix_sender'
zabbix_conf = '/opt/app/zabbix/conf/zabbix_agentd.conf'
send_to_zabbix = 1

############# get zookeeper server status
class ZooKeeperServer(object):

    def __init__(self, host='localhost', port='2181', timeout=1):
        self._address = (host, int(port))
        self._timeout = timeout
        self._result  = {}

    def _create_socket(self):
        return socket.socket()

    def _send_cmd(self, cmd):
        """ Send a 4letter word command to the server """
        s = self._create_socket()
        s.settimeout(self._timeout)

        s.connect(self._address)
        s.send(cmd)

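        # read until the server closes the connection; a single
        # recv(2048) can truncate long replies such as mntr
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        s.close()

        return ''.join(chunks)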

    def get_stats(self):
        """ Get ZooKeeper server stats as a map """
        # start from empty dicts so a missing reply cannot raise NameError
        result_mntr = {}
        result_ruok = {}
        data_mntr = self._send_cmd('mntr')
        data_ruok = self._send_cmd('ruok')
        if data_mntr:
            result_mntr = self._parse(data_mntr)
        if data_ruok:
            result_ruok = self._parse_ruok(data_ruok)

        self._result = {}
        self._result.update(result_mntr)
        self._result.update(result_ruok)

        ##### these three metrics are only exposed by the leader; set them to 0 on followers
        for key in ('zk_followers', 'zk_synced_followers', 'zk_pending_syncs'):
            self._result.setdefault(key, 0)

        return self._result

    def _parse(self, data):
        """ Parse the output from the 'mntr' 4letter word command """
        h = StringIO(data)

        result = {}
        for line in h.readlines():
            try:
                key, value = self._parse_line(line)
                result[key] = value
            except ValueError:
                pass # ignore broken lines

        return result

    def _parse_ruok(self, data):
        """ Parse the output from the 'ruok' 4letter word command """

        h = StringIO(data)

        result = {}

        ruok = h.readline()
        if ruok:
           result['zk_server_ruok'] = ruok

        return result

    def _parse_line(self, line):
        try:
            key, value = map(str.strip, line.split('\t'))
        except ValueError:
            raise ValueError('Found invalid line: %s' % line)

        if not key:
            raise ValueError('The key is mandatory and should not be empty')

        try:
            value = int(value)
        except (TypeError, ValueError):
            pass

        return key, value

    def get_pid(self):
#ps -ef|grep java|grep zookeeper|awk '{print $2}'
         pidarg = '''ps -ef|grep java|grep zookeeper|grep -v grep|awk '{print $2}' ''' 
         pidout = subprocess.Popen(pidarg,shell=True,stdout=subprocess.PIPE)
         pid = pidout.stdout.readline().strip('\n')
         return pid

    def send_to_zabbix(self, metric):
         key = "zookeeper.status[" +  metric + "]"

         if send_to_zabbix > 0:
             #print key + ":" + str(self._result[metric])
             try:

                subprocess.call([zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", str(self._result[metric]) ], stdout=FNULL, stderr=FNULL, shell=False)
             except OSError, detail:
                print "Something went wrong while executing zabbix_sender : ", detail
         else:
                print "Simulation: the following command would be executed :\n", zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", self._result[metric], "\n"

def usage():
        """Display program usage"""

        print "\nUsage : ", sys.argv[0], " alive|all"
        print "Modes : \n\talive : Return pid of running zookeeper\n\tall : Send zookeeper stats as well"
        sys.exit(1)

accepted_modes = ['alive', 'all']

if len(sys.argv) == 2 and sys.argv[1] in accepted_modes:
        mode = sys.argv[1]
else:
        usage()

zk = ZooKeeperServer()
#print zk.get_stats()
pid = zk.get_pid()

if pid != "" and  mode == 'all':
   zk.get_stats()
   #print zk._result
   FNULL = open(os.devnull, 'w')
   for key in zk._result:
       zk.send_to_zabbix(key)
   FNULL.close()
   print pid

elif pid != "" and mode == "alive":
    print pid
else:
    print 0

Zabbix agent configuration file check_zookeeper.conf:

UserParameter=zookeeper.status[*],/usr/bin/python /opt/app/zabbix/sbin/check_zookeeper.py $1

Restart zabbix_agentd and the monitoring is in place.

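5. Note: the method above is not the one I suggest you follow. Want to know how to do it with the agent (zabbix_agentd) instead?
I won't repeat the theory. Here is the script: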

#!/usr/bin/python  
#Author:Lin hu chong chong chong

""" Check Zookeeper Cluster  

zookeeper version should be newer than 3.4.x  

# echo mntr|nc 127.0.0.1 2181  
zk_version  3.4.6-1569965, built on 02/20/2014 09:09 GMT  
zk_avg_latency  0  
zk_max_latency  4  
zk_min_latency  0
zk_packets_received 84467  
zk_packets_sent 84466  
zk_num_alive_connections    3  
zk_outstanding_requests 0  
zk_server_state follower  
zk_znode_count  17159  
zk_watch_count  2  
zk_ephemerals_count 1  
zk_approximate_data_size    6666471  
zk_open_file_descriptor_count   29  
zk_max_file_descriptor_count    102400  

# echo ruok|nc 127.0.0.1 2181  
imok  

"""  
import sys  
import socket  
import re  
import subprocess  
from StringIO import StringIO  
import os  

zabbix_sender = '/data/zabbix/bin/zabbix_sender'  
zabbix_conf = '/data/zabbix/etc/zabbix_agentd.conf'  
send_to_zabbix = 1  

############# get zookeeper server status  
class ZooKeeperServer(object):

    def __init__(self, host='localhost', port='2181', timeout=1):
        self._address = (host, int(port))
        self._timeout = timeout  
        self._result  = {}  

    def _create_socket(self):  
        return socket.socket()  

    def _send_cmd(self, cmd):  
        """ Send a 4letter word command to the server """  
        s = self._create_socket()  
        s.settimeout(self._timeout)  

        s.connect(self._address)  
        s.send(cmd)  

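        # read until the server closes the connection; a single
        # recv(2048) can truncate long replies such as mntr
        chunks = []
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
        s.close()

        return ''.join(chunks)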

    def get_stats(self):
        """ Get ZooKeeper server stats as a map """
        # start from empty dicts so a missing reply cannot raise NameError
        result_mntr = {}
        result_ruok = {}
        data_mntr = self._send_cmd('mntr')
        data_ruok = self._send_cmd('ruok')
        if data_mntr:
            result_mntr = self._parse(data_mntr)
        if data_ruok:
            result_ruok = self._parse_ruok(data_ruok)

        self._result = {}
        self._result.update(result_mntr)
        self._result.update(result_ruok)

        ##### these three metrics are only exposed by the leader; set them to 0 on followers
        for key in ('zk_followers', 'zk_synced_followers', 'zk_pending_syncs'):
            self._result.setdefault(key, 0)

        return self._result

    def _parse(self, data):  
        """ Parse the output from the 'mntr' 4letter word command """  
        h = StringIO(data)  

        result = {}  
        for line in h.readlines():  
            try:  
                key, value = self._parse_line(line)  
                result[key] = value  
            except ValueError:  
                pass # ignore broken lines  

        return result  

    def _parse_ruok(self, data):  
        """ Parse the output from the 'ruok' 4letter word command """  

        h = StringIO(data)  

        result = {}  

        ruok = h.readline()  
        if ruok:  
           result['zk_server_ruok'] = ruok  

        return result  

    def _parse_line(self, line):  
        try:  
            key, value = map(str.strip, line.split('\t'))  
        except ValueError:  
            raise ValueError('Found invalid line: %s' % line)  

        if not key:  
            raise ValueError('The key is mandatory and should not be empty')  

        try:  
            value = int(value)  
        except (TypeError, ValueError):  
            pass  

        return key, value  

    def get_pid(self):
         """ Collect mntr, srvr and ruok output plus the ZooKeeper pid into one dict """
         arg_dict = {}

         # mntr: one "key<TAB>value" pair per line
         pidarg = '''echo mntr | nc 127.0.0.1 2181 '''
         out = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE).communicate()[0]
         for line in out.splitlines():
             al = line.split("\t", 1)
             if len(al) != 2:
                 continue  # skip blank or malformed lines
             if al[0] == 'zk_version':
                 # keep only the release number, e.g. "3.4.6"
                 arg_dict[al[0]] = al[1].split('-')[0]
             else:
                 arg_dict[al[0]] = al[1]

         # srvr: "Key: value" lines; split on the first colon only,
         # because the version line itself contains "09:09"
         pidarg = '''echo srvr | nc 127.0.0.1 2181 '''
         out = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE).communicate()[0]
         for line in out.splitlines():
             if ':' not in line:
                 continue
             k, v = line.split(":", 1)
             arg_dict[k.strip()] = v.strip()

         # ruok: "imok" when the server is healthy
         pidarg = '''echo ruok | nc 127.0.0.1 2181 '''
         out = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE).communicate()[0]
         arg_dict['ruok'] = out.strip()

         # pid of the ZooKeeper JVM, returned for both "all" and "alive"
         pidarg = '''ps -ef|grep java|grep zookeeper|grep -v grep|awk '{print $2}' '''
         out = subprocess.Popen(pidarg, shell=True, stdout=subprocess.PIPE).communicate()[0]
         pid = out.splitlines()[0].strip() if out.strip() else ''
         arg_dict['all'] = pid
         arg_dict['alive'] = pid
         return arg_dict

    def send_to_zabbix(self, metric):  
         key = "zookeeper.status[" +  metric + "]"  

         if send_to_zabbix > 0:  
             #print key + ":" + str(self._result[metric])  
             try:  

                subprocess.call([zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", str(self._result[metric]) ], stdout=FNULL, stderr=FNULL, shell=False)  
             except OSError, detail:
                print "Something went wrong while executing zabbix_sender : ", detail
         else:
                print "Simulation: the following command would be executed :\n", zabbix_sender, "-c", zabbix_conf, "-k", key, "-o", self._result[metric], "\n"

def usage():  
        """Display program usage"""  

        print "\nUsage : ", sys.argv[0], " alive|all"  
        print "Modes : \n\talive : Return pid of running zookeeper\n\tall : Send zookeeper stats as well"  
        sys.exit(1)  

if len(sys.argv) == 2:
        mode = sys.argv[1]
else:
        usage()

zk = ZooKeeperServer()
pid = zk.get_pid()

if pid and mode in pid.keys():
    print pid.get(mode)

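That's the script. Now add the key:

UserParameter=zookeeper.status[*],/usr/bin/python /etc/zabbix/scripts/check_zookeeper.py $1
Restart zabbix_agentd, then query the key with zabbix_get from the Zabbix server. It should return normally.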

6. Add the template in the web UI to finish
My modified template can be downloaded directly from my Baidu Netdisk:
https://pan.baidu.com/s/1eI-A74h4egXEqgbvO-JiiA password: bxw0
