慢查詢引起的redis-clusterdown

一、現象

1. 報警：我們的redis私有云有幾百個redis實例，對於每個實例的存活，都有定時任務會每隔5分鐘，重試多次ping命令去檢查它的存活性，比如多次都ping不通，這時候就會收到報警短信： xxx系統-實例(10.10.xx.xx:xxxx)-由運行中變爲心跳停止。這時候我就會去線上看一下出現了哪些問題。

2. 檢測代碼：比較簡易，但是比較有效。

public boolean isRun(final String ip, final int port) {
        boolean isRun = new IdempotentConfirmer() {
            private int timeOutFactor = 1;

            @Override
            public boolean execute() {
                Jedis jedis = new Jedis(ip, port);
                try {
                    jedis.getClient().setConnectionTimeout(Protocol.DEFAULT_TIMEOUT * (timeOutFactor++));
                    jedis.getClient().setSoTimeout(Protocol.DEFAULT_TIMEOUT * (timeOutFactor++));
                    String pong = jedis.ping();
                    return pong != null && pong.equalsIgnoreCase("PONG");
                } catch (JedisDataException e) {
                    String message = e.getMessage();
                    logger.warn(e.getMessage());
                    if (StringUtils.isNotBlank(message) && message.startsWith("LOADING")) {
                        return true;
                    }
                    return false;
                } catch (Exception e) {
                    logger.warn("{}:{} error message is {} ", ip, port, e.getMessage());
                    return false;
                } finally {
                    jedis.close();
                }
            }
        }.run();
        return isRun;
    }

一個簡易的冪等操作器：

package com.sohu.cache.util;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * 冪等操作器
 */
public abstract class IdempotentConfirmer {
    private final Logger logger = LoggerFactory.getLogger(this.getClass());

    private int retry = 3;

    protected IdempotentConfirmer(int retry) {
        this.retry = retry;
    }

    public IdempotentConfirmer() {
    }

    public abstract boolean execute();

    public boolean run() {
        while (retry-- > 0) {
            try {
                boolean isOk = execute();
                if (isOk){
                    return true;
                }
            } catch (Exception e) {
                logger.error(e.getMessage(), e);
                continue;
            }
        }
        return false;
    }
}

二、Redis-Cluster的拓撲結構

1. 拓撲圖：

2. 拓撲表

nodeid	ip:port	replication status
b6073bedf256d45e1dce97cd9242bb4789d52343	10.10.81.94:7497	master
efc1dfb19e120a2f41e0cec3d03d02f43e168238	10.10.79.150:6389	slave
792a356d3cd1474b8f1374aba1e5ceced1c6868d	10.10.81.95:7497	master
472823a5f04ad10027b3cde2672cc33f5accdf99	10.10.79.157:6390	slave
19544b078897d8999b68a41f2538a20cbf84288d	10.10.83.180:6382	master
74d5cda02571651bba185cbc1b10d5810d048ac3	10.10.81.96:6391	slave
b95759d39b6714544917e4aefa383b4de80f871c	10.10.81.97:7499	master
3379f2da275825c9ce4e476820254ccaf7b997d1	10.10.78.52:6398	slave
5711268a4881b723273e618894ba9276b27a297f	10.10.81.98:7496	master
5b2aa80014f168fd26c246104b5f6adf6426275e	10.10.78.53:6396	slave

三、日誌分析：

1. [10:35:09] 10.10.81.97:7499（b95759d39b6714544917e4aefa383b4de80f871c）發現10.10.81.94:7497（b6073bedf256d45e1dce97cd9242bb4789d52343）不可達：

4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).

2.10.10.81.97:7499（b95759d39b6714544917e4aefa383b4de80f871c）通過gossip協議廣播10.10.81.94:7497（b6073bedf256d45e1dce97cd9242bb4789d52343）失敗的信息，其他實例收到了這個流言：

29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343

3.投票：（在redisCluster中，只有Master會參與投票，因爲從節點只是負責備份用），進入新的epoch

9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18

redis源碼：

// 主機已經是一個出錯節點了，自己作爲從機可以升級爲主機
void clusterHandleSlaveFailover(void) {
    ......
    // 故障修復超時，重新啓動故障修復
    if (auth_age > auth_retry_time) { // 兩次故障修復間隔不能過短
        // 更新一些時間
        ......
        redisLog(REDIS_WARNING,
        "Start of election delayed for %lld milliseconds "
        "(rank #%d, offset %lld).",
        server.cluster->failover_auth_time - mstime(),
        server.cluster->failover_auth_rank,
        replicationGetSlaveOffset());
        // 告知其他從機
        /* Now that we have a scheduled election, broadcast our offset
        * to all the other slaves so that they'll updated their offsets
        * if our offset is better. */
        clusterBroadcastPong(CLUSTER_BROADCAST_LOCAL_SLAVES);
        return;
    }
    ......
    // 開頭投票
    /* Ask for votes if needed. */
    if (server.cluster->failover_auth_sent == 0) {
        server.cluster->currentEpoch++;
        server.cluster->failover_auth_epoch = server.cluster->currentEpoch;
        redisLog(REDIS_WARNING,"Starting a failover election for epoch %llu.",
        (unsigned long long) server.cluster->currentEpoch);
        clusterRequestFailoverAuth();
        server.cluster->failover_auth_sent = 1;
        clusterDoBeforeSleep(CLUSTER_TODO_SAVE_CONFIG|
        CLUSTER_TODO_UPDATE_STATE|
        CLUSTER_TODO_FSYNC_CONFIG);
        return; /* Wait for replies. */
    }
}

4. 投票結果:

(1) 10.10.79.150:6389是新的master, 成功執行了failover

3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352).
3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18.
3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master.
3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover

(2) 原來的master:10.10.81.94:7497(主) ，收到了判決書：節點失敗(fail), 授權10.10.79.150:6389爲它的master, 修改自己的cluster nodes, 沒有了slave, cluster status ok, 開始主從複製...

3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail
3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up
3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238
3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost.
3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok
.................

(3)新的主從通信成功，主從關係確定

5. 其他機器開始，認定新的slave

7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

6. 完整日誌

(1). 10.10.81.94:7497(主)

3022:M 26 Nov 10:35:15.560 # Cluster state changed: fail
3022:M 26 Nov 10:35:15.562 # Failover auth denied to efc1dfb19e120a2f41e0cec3d03d02f43e168238: its master is up
3022:M 26 Nov 10:35:15.595 # Configuration change detected. Reconfiguring myself as a replica of efc1dfb19e120a2f41e0cec3d03d02f43e168238
3022:S 26 Nov 10:35:15.595 # Connection with slave 10.10.79.150:6389 lost.
3022:S 26 Nov 10:35:15.616 # Cluster state changed: ok
3022:S 26 Nov 10:35:16.061 * Connecting to MASTER 10.10.79.150:6389
3022:S 26 Nov 10:35:16.061 * MASTER <-> SLAVE sync started
3022:S 26 Nov 10:35:16.062 * Non blocking connect for SYNC fired the event.
3022:S 26 Nov 10:35:16.062 * Master replied to PING, replication can continue...
3022:S 26 Nov 10:35:16.062 * Partial resynchronization not possible (no cached master)
3022:S 26 Nov 10:35:16.062 * Full resync from master: 80ac45a2b236fc44ebbef3d851e535bcbdd6367b:21516181353
3022:S 26 Nov 10:35:45.273 * MASTER <-> SLAVE sync: receiving 494097946 bytes from master
3022:S 26 Nov 10:35:51.052 * MASTER <-> SLAVE sync: Flushing old data
3022:S 26 Nov 10:36:22.165 * MASTER <-> SLAVE sync: Loading DB in memory
3022:S 26 Nov 10:37:44.355 * MASTER <-> SLAVE sync: Finished with success
3022:S 26 Nov 10:37:44.440 * Background append only file rewriting started by pid 4036
3022:S 26 Nov 10:38:34.988 * AOF rewrite child asks to stop sending diffs.
4036:C 26 Nov 10:38:34.988 * Parent agreed to stop sending diffs. Finalizing AOF...
4036:C 26 Nov 10:38:34.988 * Concatenating 0.02 MB of AOF diff received from parent.
4036:C 26 Nov 10:38:34.989 * SYNC append only file rewrite performed
4036:C 26 Nov 10:38:35.029 * AOF rewrite: 5 MB of memory used by copy-on-write
3022:S 26 Nov 10:38:35.127 * Background AOF rewrite terminated with success
3022:S 26 Nov 10:38:35.127 * Residual parent diff successfully flushed to the rewritten AOF (0.00 MB)
3022:S 26 Nov 10:38:35.128 * Background AOF rewrite finished successfully

(2) 10.10.79.150:6389(從)

3173:S 26 Nov 10:35:10.934 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
3173:S 26 Nov 10:35:11.003 # Start of election delayed for 515 milliseconds (rank #0, offset 21516181352).
3173:S 26 Nov 10:35:11.606 # Starting a failover election for epoch 18.
3173:S 26 Nov 10:35:11.660 # Failover election won: I'm the new master.
3173:S 26 Nov 10:35:11.660 # configEpoch set to 18 after successful failover
3173:M 26 Nov 10:35:11.660 # Connection with master lost.
3173:M 26 Nov 10:35:11.660 * Caching the disconnected master state.
3173:M 26 Nov 10:35:11.660 * Discarding previously cached master state.
3173:M 26 Nov 10:35:15.605 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
3173:M 26 Nov 10:35:16.071 * Slave 10.10.81.94:7497 asks for synchronization
3173:M 26 Nov 10:35:16.071 * Full resync requested by slave 10.10.81.94:7497
3173:M 26 Nov 10:35:16.071 * Starting BGSAVE for SYNC with target: disk
3173:M 26 Nov 10:35:16.188 * Background saving started by pid 23531
3173:M 26 Nov 10:35:18.091 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3173:M 26 Nov 10:35:31.056 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
3173:M 26 Nov 10:35:40.040 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
23531:C 26 Nov 10:35:45.048 * DB saved on disk
23531:C 26 Nov 10:35:45.113 * RDB: 6 MB of memory used by copy-on-write
3173:M 26 Nov 10:35:45.281 * Background saving terminated with success
3173:M 26 Nov 10:35:51.059 * Synchronization with slave 10.10.81.94:7497 succeeded
3173:M 26 Nov 10:36:11.107 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
3173:M 26 Nov 10:36:22.277 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(3) 10.10.81.95:7497(主)

14795:M 26 Nov 10:35:10.309 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
14795:M 26 Nov 10:35:11.034 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
14795:M 26 Nov 10:35:14.979 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
14795:M 26 Nov 10:36:20.480 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
14795:M 26 Nov 10:36:31.652 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(4) 10.10.79.157:6390(從)

29177:S 26 Nov 10:35:10.939 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
29177:S 26 Nov 10:35:15.609 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
29177:S 26 Nov 10:36:11.111 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
29177:S 26 Nov 10:36:22.281 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(5) 10.10.83.180:6382(主)

9611:M 26 Nov 10:35:11.342 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
9611:M 26 Nov 10:35:12.068 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
9611:M 26 Nov 10:35:16.013 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
9611:M 26 Nov 10:36:11.516 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
9611:M 26 Nov 10:36:22.686 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(6) 10.10.81.96:6391(從)

7925:S 26 Nov 10:35:10.260 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
7925:S 26 Nov 10:35:14.930 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
7925:S 26 Nov 10:36:20.879 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
7925:S 26 Nov 10:35:33.603 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(7) 10.10.81.97:7499(主)

4051:M 26 Nov 10:35:09.768 * Marking node b6073bedf256d45e1dce97cd9242bb4789d52343 as failing (quorum reached).
4051:M 26 Nov 10:35:10.507 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
4051:M 26 Nov 10:35:14.452 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
4051:M 26 Nov 10:36:09.955 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
4051:M 26 Nov 10:36:21.125 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(8) 10.10.78.52:6398(從)

22972:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
22972:S 26 Nov 10:35:15.635 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
22972:S 26 Nov 10:36:11.116 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
22972:S 26 Nov 10:36:22.286 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(9) 10.10.81.98:7496(主)

2062:M 26 Nov 10:35:10.908 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
2062:M 26 Nov 10:35:11.634 # Failover auth granted to efc1dfb19e120a2f41e0cec3d03d02f43e168238 for epoch 18
2062:M 26 Nov 10:35:15.599 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
2062:M 26 Nov 10:36:11.080 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
2062:M 26 Nov 10:36:22.250 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

(10) 10.10.78.53:6396(從)

19527:S 26 Nov 10:35:10.944 * FAIL message received from b95759d39b6714544917e4aefa383b4de80f871c about b6073bedf256d45e1dce97cd9242bb4789d52343
19527:S 26 Nov 10:35:15.615 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: master without slots is reachable again.
19527:S 26 Nov 10:36:11.117 * FAIL message received from 792a356d3cd1474b8f1374aba1e5ceced1c6868d about b6073bedf256d45e1dce97cd9242bb4789d52343
19527:S 26 Nov 10:36:22.287 * Clear FAIL state for node b6073bedf256d45e1dce97cd9242bb4789d52343: slave is reachable again.

四、原因查找

redis節點被認爲失敗，可能有如下原因：

1. 機器本身問題：負載過高，網絡阻塞，cpu過高等等。

通過tsar命令，查詢機器當時各個指標，未發現有什麼問題：

Time           ---cpu-- ---mem-- ---tcp-- -----traffic---- --sda--- --sdb--- --sdc--- --sdd--- --sde--- --sdf--- --sdg--- --sdh--- --sdi--- --sdj--- --sdk--- --sdl--- --sdm---  ---load- 
Time             util     util   retran    bytin  bytout     util     util     util     util     util     util     util     util     util     util     util     util     util     load1    
26/11/15-10:20   4.07    67.89     0.00     1.7M    1.7M     0.05     0.00     1.75     0.00     8.07     0.00     0.00     0.00     0.00     0.00     0.38     0.25     0.00      1.77   
26/11/15-10:25   4.11    67.89     0.00     1.7M    1.6M     0.01     0.00     1.95     0.00     7.45     0.00     0.00     0.00     0.00     0.00     3.47     0.27     0.00      0.94   
26/11/15-10:30   4.37    67.89     0.00     1.7M    1.7M     0.18     0.00     2.00     0.00     7.79     0.00     0.00     0.00     0.00     0.00     5.31     0.15     0.00      0.94   
26/11/15-10:35   4.86    68.38     0.00     1.8M    1.7M     0.47     0.00     2.40     0.00     7.99     0.00     0.00     0.00     0.00     0.00     3.06     0.15     0.00      1.22   
26/11/15-10:40   7.29    67.66     0.00     2.9M    1.6M     1.55     0.00     2.92     0.00     8.46     0.00     0.00     0.00     0.00     0.00     0.87     0.11     0.00      1.14   
26/11/15-10:45   4.08    67.68     0.00     1.2M    1.6M     0.09     0.00     0.34     0.00     9.12     0.00     0.00     0.00     0.00     0.00     0.25     1.35     0.00      0.46

2. cluster-node-timeout: 這個我們設置的是15秒，一直未發現問題，從別人的分享看，這個數字已經足夠了。

3. 被判斷失敗的redis本身的問題：阻塞操作(比較常見的有: aof阻塞、慢查詢）：從redis的日誌看沒有發生aof阻塞的情況，但是從慢查詢的日誌中，發現瞭如下，瞬間就明白了，許多慢查詢，redis的單線程什麼都堵住了。

五、注意：
本文的日誌時間做過糾正，實際有些機器沒有做時間同步，造成日誌實際對不上，不利排查，而且在分佈式系統中，時間的一致性通常非常重要，所以這個要注意。

參考文獻：

Random and small clusterdown(github redis)
redis集羣定時clusterdown分析

慢查詢引起的redis-clusterdown

SQL優化-20231016

緩存使用與設計系列文章--目錄

好文章列表（持續）

Redis Sentinel安裝、配置和部署

美團在Redis上踩過的一些坑-3.redis內存佔用飆升

慢查詢引起的redis-clusterdown

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結