Redis Source Code Analysis: Replication

In Redis, a user can make one server replicate another either by running the SLAVEOF command or by setting the slaveof option. We call this master-slave replication.

See also: Redis master-slave replication, http://blog.csdn.net/honglicu123/article/details/53693395

Redis 2.8 and later support two synchronization modes: full resynchronization and partial resynchronization. The PSYNC command covers both.

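  • Full resynchronization handles the initial replication. The master creates and sends an RDB file, then sends the write commands it buffered in the meantime.
  • Partial resynchronization handles replication after a disconnection. When a slave reconnects to the master after losing the connection, the master can, if conditions allow, send only the write commands executed while the slave was disconnected. The slave receives and executes those commands, which brings its database up to the master's current state and keeps the two consistent.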
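The phrase "if conditions allow" can be made concrete with a small sketch. It assumes the master keeps a bounded history of recent writes and accepts a partial resync only when the run id sent by the slave matches its own and the requested offset is still inside that history. The type and function names here (master_view, choose_sync_mode) are hypothetical and are not Redis identifiers.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for this sketch; not Redis identifiers. */
typedef enum { SYNC_FULL, SYNC_PARTIAL } sync_mode;

typedef struct {
    const char *runid;      /* the master's current run id */
    long long backlog_off;  /* replication offset of the oldest byte kept */
    long long backlog_len;  /* number of bytes currently kept */
} master_view;

/* Decide how to serve "PSYNC <runid> <offset>" (assumed conditions). */
sync_mode choose_sync_mode(const master_view *m,
                           const char *slave_runid, long long slave_offset) {
    assert(m != NULL && slave_runid != NULL);
    /* A slave that never synced sends "?" and -1, which never matches. */
    if (strcmp(slave_runid, m->runid) != 0) return SYNC_FULL;
    /* The requested offset must still be covered by the retained history,
     * or be exactly the next byte the master will produce. */
    if (slave_offset < m->backlog_off ||
        slave_offset > m->backlog_off + m->backlog_len) return SYNC_FULL;
    return SYNC_PARTIAL;
}
```

With a history covering offsets 100 to 149, a request for offset 120 under the right run id would be served partially, while offset 99 or a different run id would fall back to a full resync.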

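Full resynchronization steps

Full resynchronization works the same way as replication with the SYNC command in older Redis versions:
1) The slave sends the SYNC command to the master.
2) On receiving SYNC, the master runs BGSAVE to generate an RDB file in the background, and uses a buffer to record every write command executed from that point on.
3) When BGSAVE finishes, the master sends the generated RDB file to the slave. The slave receives and loads it, bringing its database to the state the master was in when BGSAVE started.
4) The master sends all write commands in the buffer to the slave. The slave receives and executes them, bringing its database to the master's current state.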
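The four steps can be sketched as a toy model. This is only an illustration under simplifying assumptions: the database is a fixed array of integers, the "RDB file" is a plain copy of that array, and every name (toy_master, master_begin_snapshot, and so on) is made up for the example.

```c
#include <assert.h>
#include <string.h>

#define DB_KEYS 4   /* toy database: a fixed array of integer slots */
#define BUF_MAX 16  /* capacity of the toy write buffer */

typedef struct { int key; int value; } write_cmd;

typedef struct {
    int db[DB_KEYS];
    write_cmd buf[BUF_MAX]; /* writes executed since the snapshot began */
    int buf_len;
    int buffering;          /* nonzero while a sync is in progress */
} toy_master;

/* Step 2: take a snapshot (stands in for the RDB file), start buffering. */
void master_begin_snapshot(toy_master *m, int snapshot[DB_KEYS]) {
    memcpy(snapshot, m->db, sizeof(m->db));
    m->buf_len = 0;
    m->buffering = 1;
}

/* A write on the master; also recorded while a sync is in progress. */
void master_write(toy_master *m, int key, int value) {
    assert(key >= 0 && key < DB_KEYS);
    m->db[key] = value;
    if (m->buffering) {
        assert(m->buf_len < BUF_MAX);
        m->buf[m->buf_len].key = key;
        m->buf[m->buf_len].value = value;
        m->buf_len++;
    }
}

/* Steps 3 and 4: load the snapshot, then replay the buffered writes. */
void slave_full_resync(int slave_db[DB_KEYS], const int snapshot[DB_KEYS],
                       toy_master *m) {
    memcpy(slave_db, snapshot, sizeof(int) * DB_KEYS);
    for (int i = 0; i < m->buf_len; i++)
        slave_db[m->buf[i].key] = m->buf[i].value;
    m->buffering = 0;
    m->buf_len = 0;
}
```

Writes issued between master_begin_snapshot and slave_full_resync are missing from the snapshot but present in the buffer, which is why step 4 is needed to reach the master's current state.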

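Full resynchronization handles the initial replication and data synchronization well. When a slave merely drops offline and reconnects, however, a full resynchronization is inefficient and wastes resources: only the write commands executed while the slave was offline need to be synchronized, not the entire data set.

Disadvantages
1. Generating the RDB file consumes a lot of CPU, memory, and disk I/O on the master.
2. Sending the RDB file consumes a lot of network bandwidth, which may affect how quickly the master responds to command requests.
3. While the slave receives and loads the RDB file, it may be blocked and unable to process command requests.

Full resynchronization is therefore a very expensive operation, and Redis needs to make sure it only runs when it is really necessary.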

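Partial resynchronization

This article uses a SLAVEOF command received by a slave as the example for explaining how PSYNC is implemented.

Setting the master's address and port

When a client sends the SLAVEOF command to the slave, the slave stores the given master IP address and port in the masterhost and masterport attributes of the server state: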

struct redisServer {
    ...
    /* Replication (slave) */
    char *masterauth;               /* AUTH with this password with master */
    char *masterhost;               /* Hostname of master */
    int masterport;                 /* Port of master */
    int repl_timeout;               /* Timeout after N seconds of master idle */
    redisClient *master;     /* Client that is master for this slave */
    redisClient *cached_master; /* Cached master to be reused for PSYNC. */
    int repl_syncio_timeout; /* Timeout for synchronous I/O calls */
    int repl_state;          /* Replication status if the instance is a slave */
    off_t repl_transfer_size; /* Size of RDB to read from master during sync. */
    off_t repl_transfer_read; /* Amount of RDB read from master during sync. */
    off_t repl_transfer_last_fsync_off; /* Offset when we fsync-ed last time. */
    int repl_transfer_s;     /* Slave -> Master SYNC socket */
    int repl_transfer_fd;    /* Slave -> Master SYNC temp file descriptor */
    char *repl_transfer_tmpfile; /* Slave-> master SYNC temp file name */
    time_t repl_transfer_lastio; /* Unix time of the latest read, for timeout */
    int repl_serve_stale_data; /* Serve stale data when link is down? */
    int repl_slave_ro;          /* Slave is read only? */
    time_t repl_down_since; /* Unix time at which link with master went down */
    int repl_disable_tcp_nodelay;   /* Disable TCP_NODELAY after SYNC? */
    int slave_priority;             /* Reported in INFO and used by Sentinel. */
    char repl_master_runid[REDIS_RUN_ID_SIZE+1];  /* Master run id for PSYNC. */
    long long repl_master_initial_offset;         /* Master PSYNC offset. */
    /* Replication script cache. */
    dict *repl_scriptcache_dict;        /* SHA1 all slaves are aware of. */
    list *repl_scriptcache_fifo;        /* First in, first out LRU eviction. */
    unsigned int repl_scriptcache_size; /* Max number of elements. */
    /* Synchronous replication. */
    list *clients_waiting_acks;         /* Clients waiting in WAIT command. */
    int get_ack_from_slaves;            /* If true we send REPLCONF GETACK. */
    ...
};

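SLAVEOF is an asynchronous command. Once the attributes are set, the slave replies OK to the client, and the actual replication work only starts after that.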
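A minimal sketch of this "record now, connect later" behaviour is shown below. It is not the Redis slaveofCommand; the struct is a trimmed-down stand-in for redisServer, and toy_slaveof, REPL_CONNECT, and the error strings are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-ins for the fields shown above. */
typedef enum { REPL_NONE, REPL_CONNECT } repl_state;

typedef struct {
    char *masterhost;
    int masterport;
    repl_state state;
} toy_server;

/* Record the master address and return the reply for the client.
 * No network I/O happens here: a periodic task is expected to notice
 * REPL_CONNECT later and open the connection. The error strings are
 * made up for this sketch. */
const char *toy_slaveof(toy_server *s, const char *host, int port) {
    assert(s != NULL && host != NULL);
    if (port <= 0 || port > 65535) return "-ERR invalid port";
    char *copy = malloc(strlen(host) + 1);
    if (copy == NULL) return "-ERR out of memory";
    strcpy(copy, host);
    free(s->masterhost);   /* drop any previous master address */
    s->masterhost = copy;
    s->masterport = port;
    s->state = REPL_CONNECT;
    return "+OK";
}
```

The client gets its reply as soon as the fields are stored, which is what makes the command asynchronous from the caller's point of view.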

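Establishing the socket connection

After the SLAVEOF command finishes, the slave creates a socket connection to the master using the IP address and port set by the command.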

/* Replication cron function, called 1 time per second. */
void replicationCron(void) {
    ...
    /* Check if we should connect to a MASTER */
    if (server.repl_state == REDIS_REPL_CONNECT) {
        redisLog(REDIS_NOTICE,"Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        if (connectWithMaster() == REDIS_OK) {
            redisLog(REDIS_NOTICE,"MASTER <-> SLAVE sync started");
        }
    }
    ...
}

int connectWithMaster(void) {
    int fd;

    // Create the socket connection to the master.
    fd = anetTcpNonBlockBestEffortBindConnect(NULL,
        server.masterhost,server.masterport,REDIS_BIND_ADDR);
    if (fd == -1) {
        redisLog(REDIS_WARNING,"Unable to connect to MASTER: %s",
            strerror(errno));
        return REDIS_ERR;
    }

    // Create a file event responsible for replication between master and slave,
    // for example receiving the RDB file and the write commands the master propagates.
    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR)
    {
        close(fd);
        redisLog(REDIS_WARNING,"Can't create readable event for SYNC");
        return REDIS_ERR;
    }

    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_s = fd;
    server.repl_state = REDIS_REPL_CONNECTING;
    return REDIS_OK;
}

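If the socket created by the slave connects to the master successfully, the slave associates a file event handler (syncWithMaster) with that socket. The handler carries out the rest of the replication work, such as receiving the RDB file and the write commands propagated by the master.

Sending the PING command

Once the slave has become a client of the master, the first thing it does is send a PING command to the master. Of the two excerpts below, the replicationCron one shows the master periodically pinging its attached slaves; the handshake PING described here is sent by the slave from syncWithMaster.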

void replicationCron (void)
{
    ...
    /* If we have attached slaves, PING them from time to time.
     * So slaves can implement an explicit timeout to masters, and will
     * be able to detect a link disconnection even if the TCP connection
     * will not actually go down. */
    listIter li;
    listNode *ln;
    robj *ping_argv[1];

    /* First, send PING according to ping_slave_period. */
    if ((replication_cron_loops % server.repl_ping_slave_period) == 0) {
        ping_argv[0] = createStringObject("PING",4);
        replicationFeedSlaves(server.slaves, server.slaveseldb,
            ping_argv, 1);
        decrRefCount(ping_argv[0]);
    }

    /* Second, send a newline to all the slaves in pre-synchronization
     * stage, that is, slaves waiting for the master to create the RDB file.
     * The newline will be ignored by the slave but will refresh the
     * last-io timer preventing a timeout. In this case we ignore the
     * ping period and refresh the connection once per second since certain
     * timeouts are set at a few seconds (example: PSYNC response). */
    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START ||
            (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END &&
             server.rdb_child_type != REDIS_RDB_CHILD_TYPE_SOCKET))
        {
            if (write(slave->fd, "\n", 1) == -1) {
                /* Don't worry, it's just a ping. */
            }
        }
    }

    /* Disconnect timedout slaves. */
    if (listLength(server.slaves)) {
        listIter li;
        listNode *ln;

        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            redisClient *slave = ln->value;

            if (slave->replstate != REDIS_REPL_ONLINE) continue;
            if (slave->flags & REDIS_PRE_PSYNC) continue;
            if ((server.unixtime - slave->repl_ack_time) > server.repl_timeout)
            {
                redisLog(REDIS_WARNING, "Disconnecting timedout slave: %s",
                    replicationGetSlaveName(slave));
                freeClient(slave);
            }
        }
    }
    ...
}

void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    ...
    /* Send a PING to check the master is able to reply without errors. */
    if (server.repl_state == REDIS_REPL_CONNECTING) {
        redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        aeDeleteFileEvent(server.el,fd,AE_WRITABLE);
        server.repl_state = REDIS_REPL_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PING",NULL);
        if (err) goto write_error;
        return;
    }
    ...
}

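The PING command serves two purposes:

  • It checks that the socket can be read from and written to normally.
  • It checks that the master can process command requests normally.

If the slave reads a PONG reply, the network between master and slave is working and the slave can move on to the next replication step. As the code below shows, an authentication error (-NOAUTH, or the older "operation not permitted") is also accepted at this point, because the AUTH step comes next. In any other case the slave disconnects from the master and creates a new socket connection to it.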

/* Receive the PONG command. */
if (server.repl_state == REDIS_REPL_RECEIVE_PONG) {
    err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);

    /* We accept only two replies as valid, a positive +PONG reply
     * (we just check for "+") or an authentication error.
     * Note that older versions of Redis replied with "operation not
     * permitted" instead of using a proper error code, so we test
     * both. */
    if (err[0] != '+' &&
        strncmp(err,"-NOAUTH",7) != 0 &&
        strncmp(err,"-ERR operation not permitted",28) != 0)
    {
        redisLog(REDIS_WARNING,"Error reply to PING from master: '%s'",err);
        sdsfree(err);
        goto error;
    } else {
        redisLog(REDIS_NOTICE,
            "Master replied to PING, replication can continue...");
    }
    sdsfree(err);
    server.repl_state = REDIS_REPL_SEND_AUTH;
}

Authentication

/* AUTH with the master if required. */
    if (server.repl_state == REDIS_REPL_SEND_AUTH) {
        if (server.masterauth) {    // "AUTH server.masterauth"
            err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"AUTH",server.masterauth,NULL);
            if (err) goto write_error;
            server.repl_state = REDIS_REPL_RECEIVE_AUTH;
            return;
        } else {
            server.repl_state = REDIS_REPL_SEND_PORT;
        }
    }

    /* Receive AUTH reply. */
    if (server.repl_state == REDIS_REPL_RECEIVE_AUTH) {
        err = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
        if (err[0] == '-') {
            redisLog(REDIS_WARNING,"Unable to AUTH to MASTER: %s",err);
            sdsfree(err);
            goto error;
        }
        sdsfree(err);
        server.repl_state = REDIS_REPL_SEND_PORT;
    }

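If the slave has the masterauth option set, it authenticates with the master; otherwise it skips authentication. The following cases can occur:

  • The master has no requirepass option and the slave has no masterauth option: the master keeps executing the commands the slave sends, and replication continues.
  • The password sent by the slave matches the master's password: replication continues. If it does not match, the master returns an invalid password error.
  • The master has requirepass set but the slave has no masterauth option: the master returns a NOAUTH error. Conversely, if the master has no requirepass but the slave has masterauth set, the master returns a no password is set error.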
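The cases above can be written down as a small decision function. This is a sketch of the outcomes as the text describes them, not code from Redis; the enum and function names are hypothetical, and NULL stands for "option not set".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    AUTH_SKIPPED_OK,        /* no password on either side */
    AUTH_OK,                /* both set and equal */
    AUTH_INVALID_PASSWORD,  /* both set, but different */
    AUTH_NOAUTH,            /* master requires one, slave sends none */
    AUTH_NO_PASSWORD_SET    /* slave sends one, master has none */
} auth_outcome;

/* NULL means "option not set". */
auth_outcome replication_auth_outcome(const char *requirepass,
                                      const char *masterauth) {
    if (requirepass == NULL && masterauth == NULL) return AUTH_SKIPPED_OK;
    if (requirepass == NULL) return AUTH_NO_PASSWORD_SET;
    if (masterauth == NULL) return AUTH_NOAUTH;
    return strcmp(requirepass, masterauth) == 0 ? AUTH_OK
                                                : AUTH_INVALID_PASSWORD;
}

/* Only the first two outcomes let replication continue. */
int replication_can_continue(auth_outcome o) {
    assert(o >= AUTH_SKIPPED_OK && o <= AUTH_NO_PASSWORD_SET);
    return o == AUTH_SKIPPED_OK || o == AUTH_OK;
}
```

Note that a slave configured with masterauth against a master without requirepass does not continue here, which matches the excerpt above: any AUTH reply starting with '-' sends the handshake to the error path.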

Sending port information

    /* Set the slave port, so that Master's INFO command can list the
     * slave listening port correctly. */
    if (server.repl_state == REDIS_REPL_SEND_PORT) {
        sds port = sdsfromlonglong(server.port);
        err = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"REPLCONF",
                "listening-port",port, NULL);   // "REPLCONF listening-port 6379"
        sdsfree(port);
        if (err) goto write_error;
        sdsfree(err);
        server.repl_state = REDIS_REPL_RECEIVE_PORT;
        return;
    }

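The slave sends REPLCONF listening-port <port> to tell the master which port it is listening on. The master stores this port in the slave_listening_port attribute of the client state that represents the slave. The port value shown when a client runs INFO REPLICATION comes from this attribute.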
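On the master side, handling this command amounts to walking the option/value pairs and storing the port. The sketch below assumes that shape; toy_client and toy_replconf are hypothetical names, and the real command handler may accept more options and reply differently.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the client state that represents a slave. */
typedef struct { int slave_listening_port; } toy_client;

/* argv[0] is "REPLCONF", followed by option/value pairs.
 * Returns 0 on success and -1 on malformed input. */
int toy_replconf(toy_client *c, int argc, const char **argv) {
    assert(c != NULL && argv != NULL);
    if (argc < 3 || argc % 2 == 0) return -1;  /* need complete pairs */
    for (int j = 1; j < argc; j += 2) {
        if (strcmp(argv[j], "listening-port") == 0) {
            char *end;
            long port = strtol(argv[j + 1], &end, 10);
            if (*end != '\0' || port <= 0 || port > 65535) return -1;
            c->slave_listening_port = (int)port;
        } else {
            return -1;  /* this sketch knows a single option only */
        }
    }
    return 0;
}
```

Rejecting a malformed value before assigning keeps the previously stored port intact, so a bad REPLCONF cannot corrupt what INFO reports.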

Synchronization

    /* Try a partial resynchronization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
    if (server.repl_state == REDIS_REPL_SEND_PSYNC) {
        if (slaveTryPartialResynchronization(fd,0) == PSYNC_WRITE_ERROR) {
            err = sdsnew("Write error sending the PSYNC command.");
            goto write_error;
        }
        server.repl_state = REDIS_REPL_RECEIVE_PSYNC;
        return;
    }

    /* If reached this point, we should be in REDIS_REPL_RECEIVE_PSYNC. */
    if (server.repl_state != REDIS_REPL_RECEIVE_PSYNC) {
        redisLog(REDIS_WARNING,"syncWithMaster(): state machine error, "
                             "state should be RECEIVE_PSYNC but is %d",
                             server.repl_state);
        goto error;
    }

    psync_result = slaveTryPartialResynchronization(fd,1);
    if (psync_result == PSYNC_WAIT_REPLY) return; /* Try again later... */

    /* Note: if PSYNC does not return WAIT_REPLY, it will take care of
     * uninstalling the read handler from the file descriptor. */

    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    /* PSYNC failed or is not supported: we want our slaves to resync with us
     * as well, if we have any (chained replication case). The master may
     * transfer us an entirely different data set and we have no way to
     * incrementally feed our slaves after that. */
    disconnectSlaves(); /* Force our slaves to resync with us as well. */
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */

    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.repl_master_runid and repl_master_initial_offset are
     * already populated. */
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        redisLog(REDIS_NOTICE,"Retrying with SYNC...");
        if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }

    /* Prepare a suitable temp file for bulk transfer */
    while(maxtries--) {
        snprintf(tmpfile,256,
            "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
        dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
        if (dfd != -1) break;
        sleep(1);
    }
    if (dfd == -1) {
        redisLog(REDIS_WARNING,"Opening the temp file needed for MASTER <-> SLAVE synchronization: %s",strerror(errno));
        goto error;
    }

    /* Setup the non blocking download of the bulk file. */
    if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL)
            == AE_ERR)
    {
        redisLog(REDIS_WARNING,
            "Can't create readable event for SYNC: %s (fd=%d)",
            strerror(errno),fd);
        goto error;
    }

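As the comments in the code above explain, on a first replication there is no cached master, so the slave ends up doing a full resynchronization and obtains the master run id and the global offset. When reconnecting after a disconnection, it attempts a partial resynchronization instead. In the full resynchronization case, the slave then receives the RDB file sent by the master.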
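Putting the handshake together, the success path through the states quoted so far can be sketched as a transition table. The state names mirror the REDIS_REPL_* constants in the excerpts, but the numeric values are arbitrary, ST_DONE is a hypothetical terminal state, and the step from RECEIVE_PORT to SEND_PSYNC is an assumption because the excerpts do not show it.

```c
#include <assert.h>

/* Names mirror the REDIS_REPL_* constants quoted above; values are
 * arbitrary and ST_DONE is a made-up terminal state for this sketch. */
typedef enum {
    ST_CONNECT, ST_CONNECTING, ST_RECEIVE_PONG, ST_SEND_AUTH,
    ST_RECEIVE_AUTH, ST_SEND_PORT, ST_RECEIVE_PORT, ST_SEND_PSYNC,
    ST_RECEIVE_PSYNC, ST_DONE
} hs_state;

/* Next state on the success path. has_masterauth tells whether the
 * slave has a masterauth password configured. */
hs_state hs_next(hs_state s, int has_masterauth) {
    switch (s) {
    case ST_CONNECT:       return ST_CONNECTING;    /* socket created */
    case ST_CONNECTING:    return ST_RECEIVE_PONG;  /* PING written */
    case ST_RECEIVE_PONG:  return ST_SEND_AUTH;     /* reply accepted */
    case ST_SEND_AUTH:     return has_masterauth ? ST_RECEIVE_AUTH
                                                 : ST_SEND_PORT;
    case ST_RECEIVE_AUTH:  return ST_SEND_PORT;
    case ST_SEND_PORT:     return ST_RECEIVE_PORT;
    case ST_RECEIVE_PORT:  return ST_SEND_PSYNC;    /* assumed transition */
    case ST_SEND_PSYNC:    return ST_RECEIVE_PSYNC;
    case ST_RECEIVE_PSYNC: return ST_DONE;
    default:               return s;
    }
}

/* Count the transitions from ST_CONNECT to ST_DONE. */
int hs_steps(int has_masterauth) {
    int steps = 0;
    hs_state s = ST_CONNECT;
    while (s != ST_DONE) {
        s = hs_next(s, has_masterauth);
        steps++;
        assert(steps < 32);  /* guard against a cycle in the table */
    }
    return steps;
}
```

The only branch is at SEND_AUTH: without a masterauth password the slave skips straight to sending its port, which makes the path one transition shorter.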

#define PSYNC_WRITE_ERROR 0
#define PSYNC_WAIT_REPLY 1
#define PSYNC_CONTINUE 2
#define PSYNC_FULLRESYNC 3
#define PSYNC_NOT_SUPPORTED 4
int slaveTryPartialResynchronization(int fd, int read_reply) {
    char *psync_runid;
    char psync_offset[32];
    sds reply;

    /* Writing half */
    if (!read_reply) {
        /* Initially set repl_master_initial_offset to -1 to mark the current
         * master run_id and offset as not valid. Later if we'll be able to do
         * a FULL resync using the PSYNC command we'll set the offset at the
         * right value, so that this information will be propagated to the
         * client structure representing the master into server.master. */
        server.repl_master_initial_offset = -1;

        if (server.cached_master) {
            psync_runid = server.cached_master->replrunid;
            snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
            redisLog(REDIS_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_runid, psync_offset);
        } else {
            redisLog(REDIS_NOTICE,"Partial resynchronization not possible (no cached master)");
            psync_runid = "?";
            memcpy(psync_offset,"-1",3);
        }

        /* Issue the PSYNC command */
        /* PSYNC ? -1 */
        reply = sendSynchronousCommand(SYNC_CMD_WRITE,fd,"PSYNC",psync_runid,psync_offset,NULL);
        if (reply != NULL) {
            redisLog(REDIS_WARNING,"Unable to send PSYNC to master: %s",reply);
            sdsfree(reply);
            aeDeleteFileEvent(server.el,fd,AE_READABLE);
            return PSYNC_WRITE_ERROR;
        }
        return PSYNC_WAIT_REPLY;
    }

    /* Reading half */
    reply = sendSynchronousCommand(SYNC_CMD_READ,fd,NULL);
    if (sdslen(reply) == 0) {
        /* The master may send empty newlines after it receives PSYNC
         * and before to reply, just to keep the connection alive. */
        sdsfree(reply);
        return PSYNC_WAIT_REPLY;
    }

    aeDeleteFileEvent(server.el,fd,AE_READABLE);

    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *runid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        runid = strchr(reply,' ');
        if (runid) {
            runid++;
            offset = strchr(runid,' ');
            if (offset) offset++;
        }
        if (!runid || !offset || (offset-runid-1) != REDIS_RUN_ID_SIZE) {
            redisLog(REDIS_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * runid to make sure next PSYNCs will fail. */
            memset(server.repl_master_runid,0,REDIS_RUN_ID_SIZE+1);
        } else {
            memcpy(server.repl_master_runid, runid, offset-runid-1);
            server.repl_master_runid[REDIS_RUN_ID_SIZE] = '\0';
            server.repl_master_initial_offset = strtoll(offset,NULL,10);
            redisLog(REDIS_NOTICE,"Full resync from master: %s:%lld",
                server.repl_master_runid,
                server.repl_master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);
        return PSYNC_FULLRESYNC;
    }

    if (!strncmp(reply,"+CONTINUE",9)) {
        /* Partial resync was accepted, set the replication state accordingly */
        redisLog(REDIS_NOTICE,
            "Successful partial resynchronization with master.");
        sdsfree(reply);
        replicationResurrectCachedMaster(fd);
        return PSYNC_CONTINUE;
    }

    /* If we reach this point we received either an error since the master does
     * not understand PSYNC, or an unexpected reply from the master.
     * Return PSYNC_NOT_SUPPORTED to the caller in both cases. */

    if (strncmp(reply,"-ERR",4)) {
        /* If it's not an error, log the unexpected event. */
        redisLog(REDIS_WARNING,
            "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        redisLog(REDIS_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    replicationDiscardCachedMaster();
    return PSYNC_NOT_SUPPORTED;
}

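The slaveTryPartialResynchronization function sends PSYNC and handles the possible replies from the master. If this is the slave's first replication from the master, or if SLAVEOF NO ONE was executed earlier, the slave sends PSYNC ? -1 to request a full resynchronization. Otherwise it sends PSYNC <runid> <offset> to request a partial resynchronization. The master's reply determines what happens next:

  • +FULLRESYNC <runid> <offset>: master and slave perform a full resynchronization. runid is the master's run id, which the slave saves for its next PSYNC command. offset is the master's current replication offset, which the slave uses as its own initial offset.
  • +CONTINUE: master and slave perform a partial resynchronization.
  • -ERR: the master is typically older than 2.8 and does not recognize PSYNC. The code treats this, like any other unexpected reply, as PSYNC_NOT_SUPPORTED, and the slave falls back to SYNC for a full resynchronization.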
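The reply handling summarized in this list can be tried in isolation with a standalone classifier. It follows the same string checks as the excerpt but is only a sketch: the enum and function names are hypothetical, and the run id length of 40 is an assumption, since the excerpt only refers to it as REDIS_RUN_ID_SIZE.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define RUN_ID_LEN 40  /* assumed run id length for this sketch */

typedef enum { REPLY_WAIT, REPLY_FULLRESYNC, REPLY_CONTINUE,
               REPLY_NOT_SUPPORTED } psync_reply;

/* Classify one reply line (without the trailing CRLF). For a well-formed
 * +FULLRESYNC, the run id and offset are copied out; otherwise runid is
 * left empty and *offset is -1. */
psync_reply classify_psync_reply(const char *reply,
                                 char runid[RUN_ID_LEN + 1],
                                 long long *offset) {
    assert(reply != NULL && runid != NULL && offset != NULL);
    runid[0] = '\0';
    *offset = -1;
    if (reply[0] == '\0') return REPLY_WAIT;  /* keep-alive newline */
    if (strncmp(reply, "+CONTINUE", 9) == 0) return REPLY_CONTINUE;
    if (strncmp(reply, "+FULLRESYNC", 11) == 0) {
        const char *id = strchr(reply, ' ');
        const char *off = id ? strchr(id + 1, ' ') : NULL;
        if (id && off && (off - (id + 1)) == RUN_ID_LEN) {
            memcpy(runid, id + 1, RUN_ID_LEN);
            runid[RUN_ID_LEN] = '\0';
            *offset = strtoll(off + 1, NULL, 10);
        }
        return REPLY_FULLRESYNC;
    }
    return REPLY_NOT_SUPPORTED;  /* -ERR or anything unexpected */
}
```

As in the excerpt, a +FULLRESYNC with broken syntax still counts as a full resync, but the run id is left blank so that a later PSYNC with it cannot succeed.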

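Command propagation

Once synchronization is complete, master and slave enter the command propagation phase. From then on, the master keeps sending every write command it executes to the slave, and the slave keeps receiving and executing them. This keeps the two databases consistent.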

void syncCommand (redisClient* c)
{
    ...
    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        if (masterTryPartialResynchronization(c) == REDIS_OK) {
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        } else {
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            if (master_runid[0] != '?') server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        c->flags |= REDIS_PRE_PSYNC;
    }
    ...
}

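The replication backlog is a circular array that behaves like a fixed-size first-in, first-out queue: once the array is full, new data overwrites the oldest data.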

/* Feed the slave 'c' with the replication backlog starting from the
 * specified 'offset' up to the end of the backlog. */
long long addReplyReplicationBacklog(redisClient *c, long long offset) {
    long long j, skip, len;

    redisLog(REDIS_DEBUG, "[PSYNC] Slave request offset: %lld", offset);

    if (server.repl_backlog_histlen == 0) {
        redisLog(REDIS_DEBUG, "[PSYNC] Backlog history len is zero");
        return 0;
    }

    redisLog(REDIS_DEBUG, "[PSYNC] Backlog size: %lld",
             server.repl_backlog_size);
    redisLog(REDIS_DEBUG, "[PSYNC] First byte: %lld",
             server.repl_backlog_off);
    redisLog(REDIS_DEBUG, "[PSYNC] History len: %lld",
             server.repl_backlog_histlen);
    redisLog(REDIS_DEBUG, "[PSYNC] Current index: %lld",
             server.repl_backlog_idx);

    /* Compute the amount of bytes we need to discard. */
    skip = offset - server.repl_backlog_off;
    redisLog(REDIS_DEBUG, "[PSYNC] Skipping: %lld", skip);

    /* Point j to the oldest byte, that is actually our
     * server.repl_backlog_off byte. */
    j = (server.repl_backlog_idx +
        (server.repl_backlog_size-server.repl_backlog_histlen)) %
        server.repl_backlog_size;
    redisLog(REDIS_DEBUG, "[PSYNC] Index of first byte: %lld", j);

    /* Discard the amount of data to seek to the specified 'offset'. */
    j = (j + skip) % server.repl_backlog_size;

    /* Feed slave with data. Since it is a circular buffer we have to
     * split the reply in two parts if we are cross-boundary. */
    len = server.repl_backlog_histlen - skip;
    redisLog(REDIS_DEBUG, "[PSYNC] Reply total length: %lld", len);
    while(len) {
        long long thislen =
            ((server.repl_backlog_size - j) < len) ?
            (server.repl_backlog_size - j) : len;

        redisLog(REDIS_DEBUG, "[PSYNC] addReply() length: %lld", thislen);
        addReplySds(c,sdsnewlen(server.repl_backlog + j, thislen));
        len -= thislen;
        j = 0;
    }
    return server.repl_backlog_histlen - skip;
}
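To see the wrap-around in action, here is a self-contained toy version of the circular backlog. It uses the same index arithmetic as the function above, but the buffer is only 8 bytes, the feed routine is written for this example, and all names (toy_backlog, backlog_feed, backlog_read) are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define BACKLOG_SIZE 8  /* tiny on purpose so the wrap-around is visible */

typedef struct {
    char buf[BACKLOG_SIZE];
    long long idx;         /* next write position inside buf */
    long long histlen;     /* valid bytes, at most BACKLOG_SIZE */
    long long off;         /* replication offset of the oldest valid byte */
    long long master_off;  /* total number of bytes ever fed */
} toy_backlog;

void backlog_init(toy_backlog *b) {
    memset(b, 0, sizeof(*b));
    b->off = 1;  /* offsets are 1-based in this sketch */
}

/* Append len bytes, overwriting the oldest data once the buffer is full. */
void backlog_feed(toy_backlog *b, const char *p, long long len) {
    for (long long i = 0; i < len; i++) {
        b->buf[b->idx] = p[i];
        b->idx = (b->idx + 1) % BACKLOG_SIZE;
        if (b->histlen < BACKLOG_SIZE) b->histlen++;
    }
    b->master_off += len;
    b->off = b->master_off - b->histlen + 1;
}

/* Copy everything from replication offset `offset` to the end of the
 * backlog into out. Returns the byte count, or -1 when `offset` is not
 * covered by the backlog, in which case a full resync would be needed. */
long long backlog_read(const toy_backlog *b, long long offset,
                       char *out, long long cap) {
    if (offset < b->off || offset > b->off + b->histlen) return -1;
    long long skip = offset - b->off;
    long long len = b->histlen - skip;
    assert(len <= cap);
    /* Index of the oldest byte, then seek forward by skip. */
    long long j = (b->idx + (BACKLOG_SIZE - b->histlen)) % BACKLOG_SIZE;
    j = (j + skip) % BACKLOG_SIZE;
    for (long long i = 0; i < len; i++)
        out[i] = b->buf[(j + i) % BACKLOG_SIZE];
    return len;
}
```

After feeding "abcdef" and then "ghijk", the first three bytes have been overwritten: a read from offset 4 returns "defghijk", while a read from offset 3 fails because that byte is gone.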

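Heartbeat detection

During the command propagation phase, the slave sends the following command to the master, by default once per second:

REPLCONF ACK <replication_offset>

replication_offset is the slave's current replication offset. Sending this command serves three purposes:

  • Checking the state of the network connection between master and slave
  • Helping implement the min-slaves options
  • Detecting lost commands

The replicationCron function in replication.c runs once per second: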

void replicationCron (void)
{
    ...
    /* Send ACK to master from time to time.
     * Note that we do not send periodic acks to masters that don't
     * support PSYNC and replication offsets. */
    if (server.masterhost && server.master &&
        !(server.master->flags & REDIS_PRE_PSYNC))
        replicationSendAck();

    ...
}

This shows that a Redis slave sends an ACK to the master once per second.

/* Send a REPLCONF ACK command to the master to inform it about the current
 * processed offset. If we are not connected with a master, the command has
 * no effects. */
void replicationSendAck(void) {
    redisClient *c = server.master;

    if (c != NULL) {
        c->flags |= REDIS_MASTER_FORCE_REPLY;
        addReplyMultiBulkLen(c,3);
        addReplyBulkCString(c,"REPLCONF");
        addReplyBulkCString(c,"ACK");
        addReplyBulkLongLong(c,c->reploff);
        c->flags &= ~REDIS_MASTER_FORCE_REPLY;
    }
}

reploff is the slave's replication offset.
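The three addReply calls above build a multi-bulk request: an argument count followed by three length-prefixed strings. The sketch below produces the same byte layout into a plain buffer so it can be inspected; encode_replconf_ack is a hypothetical helper written for this example, not a Redis function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "REPLCONF ACK <offset>" in multi-bulk form into out.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if out is too small. */
int encode_replconf_ack(char *out, size_t cap, long long offset) {
    char num[32];
    assert(out != NULL);
    int numlen = snprintf(num, sizeof(num), "%lld", offset);
    int n = snprintf(out, cap,
                     "*3\r\n$8\r\nREPLCONF\r\n$3\r\nACK\r\n$%d\r\n%s\r\n",
                     numlen, num);
    if (n < 0 || (size_t)n >= cap) return -1;
    return n;
}
```

The offset travels as a decimal string with its own length prefix, so an offset of 1234 is sent as `$4` followed by `1234`.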

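Checking the state of the network connection between master and slave

If the master goes more than 1 second without receiving a REPLCONF ACK command from a slave, it concludes that something is wrong with the network connection between them.

Sending INFO REPLICATION to the master lists each slave with a lag field, which shows how many seconds have passed since that slave last sent a REPLCONF ACK command.

Helping implement the min-slaves options

In the Redis configuration file: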

min-slaves-to-write 3
min-slaves-max-lag 10

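These two parameters require at least 3 slaves with a lag <= 10 seconds. In other words, when fewer than three slaves are connected with a lag of at most 10 seconds, the master refuses to execute write commands.

This is implemented in the processCommand function in redis.c: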

int processCommand (redisClient *c)
{
    ...
    /* Don't accept write commands if there are not enough good slaves and
     * user configured the min-slaves-to-write option. */
    if (server.masterhost == NULL &&
        server.repl_min_slaves_to_write &&
        server.repl_min_slaves_max_lag &&
        c->cmd->flags & REDIS_CMD_WRITE &&
        server.repl_good_slaves_count < server.repl_min_slaves_to_write)
    {
        flagTransaction(c);
        addReply(c, shared.noreplicaserr);  //-NOREPLICAS Not enough good slaves to write.\r\n
        return REDIS_OK;
    }
    ...
}

If the condition is not met, the master replies -NOREPLICAS Not enough good slaves to write.

The refreshGoodSlavesCount(void) function in replication.c updates the repl_good_slaves_count attribute.

/* This function counts the number of slaves with lag <= min-slaves-max-lag.
 * If the option is active, the server will prevent writes if there are not
 * enough connected slaves with the specified lag (or less). */
void refreshGoodSlavesCount(void) {
    listIter li;
    listNode *ln;
    int good = 0;

    if (!server.repl_min_slaves_to_write ||
        !server.repl_min_slaves_max_lag) return;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;
        time_t lag = server.unixtime - slave->repl_ack_time;

        if (slave->replstate == REDIS_REPL_ONLINE &&
            lag <= server.repl_min_slaves_max_lag) good++;
    }
    server.repl_good_slaves_count = good;
}
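The two excerpts can be combined into a small standalone model to check the boundary case of a lag of exactly 10 seconds. The structure and function names below are hypothetical; only the comparison logic follows the excerpts.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for the per-slave client state. */
typedef struct {
    int online;       /* nonzero once the slave has finished syncing */
    time_t last_ack;  /* time of its latest REPLCONF ACK */
} toy_slave;

/* Count online slaves whose lag is at most max_lag seconds. */
int count_good_slaves(const toy_slave *slaves, int n,
                      time_t now, time_t max_lag) {
    int good = 0;
    assert(slaves != NULL || n == 0);
    for (int i = 0; i < n; i++) {
        time_t lag = now - slaves[i].last_ack;
        if (slaves[i].online && lag <= max_lag) good++;
    }
    return good;
}

/* Nonzero if a write should be refused. Setting either option to 0
 * disables the check, as in the excerpts above. */
int must_refuse_write(int good, int min_slaves, time_t max_lag) {
    if (min_slaves == 0 || max_lag == 0) return 0;
    return good < min_slaves;
}
```

With min-slaves-to-write 3 and min-slaves-max-lag 10, a slave whose last ACK arrived exactly 10 seconds ago still counts as good, while one at 11 seconds does not.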

Detecting lost commands

References

  1. Redis Design and Implementation (Huang Jianhong)
  2. Redis replication design notes, http://antirez.com/news/106
  3. Redis replication design notes, http://antirez.com/news/31