TiKV Source Code Analysis Series (18): A Walkthrough of Commit and Apply for a Raft Proposal

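After the earlier articles in this series, you should have a basic understanding of raft-rs, the core Raft library used by TiKV. raft-rs implements the core Raft functionality such as leader election and log replication, while sending and receiving messages and applying entries to the state machine are left to the user. This article describes how TiKV handles those parts.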

Raft Ready

Before getting to the main topic, let us briefly review the interface between raft-rs and external code: Ready. The Ready struct is defined as follows:

pub struct Ready {
    /// The current volatile state of a Node.
    /// SoftState will be nil if there is no update.
    /// It is not required to consume or store SoftState.
    ss: Option<SoftState>,

    /// The current state of a Node to be saved to stable storage BEFORE
    /// Messages are sent.
    /// HardState will be equal to empty state if there is no update.
    hs: Option<HardState>,

    /// States can be used for node to serve linearizable read requests locally
    /// when its applied index is greater than the index in ReadState.
    /// Note that the read_state will be returned when raft receives MsgReadIndex.
    /// The returned is only valid for the request that requested to read.
    read_states: Vec<ReadState>,

    /// Entries specifies entries to be saved to stable storage BEFORE
    /// Messages are sent.
    entries: Vec<Entry>,

    /// Snapshot specifies the snapshot to be saved to stable storage.
    snapshot: Snapshot,

    /// CommittedEntries specifies entries to be committed to a
    /// store/state-machine. These have previously been committed to stable
    /// store.
    pub committed_entries: Option<Vec<Entry>>,

    /// Messages specifies outbound messages to be sent AFTER Entries are
    /// committed to stable storage.
    /// If it contains a MsgSnap message, the application MUST report back to raft
    /// when the snapshot has been received or has failed by calling ReportSnapshot.
    pub messages: Vec<Message>,

    must_sync: bool,
}

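The Ready struct carries a series of Raft state updates. The ones that matter for this article are:

hs: updates to Raft metadata, such as the current term, the vote, and the committed index.
committed_entries: newly committed log entries that need to be applied to the state machine.
messages: Raft messages that need to be sent to other peers.
entries: log entries that need to be persisted.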
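The ordering stated in the field comments (persist entries and HardState before sending messages, then apply committed entries) can be sketched as follows. This is a minimal sketch using simplified stand-in types: `MiniReady`, `Node`, `handle_ready`, and the string-typed messages are hypothetical names chosen for illustration and are not the raft-rs or TiKV API.

```rust
/// Simplified stand-ins for the raft-rs types (hypothetical, illustration only).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    pub term: u64,
    pub index: u64,
    pub data: Vec<u8>,
}

#[derive(Debug, Clone, Default, PartialEq)]
pub struct HardState {
    pub term: u64,
    pub vote: u64,
    pub commit: u64,
}

#[derive(Debug, Default)]
pub struct MiniReady {
    pub hs: Option<HardState>,
    pub entries: Vec<Entry>,
    pub committed_entries: Vec<Entry>,
    pub messages: Vec<String>,
}

/// Records what the application did, in order, so the sequence can be inspected.
#[derive(Debug, Default)]
pub struct Node {
    pub stable_log: Vec<Entry>,
    pub hard_state: HardState,
    pub sent: Vec<String>,
    pub applied_index: u64,
    pub steps: Vec<&'static str>,
}

impl Node {
    pub fn handle_ready(&mut self, rd: MiniReady) {
        // 1. Persist new entries and the hard state first.
        if !rd.entries.is_empty() {
            self.stable_log.extend(rd.entries);
            self.steps.push("persist_entries");
        }
        if let Some(hs) = rd.hs {
            self.hard_state = hs;
            self.steps.push("persist_hard_state");
        }
        // 2. Only then hand outbound messages to the transport.
        if !rd.messages.is_empty() {
            self.sent.extend(rd.messages);
            self.steps.push("send_messages");
        }
        // 3. Apply committed entries to the state machine.
        if !rd.committed_entries.is_empty() {
            for e in &rd.committed_entries {
                self.applied_index = e.index;
            }
            self.steps.push("apply_committed");
        }
    }
}
```

The `steps` vector exists only so that the order of operations is observable; a real application would write to durable storage and a network transport instead.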

Receiving a Proposal and Replicating It in Raft

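TiKV 3.0 introduced an Actor-like concurrency model. An Actor is treated as the basic unit of concurrent computation: when an Actor receives a message, it can make decisions, create more Actors, send more messages, and decide how to respond to the next message. Each Raft Peer on a TiKV node corresponds to two Actors, which we call PeerFsm and ApplyFsm. PeerFsm receives and handles Raft messages sent by other Raft Peers, while ApplyFsm applies committed log entries to the state machine.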

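The Actor System implemented in TiKV is called BatchSystem. It uses several Poll threads to pull a Batch of messages from multiple Mailboxes and then hands them to the individual Actors for execution. To guarantee linearizability, an Actor receives messages on only one Poll thread at a time and executes them sequentially. For reasons of space the implementation is not covered here; interested readers can find the code in raftstore/fsm/batch.rs.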
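The idea of pulling a bounded batch from each mailbox and letting the owning Actor process it sequentially can be illustrated with a toy, single-threaded model. This is a sketch under simplifying assumptions, not the BatchSystem implementation: `CounterFsm`, `poll_once`, and `max_batch` are hypothetical names, and the mailbox is simply a `std::sync::mpsc` channel.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// A toy actor: it owns a mailbox and handles its messages one at a time, in order.
pub struct CounterFsm {
    pub id: u64,
    mailbox: Receiver<u64>,
    pub handled: Vec<u64>,
}

impl CounterFsm {
    /// Creates an actor together with the sending half of its mailbox.
    pub fn new(id: u64) -> (Sender<u64>, CounterFsm) {
        let (tx, rx) = channel();
        let fsm = CounterFsm { id, mailbox: rx, handled: Vec::new() };
        (tx, fsm)
    }

    /// Processes a batch sequentially, mirroring the shape of `handle_msgs`.
    fn handle_msgs(&mut self, msgs: &mut Vec<u64>) {
        for m in msgs.drain(..) {
            self.handled.push(m);
        }
    }
}

/// One round of a poller: pull up to `max_batch` messages from every
/// mailbox, then let each actor process its own batch.
/// Returns the total number of messages handled in this round.
pub fn poll_once(fsms: &mut [CounterFsm], max_batch: usize) -> usize {
    let mut total = 0;
    let mut buf = Vec::with_capacity(max_batch);
    for fsm in fsms.iter_mut() {
        while buf.len() < max_batch {
            match fsm.mailbox.try_recv() {
                Ok(m) => buf.push(m),
                Err(_) => break, // mailbox empty, or every sender was dropped
            }
        }
        total += buf.len();
        fsm.handle_msgs(&mut buf); // drains `buf`, so it is empty for the next actor
    }
    total
}
```

Because each actor is visited by exactly one poller at a time, its messages are handled in arrival order without any locking inside the actor itself.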

As mentioned above, PeerFsm receives and handles Raft messages. The messages it receives are of type PeerMsg, and each variant is handled differently:

/// Message that can be sent to a peer.
pub enum PeerMsg {
    /// Raft message is the message sent between raft nodes in the same
    /// raft group. Messages need to be redirected to raftstore if target
    /// peer doesn't exist.
    RaftMessage(RaftMessage),
    /// Raft command is the command that is expected to be proposed by the
    /// leader of the target raft group. If it's failed to be sent, callback
    /// usually needs to be called before dropping in case of resource leak.
    RaftCommand(RaftCommand),
    /// Result of applying committed entries. The message can't be lost.
    ApplyRes { res: ApplyTaskRes },
    ...
}

...

impl PeerFsmDelegate {
    pub fn handle_msgs(&mut self, msgs: &mut Vec<PeerMsg>) {
        for m in msgs.drain(..) {
            match m {
                PeerMsg::RaftMessage(msg) => {
                    self.on_raft_message(msg);
                }
                PeerMsg::RaftCommand(cmd) => {
                    self.propose_raft_command(cmd.request, cmd.callback)
                }
                PeerMsg::ApplyRes { res } => {
                    self.on_apply_res(res);
                }
                ...
            }
        }
    }
}

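Only the message types relevant to this article are listed here:

RaftMessage: Raft messages sent by other Peers, including heartbeats, log entries, and vote messages.
RaftCommand: a proposal raised by the upper layer. It contains the operation to be replicated through Raft and the callback to invoke once the operation succeeds.
ApplyRes: the message ApplyFsm sends to PeerFsm after applying log entries to the state machine. It is used to update certain in-memory state once the operation is done.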

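We mainly care about how PeerFsm handles a Proposal, that is, how RaftCommand is processed. After entering PeerFsmDelegate::propose_raft_command, the code first calls PeerFsmDelegate::pre_propose_raft_command to run a series of checks: whether the peer ID, peer term, and region epoch match (the region epoch is the version of a region; operations such as region split, merge, and add / delete peer change it), and whether the peer is the leader. It then selects a Propose policy based on the request type, read or write (see Peer::inspect):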

let policy = self.inspect(&req);
let res = match policy {
    Ok(RequestPolicy::ReadIndex) => return self.read_index(ctx, req, err_resp, cb),
    Ok(RequestPolicy::ProposeNormal) => self.propose_normal(ctx, req),
    ...
};

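For a read request, we only need to confirm that the leader is still really the leader. A fairly lightweight way is to send one round of heartbeats and check whether more than half of the peers respond; raft-rs calls this ReadIndex (a separate article introduces ReadIndex in more detail). For a write request, a Raft log entry has to be proposed, which is done in the propose_normal function by calling the Raft::propose interface. After proposing a log entry, the Peer stores the proposal in a Vec named apply_proposals. The proposals within one Batch (which covers multiple Peers) are then collected by the Poll thread into a Vec named pending_proposals for later processing.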
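The heartbeat-based leadership check for reads can be sketched as a small bookkeeping structure. This is a minimal illustration under stated assumptions, not the raft-rs ReadIndex implementation: `PendingRead` and its methods are hypothetical names, peers are identified by plain `u64` IDs, and the leader is assumed to count itself toward the majority.

```rust
use std::collections::HashSet;

/// Tracks one outstanding read while the leader confirms its leadership.
pub struct PendingRead {
    /// Commit index recorded when the read request arrived.
    pub read_index: u64,
    /// Number of voters in the Raft group, leader included.
    voters: usize,
    /// Peers that have acknowledged the heartbeat round (the leader acks itself).
    acks: HashSet<u64>,
}

impl PendingRead {
    pub fn new(leader_id: u64, voters: usize, commit_index: u64) -> Self {
        let mut acks = HashSet::new();
        acks.insert(leader_id);
        PendingRead { read_index: commit_index, voters, acks }
    }

    /// Records a heartbeat response; duplicate responses are counted once.
    pub fn on_heartbeat_response(&mut self, from: u64) {
        self.acks.insert(from);
    }

    /// True once more than half of the voters have acknowledged.
    pub fn leadership_confirmed(&self) -> bool {
        self.acks.len() > self.voters / 2
    }

    /// The read may be served locally once leadership is confirmed and the
    /// state machine has applied at least up to `read_index`.
    pub fn can_serve(&self, applied_index: u64) -> bool {
        self.leadership_confirmed() && applied_index >= self.read_index
    }
}
```

No log entry is written on this path, which is why it is cheaper than proposing a normal write.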

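After all messages in a Batch have been processed by PeerFsmDelegate::handle_msgs, Poll calls Peer::handle_raft_ready_append on every Peer in the Batch:

It obtains a Ready using the recorded last_applied_index.
Once a Ready is available, PeerFsm calls PeerStorage::handle_raft_ready, as described earlier, to update the state (term, last log index, and so on) and the log.
These state updates are split into persistent state and in-memory state. Updates to persistent state are written into a WriteBatch, while updates to in-memory state are used to build an InvokeContext. Both are held temporarily by a PollContext.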

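We now have the state updates of every Peer in the Batch, along with the most recently raised proposals. The Poll thread then does the following:

Sends the Proposals to ApplyFsm for safekeeping, so that the Callback can be invoked to return a response once the Proposal has been written successfully.
Sends the outgoing messages collected from each Ready to the gRPC thread, which forwards them to other TiKV nodes.
Persists the pending state updates stored in the WriteBatch.
Updates the in-memory state of PeerFsm according to the InvokeContext.
Sends the committed log entries to ApplyFsm to be applied (see Peer::handle_raft_ready_apply).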

Confirming a Proposal in Raft

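The previous section described which interfaces the Leader of a Region calls to put a proposal into the Raft state machine after receiving it. At this point the proposal has been sent to ApplyFsm for safekeeping, but ApplyFsm cannot yet apply it or call the associated callback, because the proposal has not been acknowledged by a majority of the nodes in the Raft group. So how does the raftstore module on the Leader handle the Raft messages it receives from other replicas and complete the confirmation of the log?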

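The answer is in the PeerFsmDelegate::on_raft_message function. When a Peer receives a Raft message, it enters this function, which internally calls Raft::step to update the in-memory state of the Raft state machine. Afterwards, RawNode::ready is called to obtain committed_entries, which are eventually sent to ApplyFsm as an ApplyMsg::Apply task. ApplyFsm executes the commands and, if the proposal was issued by this node, also calls the callback (stashed earlier in ApplyFsm through an ApplyMsg::Proposal task) to return a response to the client.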
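What "acknowledged by a majority" means for the commit index can be shown with a small calculation. The sketch below is a standalone illustration, not code from raft-rs: `quorum_commit_index` is a hypothetical helper, and it deliberately omits the additional Raft rule that a leader only advances its commit index by counting replicas for entries from its own current term.

```rust
/// Returns the highest log index that is stored on a majority of voters.
///
/// `matched` holds, for every voter (the leader included), the highest
/// log index known to be replicated on that voter.
pub fn quorum_commit_index(matched: &[u64]) -> u64 {
    if matched.is_empty() {
        return 0;
    }
    let mut sorted = matched.to_vec();
    // Sort in descending order so that position `quorum - 1` is the largest
    // index that at least `quorum` voters have reached.
    sorted.sort_unstable_by(|a, b| b.cmp(a));
    let quorum = matched.len() / 2 + 1;
    sorted[quorum - 1]
}
```

For example, with three voters whose matched indexes are 5, 3, and 4, two of the three have reached index 4, so entries up to index 4 can be treated as committed and handed over as committed_entries.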

Applying a Proposal

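As mentioned in the previous section, PeerFsm sends both Proposals and committed log entries to the corresponding ApplyFsm. The message types are ApplyMsg::Proposal and ApplyMsg::Apply respectively. The following describes how ApplyFsm handles these two message types.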

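Handling ApplyMsg::Proposal is very simple (see ApplyFsm::handle_proposal): ApplyFsm puts the Proposal into ApplyDelegate::pending_cmds for safekeeping, and later, when the corresponding log entry is applied, looks up the matching Callback there and invokes it.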

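ApplyMsg::Apply contains the log entries that actually need to be applied. For these entries, ApplyFsm does the following (see ApplyFsm::handle_apply):

Modifies the in-memory state and persists the changed state (last applied index and so on) and the data.
Invokes the Callback associated with the Proposal to return a response.
Sends an ApplyRes to PeerFsm, containing state such as applied_term and applied_index (used to update the in-memory state of PeerFsm).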
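The three steps above, together with the stashing of Proposals, can be sketched with a toy key-value state machine. This is a simplified illustration under stated assumptions, not TiKV's ApplyFsm: the types below reuse names from the text for readability but are hypothetical stand-ins, the callback is modeled as an `mpsc::Sender<u64>` that receives the applied index, entries are plain key/value puts, and persistence is omitted.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::mpsc::Sender;

/// A committed log entry carrying a single put (hypothetical shape).
pub struct Entry {
    pub index: u64,
    pub term: u64,
    pub key: String,
    pub value: String,
}

/// A locally proposed command waiting for its entry to be applied.
pub struct PendingCmd {
    pub index: u64,
    pub term: u64,
    /// Stand-in for the callback: receives the applied index.
    pub cb: Sender<u64>,
}

#[derive(Debug, PartialEq)]
pub struct ApplyRes {
    pub applied_index: u64,
    pub applied_term: u64,
}

#[derive(Default)]
pub struct ApplyDelegate {
    pub kv: HashMap<String, String>,
    pub pending_cmds: VecDeque<PendingCmd>,
    pub applied_index: u64,
    pub applied_term: u64,
}

impl ApplyDelegate {
    /// Stashes a proposal until its log entry is applied.
    pub fn handle_proposal(&mut self, cmd: PendingCmd) {
        self.pending_cmds.push_back(cmd);
    }

    /// Applies committed entries in order and reports the new applied state.
    pub fn handle_apply(&mut self, entries: Vec<Entry>) -> ApplyRes {
        for e in entries {
            // 1. Modify the state machine (a real store would also persist this).
            self.kv.insert(e.key, e.value);
            self.applied_index = e.index;
            self.applied_term = e.term;
            // 2. If the entry was proposed locally, answer its callback.
            let is_local = self
                .pending_cmds
                .front()
                .map_or(false, |c| c.index == e.index && c.term == e.term);
            if is_local {
                let cmd = self.pending_cmds.pop_front().unwrap();
                let _ = cmd.cb.send(e.index); // ignore a receiver that went away
            }
        }
        // 3. The caller forwards this result to the peer actor.
        ApplyRes { applied_index: self.applied_index, applied_term: self.applied_term }
    }
}
```

Entries that were proposed on another node have no stashed command, so they are applied without any callback being invoked.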

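There is one special case here: the so-called "empty log entry". In raft-rs, when a new Leader is elected it broadcasts an empty entry in order to commit the entries from previous terms (see the Raft paper for details). At this point some proposals made in previous terms may still be pending; because a new Leader has emerged, these proposals can never be confirmed, so they must be cleaned up. Otherwise their callbacks would never be called and some resources could not be released. For the cleanup logic, see ApplyFsm::handle_entries_normal.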
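The cleanup of proposals left over from earlier terms can be sketched as follows. This is an illustrative approximation, not the logic in ApplyFsm::handle_entries_normal: `clear_stale`, `CmdResult`, and `PendingCmd` are hypothetical, and the sketch assumes pending commands are queued in proposal order so that their terms never decrease from front to back.

```rust
use std::collections::VecDeque;

/// Outcome reported to a proposal's callback (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum CmdResult {
    Applied,
    Stale,
}

/// A proposal waiting for its log entry, with the callback to answer it.
pub struct PendingCmd {
    pub index: u64,
    pub term: u64,
    pub cb: Box<dyn FnOnce(CmdResult)>,
}

/// Called when an empty entry of `current_term` is applied: every pending
/// command proposed in an earlier term is answered with `Stale` and dropped.
/// Returns how many commands were cleared.
pub fn clear_stale(pending: &mut VecDeque<PendingCmd>, current_term: u64) -> usize {
    let mut cleared = 0;
    // Assumes the queue is ordered by proposal time, so stale commands
    // (older terms) always sit at the front.
    while pending.front().map_or(false, |c| c.term < current_term) {
        let cmd = pending.pop_front().unwrap();
        (cmd.cb)(CmdResult::Stale);
        cleared += 1;
    }
    cleared
}
```

Answering the callback with an explicit stale result, rather than silently dropping it, lets the caller release whatever it was holding and decide whether to retry.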

Summary

The following flowchart summarizes the overall process by which TiKV handles a Proposal:

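In short, TiKV uses two thread pools to process Proposals and splits each Raft Peer into two parts: PeerFsm and ApplyFsm. While processing a Proposal, PeerFsm first takes in the log and drives the internal Raft state machine; ApplyFsm then modifies the data state machine (region information and user data) according to the committed log entries.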

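Because this part of the code has to handle many corner cases, its logic is fairly complex. Interested readers are encouraged to look to the source code for more details.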
