PostgreSQL Source Code Walkthrough (115) - Background Processes #3 (checkpointer process #2)

This section briefly introduces PostgreSQL's checkpointer background process and focuses on the implementation logic of the CreateCheckPoint function.

1. Data Structures

CheckPoint
The struct that forms the body of a CheckPoint XLOG record.

/*
 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
 */
typedef struct CheckPoint
{
    XLogRecPtr  redo;           /* next RecPtr available when we began to
                                 * create CheckPoint (i.e. REDO start point) */
    TimeLineID  ThisTimeLineID; /* current TLI */
    TimeLineID  PrevTimeLineID; /* previous TLI, if this record begins a new
                                 * timeline (equals ThisTimeLineID otherwise) */
    bool        fullPageWrites; /* current full_page_writes */
    uint32      nextXidEpoch;   /* higher-order bits of nextXid */
    TransactionId nextXid;      /* next free XID */
    Oid         nextOid;        /* next free OID */
    MultiXactId nextMulti;      /* next free MultiXactId */
    MultiXactOffset nextMultiOffset;    /* next free MultiXact offset */
    TransactionId oldestXid;    /* cluster-wide minimum datfrozenxid */
    Oid         oldestXidDB;    /* database with minimum datfrozenxid */
    MultiXactId oldestMulti;    /* cluster-wide minimum datminmxid */
    Oid         oldestMultiDB;  /* database with minimum datminmxid */
    pg_time_t   time;           /* time stamp of checkpoint */
    TransactionId oldestCommitTsXid;    /* oldest Xid with valid commit
                                         * timestamp */
    TransactionId newestCommitTsXid;    /* newest Xid with valid commit
                                         * timestamp */

    /*
     * Oldest XID still running. This is only needed to initialize hot standby
     * mode from an online checkpoint, so we only bother calculating this for
     * online checkpoints and only when wal_level is replica. Otherwise it's
     * set to InvalidTransactionId.
     */
    TransactionId oldestActiveXid;
} CheckPoint;
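
The nextXidEpoch and nextXid fields together describe a 64-bit transaction counter: the epoch holds the high-order bits and is bumped whenever the 32-bit XID wraps around between two checkpoints. The following is a minimal sketch of that rule, not PostgreSQL code; the helper names advance_epoch and full_xid are hypothetical, and TransactionId is assumed to be a plain 32-bit unsigned integer.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's 32-bit TransactionId. */
typedef uint32_t TransactionId;

/*
 * Epoch rule applied when a new checkpoint record is filled in: if the next
 * XID is numerically smaller than the one stored by the previous checkpoint,
 * the 32-bit counter has wrapped around, so the epoch advances by one.
 */
static uint32_t
advance_epoch(uint32_t prev_epoch, TransactionId prev_next_xid,
              TransactionId next_xid)
{
    /* In this sketch the epoch itself is not expected to overflow. */
    assert(prev_epoch < UINT32_MAX);
    return (next_xid < prev_next_xid) ? prev_epoch + 1 : prev_epoch;
}

/* Combine epoch (high 32 bits) and XID (low 32 bits) into one 64-bit value. */
static uint64_t
full_xid(uint32_t epoch, TransactionId xid)
{
    return ((uint64_t) epoch << 32) | xid;
}
```

For example, if the previous checkpoint stored epoch 3 with a next XID close to four billion and the current next XID is 100, advance_epoch returns 4.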

/* XLOG info values for XLOG rmgr */
#define XLOG_CHECKPOINT_SHUTDOWN        0x00
#define XLOG_CHECKPOINT_ONLINE          0x10
#define XLOG_NOOP                       0x20
#define XLOG_NEXTOID                    0x30
#define XLOG_SWITCH                     0x40
#define XLOG_BACKUP_END                 0x50
#define XLOG_PARAMETER_CHANGE           0x60
#define XLOG_RESTORE_POINT              0x70
#define XLOG_FPW_CHANGE                 0x80
#define XLOG_END_OF_RECOVERY            0x90
#define XLOG_FPI_FOR_HINT               0xA0
#define XLOG_FPI                        0xB0
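
All of the codes above occupy the high four bits of a record's info byte. The sketch below shows one way such a byte could be classified; it assumes that the low four bits are reserved for generic flags (PostgreSQL masks them with XLR_INFO_MASK), and the helper names xlog_rmgr_info, xlog_info_name and is_checkpoint_info are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the low four bits of the info byte carry generic flags. */
#define GENERIC_INFO_MASK 0x0F

/* Names indexed by (info >> 4), in the order of the #define list above. */
static const char *const xlog_info_names[] = {
    "XLOG_CHECKPOINT_SHUTDOWN", "XLOG_CHECKPOINT_ONLINE", "XLOG_NOOP",
    "XLOG_NEXTOID", "XLOG_SWITCH", "XLOG_BACKUP_END",
    "XLOG_PARAMETER_CHANGE", "XLOG_RESTORE_POINT", "XLOG_FPW_CHANGE",
    "XLOG_END_OF_RECOVERY", "XLOG_FPI_FOR_HINT", "XLOG_FPI",
};

/* Strip the generic flag bits, leaving only the XLOG rmgr code. */
static uint8_t
xlog_rmgr_info(uint8_t info)
{
    return (uint8_t) (info & ~GENERIC_INFO_MASK);
}

/* Map an info byte to the symbolic name of its code, or "UNKNOWN". */
static const char *
xlog_info_name(uint8_t info)
{
    size_t      count = sizeof(xlog_info_names) / sizeof(xlog_info_names[0]);
    size_t      idx = (size_t) (xlog_rmgr_info(info) >> 4);

    assert(count == 12);        /* the table must cover 0x00 .. 0xB0 */
    return (idx < count) ? xlog_info_names[idx] : "UNKNOWN";
}

/* True for both the shutdown (0x00) and the online (0x10) checkpoint codes. */
static bool
is_checkpoint_info(uint8_t info)
{
    uint8_t     code = xlog_rmgr_info(info);

    return code == 0x00 || code == 0x10;
}
```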

CheckpointerShmem
The shared-memory structure used for communication between the checkpointer and the backend processes.

/*----------
 * Shared memory area for communication between checkpointer and backends
 *
 * The ckpt counters allow backends to watch for completion of a checkpoint
 * request they send.  Here's how it works:
 *  * At start of a checkpoint, checkpointer reads (and clears) the request
 *    flags and increments ckpt_started, while holding ckpt_lck.
 *  * On completion of a checkpoint, checkpointer sets ckpt_done to
 *    equal ckpt_started.
 *  * On failure of a checkpoint, checkpointer increments ckpt_failed
 *    and sets ckpt_done to equal ckpt_started.
 *
 * The algorithm for backends is:
 *  1. Record current values of ckpt_failed and ckpt_started, and
 *     set request flags, while holding ckpt_lck.
 *  2. Send signal to request checkpoint.
 *  3. Sleep until ckpt_started changes.  Now you know a checkpoint has
 *     begun since you started this algorithm (although *not* that it was
 *     specifically initiated by your signal), and that it is using your flags.
 *  4. Record new value of ckpt_started.
 *  5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
 *     arithmetic here in case counters wrap around.)  Now you know a
 *     checkpoint has started and completed, but not whether it was
 *     successful.
 *  6. If ckpt_failed is different from the originally saved value,
 *     assume request failed; otherwise it was definitely successful.
 *
 * ckpt_flags holds the OR of the checkpoint request flags sent by all
 * requesting backends since the last checkpoint start.  The flags are
 * chosen so that OR'ing is the correct way to combine multiple requests.
 * 
 * num_backend_writes is used to count the number of buffer writes performed
 * by user backend processes.  This counter should be wide enough that it
 * can't overflow during a single processing cycle.  num_backend_fsync
 * counts the subset of those writes that also had to do their own fsync,
 * because the checkpointer failed to absorb their request.
 *
 * The requests array holds fsync requests sent by backends and not yet
 * absorbed by the checkpointer.
 *
 * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
 * the requests fields are protected by CheckpointerCommLock.
 * 
 *----------
 */
typedef struct
{
    RelFileNode rnode;          /* tablespace / database / relation identifiers */
    ForkNumber  forknum;        /* fork number */
    BlockNumber segno;          /* see md.c for special values */
    /* might add a real request-type field later; not needed yet */
} CheckpointerRequest;

typedef struct
{
    pid_t       checkpointer_pid;   /* PID (0 if not started) */

    slock_t     ckpt_lck;       /* protects all the ckpt_* fields */

    int         ckpt_started;   /* advances when checkpoint starts */
    int         ckpt_done;      /* advances when checkpoint done */
    int         ckpt_failed;    /* advances when checkpoint fails */

    int         ckpt_flags;     /* checkpoint flags, as defined in xlog.h */

    uint32      num_backend_writes; /* counts user backend buffer writes */
    uint32      num_backend_fsync;  /* counts user backend fsync calls */

    int         num_requests;   /* current # of requests */
    int         max_requests;   /* allocated array size */
    CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
/* file-level static pointer to the shared CheckpointerShmemStruct */
static CheckpointerShmemStruct *CheckpointerShmem;
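
Steps 5 and 6 of the backend algorithm described in the comment above depend on comparing counters that may wrap around. The sketch below illustrates that comparison in isolation. It is not the PostgreSQL implementation: the function names are hypothetical, and the counters are modeled as uint32_t (the real fields are int) so that the wraparound arithmetic is well defined in standard C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Step 5: "ckpt_done >= saved ckpt_started", evaluated with modulo
 * arithmetic so that it stays correct after the counters wrap around.
 * Unsigned subtraction wraps by definition; a difference in the lower half
 * of the 32-bit range means "done" has caught up with "started".
 */
static bool
ckpt_done_reached(uint32_t done, uint32_t saved_started)
{
    return (uint32_t) (done - saved_started) < UINT32_C(0x80000000);
}

/*
 * Step 6: the request is assumed to have failed if ckpt_failed changed
 * between the moment the request was made and the moment it completed.
 */
static bool
ckpt_request_failed(uint32_t failed_at_request, uint32_t failed_now)
{
    return failed_now != failed_at_request;
}
```

A backend would save ckpt_started after it changes (step 4), sleep until ckpt_done_reached returns true for the current ckpt_done, and then call ckpt_request_failed with the ckpt_failed value it recorded in step 1.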

VirtualTransactionId
Top-level transactions are identified by VirtualTransactionIds.

/*
 * Top-level transactions are identified by VirtualTransactionIDs comprising
 * the BackendId of the backend running the xact, plus a locally-assigned
 * LocalTransactionId.  These are guaranteed unique over the short term,
 * but will be reused after a database restart; hence they should never
 * be stored on disk.
 *
 * Note that struct VirtualTransactionId can not be assumed to be atomically
 * assignable as a whole.  However, type LocalTransactionId is assumed to
 * be atomically assignable, and the backend ID doesn't change often enough
 * to be a problem, so we can fetch or assign the two fields separately.
 * We deliberately refrain from using the struct within PGPROC, to prevent
 * coding errors from trying to use struct assignment with it; instead use
 * GET_VXID_FROM_PGPROC().
 */
typedef struct
{
    BackendId   backendId;      /* determined at backend startup */
    LocalTransactionId localTransactionId;  /* backend-local transaction id */
} VirtualTransactionId;
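
The comment above advises against whole-struct assignment and suggests handling the two fields separately. The sketch below illustrates that style on a stand-alone copy of the struct. The typedefs, the invalid-value constants and the helpers vxid_copy, vxid_is_valid and vxid_equals are assumptions made for illustration, loosely modeled on the macros in PostgreSQL's lock.h; the exact validity rule differs between PostgreSQL versions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for the PostgreSQL types and constants. */
typedef int BackendId;
typedef uint32_t LocalTransactionId;

#define InvalidBackendId            (-1)
#define InvalidLocalTransactionId   0

typedef struct
{
    BackendId   backendId;                  /* determined at backend startup */
    LocalTransactionId localTransactionId;  /* backend-local transaction id */
} VirtualTransactionId;

/* Copy the two fields one at a time instead of assigning the struct. */
static void
vxid_copy(VirtualTransactionId *dst, const VirtualTransactionId *src)
{
    assert(dst != NULL && src != NULL);
    dst->backendId = src->backendId;
    dst->localTransactionId = src->localTransactionId;
}

/* Assumed rule: both parts must hold a non-invalid value. */
static bool
vxid_is_valid(VirtualTransactionId v)
{
    return v.backendId != InvalidBackendId &&
        v.localTransactionId != InvalidLocalTransactionId;
}

/* Two virtual XIDs are equal only if both fields match. */
static bool
vxid_equals(VirtualTransactionId a, VirtualTransactionId b)
{
    return a.backendId == b.backendId &&
        a.localTransactionId == b.localTransactionId;
}
```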

2. Source Code Walkthrough

The CreateCheckPoint function performs a checkpoint, either during shutdown or on the fly while the server is running.


/*
 * Perform a checkpoint --- either during shutdown, or on-the-fly
 *
 * flags is a bitwise OR of the following:
 *  CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
 *  CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
 *  CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
 *      ignoring checkpoint_completion_target parameter.
 *  CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
 *      since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
 *      CHECKPOINT_END_OF_RECOVERY).
 *  CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
 *
 * Note: flags contains other bits, of interest here only for logging purposes.
 * In particular note that this routine is synchronous and does not pay
 * attention to CHECKPOINT_WAIT.
 *
 * If !shutdown then we are writing an online checkpoint. This is a very special
 * kind of operation and WAL record because the checkpoint action occurs over
 * a period of time yet logically occurs at just a single LSN. The logical
 * position of the WAL record (redo ptr) is the same or earlier than the
 * physical position. When we replay WAL we locate the checkpoint via its
 * physical position then read the redo ptr and actually start replay at the
 * earlier logical position. Note that we don't write *anything* to WAL at
 * the logical position, so that location could be any other kind of WAL record.
 * All of this mechanism allows us to continue working while we checkpoint.
 * As a result, timing of actions is critical here and be careful to note that
 * this function will likely take minutes to execute on a busy system.
 */
void
CreateCheckPoint(int flags)
{
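    bool        shutdown;           /* shutdown or end-of-recovery checkpoint? */
    CheckPoint  checkPoint;         /* body of the checkpoint record being built */
    XLogRecPtr  recptr;             /* end position of the inserted checkpoint record */
    XLogSegNo   _logSegNo;          /* WAL segment number (uint64) */
    XLogCtlInsert *Insert = &XLogCtl->Insert;   /* shared WAL insertion state */
    uint32      freespace;          /* free space left on the current WAL page */
    XLogRecPtr  PriorRedoPtr;       /* REDO pointer of the previous checkpoint */
    XLogRecPtr  curInsert;          /* current WAL insert position */
    XLogRecPtr  last_important_lsn; /* LSN of the last important WAL record */
    VirtualTransactionId *vxids;    /* virtual XIDs that are delaying the checkpoint */
    int         nvxids;             /* number of entries in vxids */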

    /*
     * An end-of-recovery checkpoint is really a shutdown checkpoint, just
     * issued at a different time.
     */
    if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
        shutdown = true;
    else
        shutdown = false;

    /* sanity check */
    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
        elog(ERROR, "can't create a checkpoint during recovery");

    /*
     * Initialize InitXLogInsert working areas before entering the critical
     * section.  Normally, this is done by the first call to
     * RecoveryInProgress() or LocalSetXLogInsertAllowed(), but when creating
     * an end-of-recovery checkpoint, the LocalSetXLogInsertAllowed call is
     * done below in a critical section, and InitXLogInsert cannot be called
     * in a critical section.
     */
    InitXLogInsert();

    /*
     * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
     * (This is just pro forma, since in the present system structure there is
     * only one process that is allowed to issue checkpoints at any given
     * time.)
     */
    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

    /*
     * Prepare to accumulate statistics.
     *
     * Note: because it is possible for log_checkpoints to change while a
     * checkpoint proceeds, we always accumulate stats, even if
     * log_checkpoints is currently off.
     */
    MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
    CheckpointStats.ckpt_start_t = GetCurrentTimestamp();

    /*
     * Use a critical section to force system panic if we have trouble.
     */
    START_CRIT_SECTION();

    if (shutdown)
    {
        /* shutdown checkpoint: record the "shutting down" state in pg_control */
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        ControlFile->state = DB_SHUTDOWNING;
        ControlFile->time = (pg_time_t) time(NULL);
        UpdateControlFile();
        LWLockRelease(ControlFileLock);
    }

    /*
     * Let smgr prepare for checkpoint; this has to happen before we determine
     * the REDO pointer.  Note that smgr must not do anything that'd have to
     * be undone if we decide no checkpoint is needed.
     */
    smgrpreckpt();

    /* Begin filling in the checkpoint WAL record */
    MemSet(&checkPoint, 0, sizeof(checkPoint));
    checkPoint.time = (pg_time_t) time(NULL);

    /*
     * For Hot Standby, derive the oldestActiveXid before we fix the redo
     * pointer. This allows us to begin accumulating changes to assemble our
     * starting snapshot of locks and transactions.
     */
    if (!shutdown && XLogStandbyInfoActive())
        checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
    else
        checkPoint.oldestActiveXid = InvalidTransactionId;

    /*
     * Get location of last important record before acquiring insert locks (as
     * GetLastImportantRecPtr() also locks WAL locks).
     */
    last_important_lsn = GetLastImportantRecPtr();

    /*
     * We must block concurrent insertions while examining insert state to
     * determine the checkpoint REDO pointer.
     */
    WALInsertLockAcquireExclusive();
    curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);

    /*
     * If this isn't a shutdown or forced checkpoint, and if there has been no
     * WAL activity requiring a checkpoint, skip it.  The idea here is to
     * avoid inserting duplicate checkpoints when the system is idle.
     */
    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        if (last_important_lsn == ControlFile->checkPoint)
        {
            WALInsertLockRelease();
            LWLockRelease(CheckpointLock);
            END_CRIT_SECTION();
            ereport(DEBUG1,
                    (errmsg("checkpoint skipped because system is idle")));
            return;
        }
    }

    /*
     * An end-of-recovery checkpoint is created before anyone is allowed to
     * write WAL. To allow us to write the checkpoint record, temporarily
     * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
     * initialized, which we need here and in AdvanceXLInsertBuffer.)
     */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        LocalSetXLogInsertAllowed();

    checkPoint.ThisTimeLineID = ThisTimeLineID;
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
    else
        checkPoint.PrevTimeLineID = ThisTimeLineID;

    checkPoint.fullPageWrites = Insert->fullPageWrites;

    /*
     * Compute new REDO record ptr = location of next XLOG record.
     *
     * NB: this is NOT necessarily where the checkpoint record itself will be,
     * since other backends may insert more XLOG records while we're off doing
     * the buffer flush work.  Those XLOG records are logically after the
     * checkpoint, even though physically before it.  Got that?
     */
    freespace = INSERT_FREESPACE(curInsert);    /* bytes left on the current WAL page */
    if (freespace == 0)
    {
        /* page is full: REDO starts just past the header of the next page */
        if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
            curInsert += SizeOfXLogLongPHD;     /* first page of a segment: long header */
        else
            curInsert += SizeOfXLogShortPHD;    /* any other page: short header */
    }
    checkPoint.redo = curInsert;

    /*
     * Here we update the shared RedoRecPtr for future XLogInsert calls; this
     * must be done while holding all the insertion locks.
     *
     * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
     * pointing past where it really needs to point.  This is okay; the only
     * consequence is that XLogInsert might back up whole buffers that it
     * didn't really need to.  We can't postpone advancing RedoRecPtr because
     * XLogInserts that happen while we are dumping buffers must assume that
     * their buffer changes are not included in the checkpoint.
     */
    RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

    /*
     * Now we can release the WAL insertion locks, allowing other xacts to
     * proceed while we are flushing disk buffers.
     */
    WALInsertLockRelease();

    /* Update the info_lck-protected copy of RedoRecPtr as well */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->RedoRecPtr = checkPoint.redo;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * If enabled, log checkpoint start.  We postpone this until now so as not
     * to log anything if we decided to skip the checkpoint.
     */
    if (log_checkpoints)
        LogCheckpointStart(flags, false);

    TRACE_POSTGRESQL_CHECKPOINT_START(flags);

    /*
     * Get the other info we need for the checkpoint record.
     *
     * We don't need to save oldestClogXid in the checkpoint, it only matters
     * for the short period in which clog is being truncated, and if we crash
     * during that we'll redo the clog truncation and fix up oldestClogXid
     * there.
     */
    LWLockAcquire(XidGenLock, LW_SHARED);
    checkPoint.nextXid = ShmemVariableCache->nextXid;
    checkPoint.oldestXid = ShmemVariableCache->oldestXid;
    checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
    LWLockRelease(XidGenLock);

    LWLockAcquire(CommitTsLock, LW_SHARED);
    checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
    checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
    LWLockRelease(CommitTsLock);

    /* Increase XID epoch if we've wrapped around since last checkpoint */
    checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
    if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
        checkPoint.nextXidEpoch++;

    LWLockAcquire(OidGenLock, LW_SHARED);
    checkPoint.nextOid = ShmemVariableCache->nextOid;
    if (!shutdown)
        checkPoint.nextOid += ShmemVariableCache->oidCount;
    LWLockRelease(OidGenLock);

    MultiXactGetCheckptMulti(shutdown,
                             &checkPoint.nextMulti,
                             &checkPoint.nextMultiOffset,
                             &checkPoint.oldestMulti,
                             &checkPoint.oldestMultiDB);

    /*
     * Having constructed the checkpoint record, ensure all shmem disk buffers
     * and commit-log buffers are flushed to disk.
     *
     * This I/O could fail for various reasons.  If so, we will fail to
     * complete the checkpoint, but there is no reason to force a system
     * panic. Accordingly, exit critical section while doing it.
     */
    END_CRIT_SECTION();

    /*
     * In some cases there are groups of actions that must all occur on one
     * side or the other of a checkpoint record. Before flushing the
     * checkpoint record we must explicitly wait for any backend currently
     * performing those groups of actions.
     *
     * One example is end of transaction, so we must wait for any transactions
     * that are currently in commit critical sections.  If an xact inserted
     * its commit record into XLOG just before the REDO point, then a crash
     * restart from the REDO point would not replay that record, which means
     * that our flushing had better include the xact's update of pg_xact.  So
     * we wait till he's out of his commit critical section before proceeding.
     * See notes in RecordTransactionCommit().
     *
     * Because we've already released the insertion locks, this test is a bit
     * fuzzy: it is possible that we will wait for xacts we didn't really need
     * to wait for.  But the delay should be short and it seems better to make
     * checkpoint take a bit longer than to hold off insertions longer than
     * necessary. (In fact, the whole reason we have this issue is that xact.c
     * does commit record XLOG insertion and clog update as two separate steps
     * protected by different locks, but again that seems best on grounds of
     * minimizing lock contention.)
     *
     * A transaction that has not yet set delayChkpt when we look cannot be at
     * risk, since he's not inserted his commit record yet; and one that's
     * already cleared it is not at risk either, since he's done fixing clog
     * and we will correctly flush the update below.  So we cannot miss any
     * xacts we need to wait for.
     */
    /* collect the virtual XIDs of transactions that currently have delayChkpt set */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
    if (nvxids > 0)
    {
        do
        {
            pg_usleep(10000L);  /* wait for 10 msec */
        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
    }
    pfree(vxids);

    /* flush all shared-memory data to disk and fsync it */
    CheckPointGuts(checkPoint.redo, flags);

    /*
     * Take a snapshot of running transactions and write this to WAL. This
     * allows us to reconstruct the state of running transactions during
     * archive recovery, if required. Skip, if this info disabled.
     *
     * If we are shutting down, or Startup process is completing crash
     * recovery we don't need to write running xact data.
     */
    if (!shutdown && XLogStandbyInfoActive())
        LogStandbySnapshot();

    START_CRIT_SECTION();

    /*
     * Now insert the checkpoint record into XLOG.
     */
    XLogBeginInsert();
    XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
    recptr = XLogInsert(RM_XLOG_ID,
                        shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
                        XLOG_CHECKPOINT_ONLINE);

    XLogFlush(recptr);          /* make the checkpoint record durable */

    /*
     * We mustn't write any new WAL after a shutdown checkpoint, or it will be
     * overwritten at next startup.  No-one should even try, this just allows
     * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
     * to just temporarily disable writing until the system has exited
     * recovery.
     */
    if (shutdown)
    {
        /* shutdown or end-of-recovery checkpoint */
        if (flags & CHECKPOINT_END_OF_RECOVERY)
            LocalXLogInsertAllowed = -1;    /* return to "check" state */
        else
            LocalXLogInsertAllowed = 0; /* never again write WAL */
    }

    /*
     * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
     * = end of actual checkpoint record.
     */
    if (shutdown && checkPoint.redo != ProcLastRecPtr)
        ereport(PANIC,
                (errmsg("concurrent write-ahead log activity while database system is shutting down")));

    /*
     * Remember the prior checkpoint's redo ptr for
     * UpdateCheckPointDistanceEstimate()
     */
    PriorRedoPtr = ControlFile->checkPointCopy.redo;

    /*
     * Update the control file.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (shutdown)
        ControlFile->state = DB_SHUTDOWNED;
    ControlFile->checkPoint = ProcLastRecPtr;
    ControlFile->checkPointCopy = checkPoint;
    ControlFile->time = (pg_time_t) time(NULL);
    /* crash recovery should always recover to the end of WAL */
    ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
    ControlFile->minRecoveryPointTLI = 0;

    /*
     * Persist unloggedLSN value. It's reset on crash recovery, so this goes
     * unused on non-shutdown checkpoints, but seems useful to store it always
     * for debugging purposes.
     */
    SpinLockAcquire(&XLogCtl->ulsn_lck);
    ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
    SpinLockRelease(&XLogCtl->ulsn_lck);

    UpdateControlFile();
    LWLockRelease(ControlFileLock);

    /* Update shared-memory copy of checkpoint XID/epoch */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
    XLogCtl->ckptXid = checkPoint.nextXid;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * We are now done with critical updates; no need for system panic if we
     * have trouble while fooling with old log segments.
     */
    END_CRIT_SECTION();

    /*
     * Let smgr do post-checkpoint cleanup (eg, deleting old files).
     */
    smgrpostckpt();

    /*
     * Update the average distance between checkpoints if the prior checkpoint
     * exists.
     */
    if (PriorRedoPtr != InvalidXLogRecPtr)
        UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);

    /*
     * Delete old log files, those no longer needed for last checkpoint to
     * prevent the disk holding the xlog from growing full.
     */
    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

    /*
     * Make more log segments if needed.  (Do this after recycling old log
     * segments, since that may supply some of the needed files.)
     */
    if (!shutdown)
        PreallocXlogFiles(recptr);

    /*
     * Truncate pg_subtrans if possible.  We can throw away all data before
     * the oldest XMIN of any running transaction.  No future transaction will
     * attempt to reference any pg_subtrans entry older than that (see Asserts
     * in subtrans.c).  During recovery, though, we mustn't do this because
     * StartupSUBTRANS hasn't been called yet.
     */
    if (!RecoveryInProgress())
        TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));

    /* Real work is done, but log and update stats before releasing lock. */
    LogCheckpointEnd(false);

    TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                     NBuffers,
                                     CheckpointStats.ckpt_segs_added,
                                     CheckpointStats.ckpt_segs_removed,
                                     CheckpointStats.ckpt_segs_recycled);
    /* release the checkpoint lock */
    LWLockRelease(CheckpointLock);
}

/*
 * Flush all data in shared memory to disk, and fsync
 *
 * This is the common code shared between regular checkpoints and
 * recovery restartpoints.
 */
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
    CheckPointCLOG();
    CheckPointCommitTs();
    CheckPointSUBTRANS();
    CheckPointMultiXact();
    CheckPointPredicate();
    CheckPointRelationMap();
    CheckPointReplicationSlots();
    CheckPointSnapBuild();
    CheckPointLogicalRewriteHeap();
    CheckPointBuffers(flags);   /* performs all required fsyncs */
    CheckPointReplicationOrigin();
    /* We deliberately delay 2PC checkpointing as long as possible */
    CheckPointTwoPhase(checkPointRedo);
}

3. Trace Analysis

Update some data, then run a checkpoint.

testdb=# update t_wal_ckpt set c2 = 'C2_'||substr(c2,4,40);
UPDATE 1
testdb=# checkpoint;

Start gdb on the checkpointer process, configure signal handling, set a breakpoint, and enter CreateCheckPoint.

(gdb) handle SIGINT print nostop pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop  Print Pass to program Description
SIGINT        No  Yes Yes   Interrupt
(gdb) 
(gdb) b CreateCheckPoint
Breakpoint 1 at 0x55b4fb: file xlog.c, line 8668.
(gdb) c
Continuing.

Program received signal SIGINT, Interrupt.

Breakpoint 1, CreateCheckPoint (flags=44) at xlog.c:8668
8668        XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb)

Get the XLOG insertion control structure (XLogCtlInsert).

8668        XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb) n
8680        if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
(gdb) p XLogCtl
$1 = (XLogCtlData *) 0x7fadf8f6fa80
(gdb) p *XLogCtl
$2 = {Insert = {insertpos_lck = 0 '\000', CurrBytePos = 5505269968, PrevBytePos = 5505269928, 
    pad = '\000' <repeats 127 times>, RedoRecPtr = 5521450856, forcePageWrites = false, fullPageWrites = true, 
    exclusiveBackupState = EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups = 0, lastBackupStart = 0, 
    WALInsertLocks = 0x7fadf8f74100}, LogwrtRqst = {Write = 5521451392, Flush = 5521451392}, RedoRecPtr = 5521450856, 
  ckptXidEpoch = 0, ckptXid = 2307, asyncXactLSN = 5521363848, replicationSlotMinLSN = 0, lastRemovedSegNo = 0, 
  unloggedLSN = 1, ulsn_lck = 0 '\000', lastSegSwitchTime = 1546915130, lastSegSwitchLSN = 5521363360, LogwrtResult = {
    Write = 5521451392, Flush = 5521451392}, InitializedUpTo = 5538226176, pages = 0x7fadf8f76000 "\230\320\006", 
  xlblocks = 0x7fadf8f70088, XLogCacheBlck = 2047, ThisTimeLineID = 1, PrevTimeLineID = 1, 
  archiveCleanupCommand = '\000' <repeats 1023 times>, SharedRecoveryInProgress = false, SharedHotStandbyActive = false, 
  WalWriterSleeping = true, recoveryWakeupLatch = {is_set = 0, is_shared = true, owner_pid = 0}, lastCheckPointRecPtr = 0, 
  lastCheckPointEndPtr = 0, lastCheckPoint = {redo = 0, ThisTimeLineID = 0, PrevTimeLineID = 0, fullPageWrites = false, 
    nextXidEpoch = 0, nextXid = 0, nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, 
    oldestMulti = 0, oldestMultiDB = 0, time = 0, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}, 
  lastReplayedEndRecPtr = 0, lastReplayedTLI = 0, replayEndRecPtr = 0, replayEndTLI = 0, recoveryLastXTime = 0, 
  currentChunkStartTime = 0, recoveryPause = false, lastFpwDisableRecPtr = 0, info_lck = 0 '\000'}
(gdb) p *Insert
$4 = {insertpos_lck = 0 '\000', CurrBytePos = 5505269968, PrevBytePos = 5505269928, pad = '\000' <repeats 127 times>, 
  RedoRecPtr = 5521450856, forcePageWrites = false, fullPageWrites = true, exclusiveBackupState = EXCLUSIVE_BACKUP_NONE, 
  nonExclusiveBackups = 0, lastBackupStart = 0, WALInsertLocks = 0x7fadf8f74100}
(gdb)

RedoRecPtr = 5521450856 is the REDO point of the previous checkpoint. It matches the value stored in pg_control (0x1491AA768, printed as 1/491AA768):

[xdb@localhost ~]$ echo "obase=16;ibase=10;5521450856"|bc
1491AA768
[xdb@localhost ~]$ pg_controldata|grep REDO
Latest checkpoint's REDO location:    1/491AA768
Latest checkpoint's REDO WAL file:    000000010000000100000049
[xdb@localhost ~]$

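Before entering the critical section, call InitXLogInsert() to initialize the XLOG insertion working area.
Acquire CheckpointLock to ensure that only one checkpoint can be in progress at a time.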

(gdb) n
8683            shutdown = false;
(gdb) 
8686        if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
(gdb) 
8697        InitXLogInsert();
(gdb) 
8705        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
(gdb) 
8714        MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
(gdb) 
8715        CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
(gdb)

Enter the critical section and let smgr (the storage manager) prepare for the checkpoint.

8720        START_CRIT_SECTION();
(gdb) 
(gdb) 
8722        if (shutdown)
(gdb) 
8736        smgrpreckpt();
(gdb) 
8739        MemSet(&checkPoint, 0, sizeof(checkPoint));
(gdb)

Start filling in the checkpoint XLOG record.

(gdb) 
8740        checkPoint.time = (pg_time_t) time(NULL);
(gdb) p checkPoint
$5 = {redo = 0, ThisTimeLineID = 0, PrevTimeLineID = 0, fullPageWrites = false, nextXidEpoch = 0, nextXid = 0, nextOid = 0, 
  nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, oldestMulti = 0, oldestMultiDB = 0, time = 0, 
  oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) n
8747        if (!shutdown && XLogStandbyInfoActive())
(gdb) 
8750            checkPoint.oldestActiveXid = InvalidTransactionId;

Before acquiring the WAL insertion locks, get the location of the last important XLOG record.

(gdb) 
8756        last_important_lsn = GetLastImportantRecPtr();
(gdb) 
8762        WALInsertLockAcquireExclusive();
(gdb) 
(gdb) p last_important_lsn
$6 = 5521451352 --> 0x1491AA958

Concurrent insertions must be blocked while the insert state is examined to determine the checkpoint's REDO pointer.

(gdb) n
8763        curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
(gdb) 
8770        if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
(gdb) p curInsert
$7 = 5521451392 --> 0x1491AA980
(gdb)

Continue filling in the checkpoint XLOG record.

(gdb) n
8790        if (flags & CHECKPOINT_END_OF_RECOVERY)
(gdb) 
8793        checkPoint.ThisTimeLineID = ThisTimeLineID;
(gdb) 
8794        if (flags & CHECKPOINT_END_OF_RECOVERY)
(gdb) 
8797            checkPoint.PrevTimeLineID = ThisTimeLineID;
(gdb) p ThisTimeLineID
$8 = 1
(gdb) n
8799        checkPoint.fullPageWrites = Insert->fullPageWrites;
(gdb) 
8809        freespace = INSERT_FREESPACE(curInsert);
(gdb) 
8810        if (freespace == 0)
(gdb) p freespace
$9 = 5760
(gdb) n
8817        checkPoint.redo = curInsert;
(gdb) 
8830        RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
(gdb) 
(gdb) p checkPoint
$10 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 0, 
  nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, oldestMulti = 0, oldestMultiDB = 0, 
  time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb)

Update the shared RedoRecPtr for future XLogInsert calls. This must be done while holding all the insertion locks.

(gdb) n
8836        WALInsertLockRelease();
(gdb) 
8839        SpinLockAcquire(&XLogCtl->info_lck);
(gdb) 
8840        XLogCtl->RedoRecPtr = checkPoint.redo;
(gdb) 
8841        SpinLockRelease(&XLogCtl->info_lck);
(gdb) 
8847        if (log_checkpoints)
(gdb) 
(gdb) p XLogCtl->RedoRecPtr
$11 = 5521451392

Gather the remaining information needed to assemble the checkpoint record.

(gdb) n
8850        TRACE_POSTGRESQL_CHECKPOINT_START(flags);
(gdb) 
8860        LWLockAcquire(XidGenLock, LW_SHARED);
(gdb) 
8861        checkPoint.nextXid = ShmemVariableCache->nextXid;
(gdb) 
8862        checkPoint.oldestXid = ShmemVariableCache->oldestXid;
(gdb) 
8863        checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
(gdb) 
8864        LWLockRelease(XidGenLock);
(gdb) 
8866        LWLockAcquire(CommitTsLock, LW_SHARED);
(gdb) 
8867        checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
(gdb) 
8868        checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
(gdb) 
8869        LWLockRelease(CommitTsLock);
(gdb) 
8872        checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
(gdb) n
8873        if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
(gdb) 
8876        LWLockAcquire(OidGenLock, LW_SHARED);
(gdb) 
8877        checkPoint.nextOid = ShmemVariableCache->nextOid;
(gdb) p checkPoint
$13 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, 
  nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 0, 
  oldestMultiDB = 0, time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) n
8878        if (!shutdown)
(gdb) 
8879            checkPoint.nextOid += ShmemVariableCache->oidCount;
(gdb) 
8880        LWLockRelease(OidGenLock);
(gdb) p *ShmemVariableCache
$14 = {nextOid = 42575, oidCount = 8189, nextXid = 2308, oldestXid = 561, xidVacLimit = 200000561, 
  xidWarnLimit = 2136484208, xidStopLimit = 2146484208, xidWrapLimit = 2147484208, oldestXidDB = 16400, 
  oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = 2307, oldestClogXid = 561}
(gdb) n
8882        MultiXactGetCheckptMulti(shutdown,
(gdb)

Inspect the checkPoint struct again.

(gdb) p checkPoint
$15 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, 
  nextOid = 50764, nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 1, 
  oldestMultiDB = 16402, time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb)

Leave the critical section.

(gdb) 
8896        END_CRIT_SECTION();

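Get the virtual transaction IDs that are delaying the checkpoint. There are none here (nvxids = 0), so the contents of *vxids are meaningless.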

(gdb) n
8927        vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
(gdb) 
8928        if (nvxids > 0)
(gdb) p vxids
$16 = (VirtualTransactionId *) 0x2f4eb20
(gdb) p *vxids
$17 = {backendId = 2139062143, localTransactionId = 2139062143}
(gdb) p nvxids
$18 = 0
(gdb) 
(gdb) n
8935        pfree(vxids);
(gdb)

Flush all data in shared memory to disk and fsync it.

(gdb) 
8937        CheckPointGuts(checkPoint.redo, flags);
(gdb) p flags
$19 = 44
(gdb) n
8947        if (!shutdown && XLogStandbyInfoActive())
(gdb)

Enter the critical section again.

(gdb) n
8950        START_CRIT_SECTION();
(gdb)

The checkpoint record can now be inserted into XLOG.

(gdb) 
8955        XLogBeginInsert();
(gdb) n
8956        XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
(gdb) 
8957        recptr = XLogInsert(RM_XLOG_ID,
(gdb) 
8961        XLogFlush(recptr);
(gdb) 
8970        if (shutdown)
(gdb)

Update the control file (pg_control). First remember the prior checkpoint's REDO pointer for UpdateCheckPointDistanceEstimate().

(gdb) 
8982        if (shutdown && checkPoint.redo != ProcLastRecPtr)
(gdb) 
8990        PriorRedoPtr = ControlFile->checkPointCopy.redo;
(gdb) 
8995        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
(gdb) p ControlFile->checkPointCopy.redo
$20 = 5521450856
(gdb) n
8996        if (shutdown)
(gdb) 
8998        ControlFile->checkPoint = ProcLastRecPtr;
(gdb) 
8999        ControlFile->checkPointCopy = checkPoint;
(gdb) 
9000        ControlFile->time = (pg_time_t) time(NULL);
(gdb) 
9002        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
(gdb) 
9003        ControlFile->minRecoveryPointTLI = 0;
(gdb) 
9010        SpinLockAcquire(&XLogCtl->ulsn_lck);
(gdb) 
9011        ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
(gdb) 
9012        SpinLockRelease(&XLogCtl->ulsn_lck);
(gdb) 
9014        UpdateControlFile();
(gdb) 
9015        LWLockRelease(ControlFileLock);
(gdb) 
9018        SpinLockAcquire(&XLogCtl->info_lck);
(gdb) p *ControlFile
$21 = {system_identifier = 6624362124887945794, pg_control_version = 1100, catalog_version_no = 201809051, 
  state = DB_IN_PRODUCTION, time = 1546934255, checkPoint = 5521451392, checkPointCopy = {redo = 5521451392, 
    ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, nextOid = 50764, 
    nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 1, oldestMultiDB = 16402, 
    time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}, unloggedLSN = 1, 
  minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = false, 
  wal_level = 0, wal_log_hints = false, MaxConnections = 100, max_worker_processes = 8, max_prepared_xacts = 0, 
  max_locks_per_xact = 64, track_commit_timestamp = false, maxAlign = 8, floatFormat = 1234567, blcksz = 8192, 
  relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32, 
  toast_max_chunk_size = 1996, loblksize = 2048, float4ByVal = true, float8ByVal = true, data_checksum_version = 0, 
  mock_authentication_nonce = "\220\277\067Vg\003\205\232U{\177 h\216\271D\266\063[\\=6\365S\tA\353\361ߧw\301", 
  crc = 930305687}
(gdb)

Update the shared-memory copy of the checkpoint XID/epoch, leave the critical section, and let smgr do its post-checkpoint cleanup (such as deleting old files).

(gdb) n
9019        XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
(gdb) 
9020        XLogCtl->ckptXid = checkPoint.nextXid;
(gdb) 
9021        SpinLockRelease(&XLogCtl->info_lck);
(gdb) 
9027        END_CRIT_SECTION();
(gdb) 
9032        smgrpostckpt();
(gdb)

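Update the checkpoint distance estimate, then delete old log files that are no longer needed as of the latest checkpoint, so that the disk holding the xlog does not fill up.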

(gdb) n
9038        if (PriorRedoPtr != InvalidXLogRecPtr)
(gdb) p PriorRedoPtr
$23 = 5521450856
(gdb) n
9039            UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
(gdb) 
9045        XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
(gdb) 
9046        KeepLogSeg(recptr, &_logSegNo);
(gdb) p RedoRecPtr
$24 = 5521451392
(gdb) p _logSegNo
$25 = 329
(gdb) p wal_segment_size
$26 = 16777216
(gdb) n
9047        _logSegNo--;
(gdb) 
9048        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
(gdb) 
9054        if (!shutdown)
(gdb) p recptr
$27 = 5521451504
(gdb)

Perform the remaining cleanup work.

(gdb) n
9055            PreallocXlogFiles(recptr);
(gdb) 
9064        if (!RecoveryInProgress())
(gdb) 
9065            TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
(gdb) 
9068        LogCheckpointEnd(false);
(gdb) 
9070        TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
(gdb) 
9076        LWLockRelease(CheckpointLock);
(gdb) 
9077    }
(gdb)

The call completes and control returns to CheckpointerMain.

(gdb) 
CheckpointerMain () at checkpointer.c:488
488                 ckpt_performed = true;
(gdb)

DONE!

4. References

checkpointer.c
