PostgreSQL Source Code Walkthrough (115) - Background Processes #3 (checkpointer process #2)

This section briefly introduces the PostgreSQL background process checkpointer and analyzes the implementation logic of the CreateCheckPoint function.

I. Data Structures

CheckPoint
The struct that forms the body of a CheckPoint XLOG record.

/*
 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
 */
typedef struct CheckPoint
{
    XLogRecPtr  redo;           /* next RecPtr available when we began to
                                 * create CheckPoint (i.e. REDO start point) */
    TimeLineID  ThisTimeLineID; /* current TLI */
    TimeLineID  PrevTimeLineID; /* previous TLI, if this record begins a new
                                 * timeline (equals ThisTimeLineID otherwise) */
    bool        fullPageWrites; /* current full_page_writes */
    uint32      nextXidEpoch;   /* higher-order bits of nextXid */
    TransactionId nextXid;      /* next free XID */
    Oid         nextOid;        /* next free OID */
    MultiXactId nextMulti;      /* next free MultiXactId */
    MultiXactOffset nextMultiOffset;    /* next free MultiXact offset */
    TransactionId oldestXid;    /* cluster-wide minimum datfrozenxid */
    Oid         oldestXidDB;    /* database with minimum datfrozenxid */
    MultiXactId oldestMulti;    /* cluster-wide minimum datminmxid */
    Oid         oldestMultiDB;  /* database with minimum datminmxid */
    pg_time_t   time;           /* time stamp of checkpoint */
    TransactionId oldestCommitTsXid;    /* oldest Xid with valid commit
                                         * timestamp */
    TransactionId newestCommitTsXid;    /* newest Xid with valid commit
                                         * timestamp */

    /*
     * Oldest XID still running. This is only needed to initialize hot standby
     * mode from an online checkpoint, so we only bother calculating this for
     * online checkpoints and only when wal_level is replica. Otherwise it's
     * set to InvalidTransactionId.
     */
    TransactionId oldestActiveXid;
} CheckPoint;

/* XLOG info values for XLOG rmgr */
#define XLOG_CHECKPOINT_SHUTDOWN        0x00
#define XLOG_CHECKPOINT_ONLINE          0x10
#define XLOG_NOOP                       0x20
#define XLOG_NEXTOID                    0x30
#define XLOG_SWITCH                     0x40
#define XLOG_BACKUP_END                 0x50
#define XLOG_PARAMETER_CHANGE           0x60
#define XLOG_RESTORE_POINT              0x70
#define XLOG_FPW_CHANGE                 0x80
#define XLOG_END_OF_RECOVERY            0x90
#define XLOG_FPI_FOR_HINT               0xA0
#define XLOG_FPI                        0xB0
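
All of the XLOG_* values above leave the low four bits at zero. As far as I understand the WAL record format, those low bits of a record's info byte are reserved for the WAL machinery itself, so redo code masks them off before comparing against these constants. The following is a minimal standalone sketch of that masking. It assumes XLR_INFO_MASK is 0x0F (as in xlogrecord.h), and rmgr_info and is_checkpoint_record are hypothetical helper names, not PostgreSQL functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the listing above. */
#define XLOG_CHECKPOINT_SHUTDOWN    0x00
#define XLOG_CHECKPOINT_ONLINE      0x10

/* Assumption: the low four bits are reserved, as with XLR_INFO_MASK. */
#define XLR_INFO_MASK               0x0F

/* Hypothetical helper: keep only the rmgr-specific high bits. */
static uint8_t
rmgr_info(uint8_t xl_info)
{
    return (uint8_t) (xl_info & ~XLR_INFO_MASK);
}

/* Hypothetical helper: is this XLOG-rmgr record a checkpoint record? */
static bool
is_checkpoint_record(uint8_t xl_info)
{
    uint8_t     info = rmgr_info(xl_info);

    /* rmgr values must not overlap the reserved low bits */
    assert((XLOG_CHECKPOINT_ONLINE & XLR_INFO_MASK) == 0);

    return info == XLOG_CHECKPOINT_SHUTDOWN ||
        info == XLOG_CHECKPOINT_ONLINE;
}
```

A caller would pass the raw info byte of a record that belongs to the XLOG resource manager; records of other resource managers reuse the same bit patterns for different meanings.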

CheckpointerShmem
Shared memory structure for communication between the checkpointer process and backends.

/*----------
 * Shared memory area for communication between checkpointer and backends
 *
 * The ckpt counters allow backends to watch for completion of a checkpoint
 * request they send.  Here's how it works:
 *  * At start of a checkpoint, checkpointer reads (and clears) the request
 *    flags and increments ckpt_started, while holding ckpt_lck.
 *  * On completion of a checkpoint, checkpointer sets ckpt_done to
 *    equal ckpt_started.
 *  * On failure of a checkpoint, checkpointer increments ckpt_failed
 *    and sets ckpt_done to equal ckpt_started.
 *
 * The algorithm for backends is:
 *  1. Record current values of ckpt_failed and ckpt_started, and
 *     set request flags, while holding ckpt_lck.
 *  2. Send signal to request checkpoint.
 *  3. Sleep until ckpt_started changes.  Now you know a checkpoint has
 *     begun since you started this algorithm (although *not* that it was
 *     specifically initiated by your signal), and that it is using your flags.
 *  4. Record new value of ckpt_started.
 *  5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
 *     arithmetic here in case counters wrap around.)  Now you know a
 *     checkpoint has started and completed, but not whether it was
 *     successful.
 *  6. If ckpt_failed is different from the originally saved value,
 *     assume request failed; otherwise it was definitely successful.
 *
 * ckpt_flags holds the OR of the checkpoint request flags sent by all
 * requesting backends since the last checkpoint start.  The flags are
 * chosen so that OR'ing is the correct way to combine multiple requests.
 * 
 * num_backend_writes is used to count the number of buffer writes performed
 * by user backend processes.  This counter should be wide enough that it
 * can't overflow during a single processing cycle.  num_backend_fsync
 * counts the subset of those writes that also had to do their own fsync,
 * because the checkpointer failed to absorb their request.
 *
 * The requests array holds fsync requests sent by backends and not yet
 * absorbed by the checkpointer.
 *
 * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
 * the requests fields are protected by CheckpointerCommLock.
 *----------
 */
typedef struct
{
    RelFileNode rnode;          /* tablespace / database / relation identifiers */
    ForkNumber  forknum;        /* fork number */
    BlockNumber segno;          /* see md.c for special values */
    /* might add a real request-type field later; not needed yet */
} CheckpointerRequest;

typedef struct
{
    pid_t       checkpointer_pid;   /* PID (0 if not started) */
    slock_t     ckpt_lck;       /* protects all the ckpt_* fields */
    int         ckpt_started;   /* advances when checkpoint starts */
    int         ckpt_done;      /* advances when checkpoint done */
    int         ckpt_failed;    /* advances when checkpoint fails */
    int         ckpt_flags;     /* checkpoint flags, as defined in xlog.h */
    uint32      num_backend_writes; /* counts user backend buffer writes */
    uint32      num_backend_fsync;  /* counts user backend fsync calls */
    int         num_requests;   /* current # of requests */
    int         max_requests;   /* allocated array size */
    CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;

/* static pointer to the CheckpointerShmemStruct in shared memory */
static CheckpointerShmemStruct *CheckpointerShmem;
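
Steps 5 and 6 of the backend algorithm described in the comment can be sketched as follows. This is a standalone illustration rather than the code in checkpointer.c: the function names are hypothetical, and the wraparound-tolerant comparison is written with unsigned arithmetic so that the sketch itself never relies on signed overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Step 5: has ckpt_done caught up with the saved ckpt_started value?
 * The difference is taken modulo 2^N on unsigned values, then read as a
 * signed distance: anything in [0, INT_MAX] counts as "reached".
 */
static bool
ckpt_done_reached(int done, int saved_started)
{
    unsigned int diff = (unsigned int) done - (unsigned int) saved_started;

    return diff <= (unsigned int) INT_MAX;
}

/*
 * Step 6: the request is considered successful only if ckpt_failed did
 * not move between the start of the algorithm and its end.
 */
static bool
ckpt_request_succeeded(int failed_before, int failed_after)
{
    return failed_before == failed_after;
}
```

The point of the modulo comparison is that a counter which has just wrapped from INT_MAX to INT_MIN still counts as being "ahead" of a saved value of INT_MAX, whereas a plain `>=` on the raw integers would report the opposite.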

VirtualTransactionId
Top-level transactions are identified by VirtualTransactionIDs.

/*
 * Top-level transactions are identified by VirtualTransactionIDs comprising
 * the BackendId of the backend running the xact, plus a locally-assigned
 * LocalTransactionId.  These are guaranteed unique over the short term,
 * but will be reused after a database restart; hence they should never
 * be stored on disk.
 *
 * Note that struct VirtualTransactionId can not be assumed to be atomically
 * assignable as a whole.  However, type LocalTransactionId is assumed to
 * be atomically assignable, and the backend ID doesn't change often enough
 * to be a problem, so we can fetch or assign the two fields separately.
 * We deliberately refrain from using the struct within PGPROC, to prevent
 * coding errors from trying to use struct assignment with it; instead use
 * GET_VXID_FROM_PGPROC().
 */
typedef struct
{
    BackendId   backendId;      /* determined at backend startup */
    LocalTransactionId localTransactionId;  /* backend-local transaction id */
} VirtualTransactionId;

II. Source Code Walkthrough

The CreateCheckPoint function performs a checkpoint, either during shutdown or while the system is running.


/*
 * Perform a checkpoint --- either during shutdown, or on-the-fly
 *
 * flags is a bitwise OR of the following:
 *  CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
 *  CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
 *  CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
 *      ignoring checkpoint_completion_target parameter.
 *  CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
 *      since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
 *      CHECKPOINT_END_OF_RECOVERY).
 *  CHECKPOINT_FLUSH_ALL: also flush buffers of unlogged tables.
 *
 * Note: flags contains other bits, of interest here only for logging purposes.
 * In particular note that this routine is synchronous and does not pay
 * attention to CHECKPOINT_WAIT.
 *
 * If !shutdown then we are writing an online checkpoint. This is a very special
 * kind of operation and WAL record because the checkpoint action occurs over
 * a period of time yet logically occurs at just a single LSN. The logical
 * position of the WAL record (redo ptr) is the same or earlier than the
 * physical position. When we replay WAL we locate the checkpoint via its
 * physical position then read the redo ptr and actually start replay at the
 * earlier logical position. Note that we don't write *anything* to WAL at
 * the logical position, so that location could be any other kind of WAL record.
 * All of this mechanism allows us to continue working while we checkpoint.
 * As a result, timing of actions is critical here and be careful to note that
 * this function will likely take minutes to execute on a busy system.
 */
void
CreateCheckPoint(int flags)
{
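    bool        shutdown;       /* is this a shutdown-type checkpoint? */
    CheckPoint  checkPoint;     /* body of the checkpoint record being built */
    XLogRecPtr  recptr;         /* end position of the inserted checkpoint record */
    XLogSegNo   _logSegNo;      /* WAL segment number (uint64), not an LSN */
    XLogCtlInsert *Insert = &XLogCtl->Insert;   /* shared WAL insertion state */
    uint32      freespace;      /* free space left on the current WAL page */
    XLogRecPtr  PriorRedoPtr;   /* redo pointer of the previous checkpoint */
    XLogRecPtr  curInsert;      /* current WAL insert position */
    XLogRecPtr  last_important_lsn; /* LSN of the last important WAL record */
    VirtualTransactionId *vxids;    /* virtual XIDs that delay the checkpoint */
    int         nvxids;         /* number of entries in vxids */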

    /*
     * An end-of-recovery checkpoint is really a shutdown checkpoint, just
     * issued at a different time.
     */
    if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
        shutdown = true;
    else
        shutdown = false;

    /* sanity check */
    if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
        elog(ERROR, "can't create a checkpoint during recovery");

    /*
     * Initialize InitXLogInsert working areas before entering the critical
     * section.  Normally, this is done by the first call to
     * RecoveryInProgress() or LocalSetXLogInsertAllowed(), but when creating
     * an end-of-recovery checkpoint, the LocalSetXLogInsertAllowed call is
     * done below in a critical section, and InitXLogInsert cannot be called
     * in a critical section.
     */
    InitXLogInsert();

    /*
     * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
     * (This is just pro forma, since in the present system structure there is
     * only one process that is allowed to issue checkpoints at any given
     * time.)
     */
    LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

    /*
     * Prepare to accumulate statistics.
     *
     * Note: because it is possible for log_checkpoints to change while a
     * checkpoint proceeds, we always accumulate stats, even if
     * log_checkpoints is currently off.
     */
    MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
    CheckpointStats.ckpt_start_t = GetCurrentTimestamp();

    /*
     * Use a critical section to force system panic if we have trouble.
     */
    START_CRIT_SECTION();

    if (shutdown)
    {
        /* shutdown is true: update the control file (pg_control) */
        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
        ControlFile->state = DB_SHUTDOWNING;
        ControlFile->time = (pg_time_t) time(NULL);
        UpdateControlFile();
        LWLockRelease(ControlFileLock);
    }

    /*
     * Let smgr prepare for checkpoint; this has to happen before we determine
     * the REDO pointer.  Note that smgr must not do anything that'd have to
     * be undone if we decide no checkpoint is needed.
     */
    smgrpreckpt();

    /* Begin filling in the checkpoint WAL record */
    MemSet(&checkPoint, 0, sizeof(checkPoint));
    checkPoint.time = (pg_time_t) time(NULL);

    /*
     * For Hot Standby, derive the oldestActiveXid before we fix the redo
     * pointer. This allows us to begin accumulating changes to assemble our
     * starting snapshot of locks and transactions.
     */
    if (!shutdown && XLogStandbyInfoActive())
        checkPoint.oldestActiveXid = GetOldestActiveTransactionId();
    else
        checkPoint.oldestActiveXid = InvalidTransactionId;

    /*
     * Get location of last important record before acquiring insert locks (as
     * GetLastImportantRecPtr() also locks WAL locks).
     */
    last_important_lsn = GetLastImportantRecPtr();

    /*
     * We must block concurrent insertions while examining insert state to
     * determine the checkpoint REDO pointer.
     */
    WALInsertLockAcquireExclusive();
    curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);

    /*
     * If this isn't a shutdown or forced checkpoint, and if there has been no
     * WAL activity requiring a checkpoint, skip it.  The idea here is to
     * avoid inserting duplicate checkpoints when the system is idle.
     */
    if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
                  CHECKPOINT_FORCE)) == 0)
    {
        if (last_important_lsn == ControlFile->checkPoint)
        {
            WALInsertLockRelease();
            LWLockRelease(CheckpointLock);
            END_CRIT_SECTION();
            ereport(DEBUG1,
                    (errmsg("checkpoint skipped because system is idle")));
            return;
        }
    }

    /*
     * An end-of-recovery checkpoint is created before anyone is allowed to
     * write WAL. To allow us to write the checkpoint record, temporarily
     * enable XLogInsertAllowed.  (This also ensures ThisTimeLineID is
     * initialized, which we need here and in AdvanceXLInsertBuffer.)
     */
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        LocalSetXLogInsertAllowed();

    checkPoint.ThisTimeLineID = ThisTimeLineID;
    if (flags & CHECKPOINT_END_OF_RECOVERY)
        checkPoint.PrevTimeLineID = XLogCtl->PrevTimeLineID;
    else
        checkPoint.PrevTimeLineID = ThisTimeLineID;

    checkPoint.fullPageWrites = Insert->fullPageWrites;

    /*
     * Compute new REDO record ptr = location of next XLOG record.
     * 
     * NB: this is NOT necessarily where the checkpoint record itself will be,
     * since other backends may insert more XLOG records while we're off doing
     * the buffer flush work.  Those XLOG records are logically after the
     * checkpoint, even though physically before it.  Got that?
     */
    freespace = INSERT_FREESPACE(curInsert);    /* free space left on the current page */
    if (freespace == 0)
    {
        /* page is full: the next record starts after the next page's header */
        if (XLogSegmentOffset(curInsert, wal_segment_size) == 0)
            curInsert += SizeOfXLogLongPHD; /* first page of a segment: long header */
        else
            curInsert += SizeOfXLogShortPHD;    /* any other page: short header */
    }
    checkPoint.redo = curInsert;

    /*
     * Here we update the shared RedoRecPtr for future XLogInsert calls; this
     * must be done while holding all the insertion locks.
     *
     * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
     * pointing past where it really needs to point.  This is okay; the only
     * consequence is that XLogInsert might back up whole buffers that it
     * didn't really need to.  We can't postpone advancing RedoRecPtr because
     * XLogInserts that happen while we are dumping buffers must assume that
     * their buffer changes are not included in the checkpoint.
     */
    RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;

    /*
     * Now we can release the WAL insertion locks, allowing other xacts to
     * proceed while we are flushing disk buffers.
     */
    WALInsertLockRelease();

    /* Update the info_lck-protected copy of RedoRecPtr as well */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->RedoRecPtr = checkPoint.redo;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * If enabled, log checkpoint start.  We postpone this until now so as not
     * to log anything if we decided to skip the checkpoint.
     */
    if (log_checkpoints)
        LogCheckpointStart(flags, false);

    TRACE_POSTGRESQL_CHECKPOINT_START(flags);

    /*
     * Get the other info we need for the checkpoint record.
     *
     * We don't need to save oldestClogXid in the checkpoint, it only matters
     * for the short period in which clog is being truncated, and if we crash
     * during that we'll redo the clog truncation and fix up oldestClogXid
     * there.
     */
    LWLockAcquire(XidGenLock, LW_SHARED);
    checkPoint.nextXid = ShmemVariableCache->nextXid;
    checkPoint.oldestXid = ShmemVariableCache->oldestXid;
    checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
    LWLockRelease(XidGenLock);

    LWLockAcquire(CommitTsLock, LW_SHARED);
    checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
    checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
    LWLockRelease(CommitTsLock);

    /* Increase XID epoch if we've wrapped around since last checkpoint */
    checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
    if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
        checkPoint.nextXidEpoch++;

    LWLockAcquire(OidGenLock, LW_SHARED);
    checkPoint.nextOid = ShmemVariableCache->nextOid;
    if (!shutdown)
        checkPoint.nextOid += ShmemVariableCache->oidCount;
    LWLockRelease(OidGenLock);

    MultiXactGetCheckptMulti(shutdown,
                             &checkPoint.nextMulti,
                             &checkPoint.nextMultiOffset,
                             &checkPoint.oldestMulti,
                             &checkPoint.oldestMultiDB);

    /*
     * Having constructed the checkpoint record, ensure all shmem disk buffers
     * and commit-log buffers are flushed to disk.
     *
     * This I/O could fail for various reasons.  If so, we will fail to
     * complete the checkpoint, but there is no reason to force a system
     * panic. Accordingly, exit critical section while doing it.
     */
    END_CRIT_SECTION();

    /*
     * In some cases there are groups of actions that must all occur on one
     * side or the other of a checkpoint record. Before flushing the
     * checkpoint record we must explicitly wait for any backend currently
     * performing those groups of actions.
     *
     * One example is end of transaction, so we must wait for any transactions
     * that are currently in commit critical sections.  If an xact inserted
     * its commit record into XLOG just before the REDO point, then a crash
     * restart from the REDO point would not replay that record, which means
     * that our flushing had better include the xact's update of pg_xact.  So
     * we wait till he's out of his commit critical section before proceeding.
     * See notes in RecordTransactionCommit().
     *
     * Because we've already released the insertion locks, this test is a bit
     * fuzzy: it is possible that we will wait for xacts we didn't really need
     * to wait for.  But the delay should be short and it seems better to make
     * checkpoint take a bit longer than to hold off insertions longer than
     * necessary. (In fact, the whole reason we have this issue is that xact.c
     * does commit record XLOG insertion and clog update as two separate steps
     * protected by different locks, but again that seems best on grounds of
     * minimizing lock contention.)
     *
     * A transaction that has not yet set delayChkpt when we look cannot be at
     * risk, since he's not inserted his commit record yet; and one that's
     * already cleared it is not at risk either, since he's done fixing clog
     * and we will correctly flush the update below.  So we cannot miss any
     * xacts we need to wait for.
     */
    vxids = GetVirtualXIDsDelayingChkpt(&nvxids);   /* vxids of backends delaying the checkpoint */
    if (nvxids > 0)
    {
        do
        {
            pg_usleep(10000L);  /* wait for 10 msec */
        } while (HaveVirtualXIDsDelayingChkpt(vxids, nvxids));
    }
    pfree(vxids);

    /* flush all data in shared memory to disk, and fsync */
    CheckPointGuts(checkPoint.redo, flags);

    /*
     * Take a snapshot of running transactions and write this to WAL. This
     * allows us to reconstruct the state of running transactions during
     * archive recovery, if required. Skip, if this info disabled.
     * 
     * If we are shutting down, or Startup process is completing crash
     * recovery we don't need to write running xact data.
     */
    if (!shutdown && XLogStandbyInfoActive())
        LogStandbySnapshot();

    START_CRIT_SECTION();

    /*
     * Now insert the checkpoint record into XLOG.
     */
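    XLogBeginInsert();          /* start constructing the WAL record */
    XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));   /* register the payload */
    recptr = XLogInsert(RM_XLOG_ID,
                        shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
                        XLOG_CHECKPOINT_ONLINE);    /* insert the record */

    XLogFlush(recptr);          /* flush WAL up to the end of the record */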

    /*
     * We mustn't write any new WAL after a shutdown checkpoint, or it will be
     * overwritten at next startup.  No-one should even try, this just allows
     * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
     * to just temporarily disable writing until the system has exited
     * recovery.
     */
    if (shutdown)
    {
        if (flags & CHECKPOINT_END_OF_RECOVERY)
            LocalXLogInsertAllowed = -1;    /* return to "check" state */
        else
            LocalXLogInsertAllowed = 0; /* never again write WAL */
    }

    /*
     * We now have ProcLastRecPtr = start of actual checkpoint record, recptr
     * = end of actual checkpoint record.
     */
    if (shutdown && checkPoint.redo != ProcLastRecPtr)
        ereport(PANIC,
                (errmsg("concurrent write-ahead log activity while database system is shutting down")));

    /*
     * Remember the prior checkpoint's redo ptr for
     * UpdateCheckPointDistanceEstimate()
     */
    PriorRedoPtr = ControlFile->checkPointCopy.redo;

    /*
     * Update the control file.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (shutdown)
        ControlFile->state = DB_SHUTDOWNED;
    ControlFile->checkPoint = ProcLastRecPtr;
    ControlFile->checkPointCopy = checkPoint;
    ControlFile->time = (pg_time_t) time(NULL);
    /* crash recovery should always recover to the end of WAL */
    ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
    ControlFile->minRecoveryPointTLI = 0;

    /*
     * Persist unloggedLSN value. It's reset on crash recovery, so this goes
     * unused on non-shutdown checkpoints, but seems useful to store it always
     * for debugging purposes.
     */
    SpinLockAcquire(&XLogCtl->ulsn_lck);
    ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
    SpinLockRelease(&XLogCtl->ulsn_lck);

    UpdateControlFile();
    LWLockRelease(ControlFileLock);

    /* Update shared-memory copy of checkpoint XID/epoch */
    SpinLockAcquire(&XLogCtl->info_lck);
    XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
    XLogCtl->ckptXid = checkPoint.nextXid;
    SpinLockRelease(&XLogCtl->info_lck);

    /*
     * We are now done with critical updates; no need for system panic if we
     * have trouble while fooling with old log segments.
     */
    END_CRIT_SECTION();

    /*
     * Let smgr do post-checkpoint cleanup (eg, deleting old files).
     */
    smgrpostckpt();

    /*
     * Update the average distance between checkpoints if the prior checkpoint
     * exists.
     */
    if (PriorRedoPtr != InvalidXLogRecPtr)
        UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);

    /*
     * Delete old log files, those no longer needed for last checkpoint to
     * prevent the disk holding the xlog from growing full.
     */
    XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
    KeepLogSeg(recptr, &_logSegNo);
    _logSegNo--;
    RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);

    /*
     * Make more log segments if needed.  (Do this after recycling old log
     * segments, since that may supply some of the needed files.)
     */
    if (!shutdown)
        PreallocXlogFiles(recptr);

    /*
     * Truncate pg_subtrans if possible.  We can throw away all data before
     * the oldest XMIN of any running transaction.  No future transaction will
     * attempt to reference any pg_subtrans entry older than that (see Asserts
     * in subtrans.c).  During recovery, though, we mustn't do this because
     * StartupSUBTRANS hasn't been called yet.
     */
    if (!RecoveryInProgress())
        TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));

    /* Real work is done, but log and update stats before releasing lock. */
    LogCheckpointEnd(false);

    TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                     NBuffers,
                                     CheckpointStats.ckpt_segs_added,
                                     CheckpointStats.ckpt_segs_removed,
                                     CheckpointStats.ckpt_segs_recycled);
    /* Release the checkpoint lock. */
    LWLockRelease(CheckpointLock);
}

/*
 * Flush all data in shared memory to disk, and fsync
 *
 * This is the common code shared between regular checkpoints and
 * recovery restartpoints.
 */
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
    CheckPointCLOG();
    CheckPointCommitTs();
    CheckPointSUBTRANS();
    CheckPointMultiXact();
    CheckPointPredicate();
    CheckPointRelationMap();
    CheckPointReplicationSlots();
    CheckPointSnapBuild();
    CheckPointLogicalRewriteHeap();
    CheckPointBuffers(flags);   /* performs all required fsyncs */
    CheckPointReplicationOrigin();
    /* We deliberately delay 2PC checkpointing as long as possible */
    CheckPointTwoPhase(checkPointRedo);
}

3. Trace Analysis

Update some data, then run CHECKPOINT.

testdb=# update t_wal_ckpt set c2 = 'C2_'||substr(c2,4,40);
UPDATE 1
testdb=# checkpoint;

Start gdb, configure signal handling, set a breakpoint, and step into CreateCheckPoint.

(gdb) handle SIGINT print nostop pass
SIGINT is used by the debugger.
Are you sure you want to change it? (y or n) y
Signal        Stop  Print Pass to program Description
SIGINT        No  Yes Yes   Interrupt
(gdb) 
(gdb) b CreateCheckPoint
Breakpoint 1 at 0x55b4fb: file xlog.c, line 8668.
(gdb) c
Continuing.

Program received signal SIGINT, Interrupt.

Breakpoint 1, CreateCheckPoint (flags=44) at xlog.c:8668
8668        XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb) 

Get the XLOG insert control structure.

8668        XLogCtlInsert *Insert = &XLogCtl->Insert;
(gdb) n
8680        if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
(gdb) p XLogCtl
$1 = (XLogCtlData *) 0x7fadf8f6fa80
(gdb) p *XLogCtl
$2 = {Insert = {insertpos_lck = 0 '\000', CurrBytePos = 5505269968, PrevBytePos = 5505269928, 
    pad = '\000' <repeats 127 times>, RedoRecPtr = 5521450856, forcePageWrites = false, fullPageWrites = true, 
    exclusiveBackupState = EXCLUSIVE_BACKUP_NONE, nonExclusiveBackups = 0, lastBackupStart = 0, 
    WALInsertLocks = 0x7fadf8f74100}, LogwrtRqst = {Write = 5521451392, Flush = 5521451392}, RedoRecPtr = 5521450856, 
  ckptXidEpoch = 0, ckptXid = 2307, asyncXactLSN = 5521363848, replicationSlotMinLSN = 0, lastRemovedSegNo = 0, 
  unloggedLSN = 1, ulsn_lck = 0 '\000', lastSegSwitchTime = 1546915130, lastSegSwitchLSN = 5521363360, LogwrtResult = {
    Write = 5521451392, Flush = 5521451392}, InitializedUpTo = 5538226176, pages = 0x7fadf8f76000 "\230\320\006", 
  xlblocks = 0x7fadf8f70088, XLogCacheBlck = 2047, ThisTimeLineID = 1, PrevTimeLineID = 1, 
  archiveCleanupCommand = '\000' <repeats 1023 times>, SharedRecoveryInProgress = false, SharedHotStandbyActive = false, 
  WalWriterSleeping = true, recoveryWakeupLatch = {is_set = 0, is_shared = true, owner_pid = 0}, lastCheckPointRecPtr = 0, 
  lastCheckPointEndPtr = 0, lastCheckPoint = {redo = 0, ThisTimeLineID = 0, PrevTimeLineID = 0, fullPageWrites = false, 
    nextXidEpoch = 0, nextXid = 0, nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, 
    oldestMulti = 0, oldestMultiDB = 0, time = 0, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}, 
  lastReplayedEndRecPtr = 0, lastReplayedTLI = 0, replayEndRecPtr = 0, replayEndTLI = 0, recoveryLastXTime = 0, 
  currentChunkStartTime = 0, recoveryPause = false, lastFpwDisableRecPtr = 0, info_lck = 0 '\000'}
(gdb) p *Insert
$4 = {insertpos_lck = 0 '\000', CurrBytePos = 5505269968, PrevBytePos = 5505269928, pad = '\000' <repeats 127 times>, 
  RedoRecPtr = 5521450856, forcePageWrites = false, fullPageWrites = true, exclusiveBackupState = EXCLUSIVE_BACKUP_NONE, 
  nonExclusiveBackups = 0, lastBackupStart = 0, WALInsertLocks = 0x7fadf8f74100}
(gdb)   

RedoRecPtr = 5521450856 is the REDO point of the latest completed checkpoint; it matches the value stored in pg_control.

[xdb@localhost ~]$ echo "obase=16;ibase=10;5521450856"|bc
1491AA768
[xdb@localhost ~]$ pg_controldata|grep REDO
Latest checkpoint's REDO location:    1/491AA768
Latest checkpoint's REDO WAL file:    000000010000000100000049
[xdb@localhost ~]$ 
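
The same comparison can be done in C instead of bc. The sketch below is only an illustration: format_lsn, lsn_to_segno, wal_file_name and lsn_matches_text are hypothetical helper names, not PostgreSQL functions, and the segment size is passed in explicitly (16777216 bytes in this session, per the wal_segment_size value gdb prints later).

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format an LSN the way pg_controldata prints it: high 32 bits, '/', low 32 bits, in hex. */
static void format_lsn(uint64_t lsn, char *buf, size_t buflen)
{
    snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
             (uint32_t) (lsn >> 32), (uint32_t) lsn);
}

/* Number of the WAL segment that contains the given LSN. */
static uint64_t lsn_to_segno(uint64_t lsn, uint64_t seg_size)
{
    assert(seg_size > 0);
    return lsn / seg_size;
}

/* Build the 24-hex-digit WAL file name: timeline, then the segment number in two parts. */
static void wal_file_name(uint32_t tli, uint64_t segno, uint64_t seg_size,
                          char *buf, size_t buflen)
{
    uint64_t segs_per_id;

    assert(seg_size > 0 && seg_size <= UINT64_C(0x100000000));
    segs_per_id = UINT64_C(0x100000000) / seg_size;   /* segments per 4 GB of WAL */
    snprintf(buf, buflen, "%08" PRIX32 "%08" PRIX32 "%08" PRIX32,
             tli, (uint32_t) (segno / segs_per_id), (uint32_t) (segno % segs_per_id));
}

/* Return 1 if a decimal LSN (as printed by gdb) equals the textual X/X form, else 0. */
static int lsn_matches_text(uint64_t lsn, const char *text)
{
    char buf[32];

    format_lsn(lsn, buf, sizeof buf);
    return strcmp(buf, text) == 0;
}
```

With the values from this session, lsn_matches_text(5521450856, "1/491AA768") returns 1, and segment 329 on timeline 1 maps to 000000010000000100000049, the file name pg_controldata reports.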

Before entering the critical section, initialize the XLogInsert working area by calling InitXLogInsert.
Acquire CheckpointLock so that only one checkpoint can run at any given time.

(gdb) n
8683            shutdown = false;
(gdb) 
8686        if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
(gdb) 
8697        InitXLogInsert();
(gdb) 
8705        LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
(gdb) 
8714        MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
(gdb) 
8715        CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
(gdb) 

Enter the critical section and let smgr (the storage manager) prepare for the checkpoint.

8720        START_CRIT_SECTION();
(gdb) 
(gdb) 
8722        if (shutdown)
(gdb) 
8736        smgrpreckpt();
(gdb) 
8739        MemSet(&checkPoint, 0, sizeof(checkPoint));
(gdb) 

Start filling in the checkpoint XLOG record.

(gdb) 
8740        checkPoint.time = (pg_time_t) time(NULL);
(gdb) p checkPoint
$5 = {redo = 0, ThisTimeLineID = 0, PrevTimeLineID = 0, fullPageWrites = false, nextXidEpoch = 0, nextXid = 0, nextOid = 0, 
  nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, oldestMulti = 0, oldestMultiDB = 0, time = 0, 
  oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) n
8747        if (!shutdown && XLogStandbyInfoActive())
(gdb) 
8750            checkPoint.oldestActiveXid = InvalidTransactionId;

Before acquiring the WAL insertion locks, get the location of the last important XLOG record.

(gdb) 
8756        last_important_lsn = GetLastImportantRecPtr();
(gdb) 
8762        WALInsertLockAcquireExclusive();
(gdb) 
(gdb) p last_important_lsn
$6 = 5521451352 --> 0x1491AA958

While the insert state is examined to determine the checkpoint's REDO pointer, concurrent insertions must be blocked.

(gdb) n
8763        curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
(gdb) 
8770        if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
(gdb) p curInsert
$7 = 5521451392 --> 0x1491AA980
(gdb) 

Continue filling in the checkpoint XLOG record.

(gdb) n
8790        if (flags & CHECKPOINT_END_OF_RECOVERY)
(gdb) 
8793        checkPoint.ThisTimeLineID = ThisTimeLineID;
(gdb) 
8794        if (flags & CHECKPOINT_END_OF_RECOVERY)
(gdb) 
8797            checkPoint.PrevTimeLineID = ThisTimeLineID;
(gdb) p ThisTimeLineID
$8 = 1
(gdb) n
8799        checkPoint.fullPageWrites = Insert->fullPageWrites;
(gdb) 
8809        freespace = INSERT_FREESPACE(curInsert);
(gdb) 
8810        if (freespace == 0)
(gdb) p freespace
$9 = 5760
(gdb) n
8817        checkPoint.redo = curInsert;
(gdb) 
8830        RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
(gdb) 
(gdb) p checkPoint
$10 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 0, 
  nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 0, oldestXidDB = 0, oldestMulti = 0, oldestMultiDB = 0, 
  time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) 
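
The freespace value of 5760 printed above follows directly from where curInsert falls within its WAL page. The sketch below reproduces that arithmetic; insert_freespace is a hypothetical stand-in for the INSERT_FREESPACE macro, and the page size is passed as a parameter (8192 here, the xlog_blcksz shown in the pg_control dump) rather than taken from a compile-time constant.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes left on the WAL page that contains endptr.
 * Returns 0 when endptr sits exactly on a page boundary.
 */
static uint32_t insert_freespace(uint64_t endptr, uint32_t blcksz)
{
    uint32_t used;

    assert(blcksz > 0);
    used = (uint32_t) (endptr % blcksz);    /* bytes already consumed on this page */
    return used == 0 ? 0 : blcksz - used;
}
```

For curInsert = 5521451392 the offset within the page is 2432, so 8192 - 2432 = 5760 bytes remain. Because the result is not zero, checkPoint.redo is set to curInsert unchanged; only when the pointer sits exactly on a page boundary does the source first advance it past the page header.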

Update the shared RedoRecPtr for future XLogInsert calls; this must be done while all the insertion locks are held.

(gdb) n
8836        WALInsertLockRelease();
(gdb) 
8839        SpinLockAcquire(&XLogCtl->info_lck);
(gdb) 
8840        XLogCtl->RedoRecPtr = checkPoint.redo;
(gdb) 
8841        SpinLockRelease(&XLogCtl->info_lck);
(gdb) 
8847        if (log_checkpoints)
(gdb) 
(gdb) p XLogCtl->RedoRecPtr
$11 = 5521451392

Collect the remaining information needed to assemble the checkpoint record.

(gdb) n
8850        TRACE_POSTGRESQL_CHECKPOINT_START(flags);
(gdb) 
8860        LWLockAcquire(XidGenLock, LW_SHARED);
(gdb) 
8861        checkPoint.nextXid = ShmemVariableCache->nextXid;
(gdb) 
8862        checkPoint.oldestXid = ShmemVariableCache->oldestXid;
(gdb) 
8863        checkPoint.oldestXidDB = ShmemVariableCache->oldestXidDB;
(gdb) 
8864        LWLockRelease(XidGenLock);
(gdb) 
8866        LWLockAcquire(CommitTsLock, LW_SHARED);
(gdb) 
8867        checkPoint.oldestCommitTsXid = ShmemVariableCache->oldestCommitTsXid;
(gdb) 
8868        checkPoint.newestCommitTsXid = ShmemVariableCache->newestCommitTsXid;
(gdb) 
8869        LWLockRelease(CommitTsLock);
(gdb) 
8872        checkPoint.nextXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
(gdb) n
8873        if (checkPoint.nextXid < ControlFile->checkPointCopy.nextXid)
(gdb) 
8876        LWLockAcquire(OidGenLock, LW_SHARED);
(gdb) 
8877        checkPoint.nextOid = ShmemVariableCache->nextOid;
(gdb) p checkPoint
$13 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, 
  nextOid = 0, nextMulti = 0, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 0, 
  oldestMultiDB = 0, time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) n
8878        if (!shutdown)
(gdb) 
8879            checkPoint.nextOid += ShmemVariableCache->oidCount;
(gdb) 
8880        LWLockRelease(OidGenLock);
(gdb) p *ShmemVariableCache
$14 = {nextOid = 42575, oidCount = 8189, nextXid = 2308, oldestXid = 561, xidVacLimit = 200000561, 
  xidWarnLimit = 2136484208, xidStopLimit = 2146484208, xidWrapLimit = 2147484208, oldestXidDB = 16400, 
  oldestCommitTsXid = 0, newestCommitTsXid = 0, latestCompletedXid = 2307, oldestClogXid = 561}
(gdb) n
8882        MultiXactGetCheckptMulti(shutdown,
(gdb) 

Look at the checkPoint struct again.

(gdb) p checkPoint
$15 = {redo = 5521451392, ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, 
  nextOid = 50764, nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 1, 
  oldestMultiDB = 16402, time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}
(gdb) 
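
Two of the fields filled in above are derived rather than copied: nextOid has oidCount added to it unless this is a shutdown checkpoint, and nextXidEpoch is taken from the control file and bumped only if nextXid has wrapped around since the previous checkpoint. The sketch below restates both rules; checkpoint_next_oid, checkpoint_xid_epoch and check_against_trace are hypothetical names, and the previous checkpoint's nextXid is assumed to be 2307 (the ckptXid shown in the XLogCtl dump).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * nextOid as stored in the checkpoint record. oidCount is the number of
 * OIDs that can still be handed out before more WAL has to be written, so
 * nextOid + oidCount is the first OID not yet covered by WAL.
 */
static uint32_t checkpoint_next_oid(uint32_t next_oid, uint32_t oid_count, bool shutdown)
{
    return shutdown ? next_oid : next_oid + oid_count;
}

/* The epoch advances by one when nextXid is smaller than at the previous checkpoint. */
static uint32_t checkpoint_xid_epoch(uint32_t prev_epoch, uint32_t prev_next_xid,
                                     uint32_t next_xid)
{
    return next_xid < prev_next_xid ? prev_epoch + 1 : prev_epoch;
}

/* Re-check both helpers against the values printed in the gdb session. */
static void check_against_trace(void)
{
    assert(checkpoint_next_oid(42575, 8189, false) == 50764);
    assert(checkpoint_xid_epoch(0, 2307, 2308) == 0);   /* 2307 is an assumption */
}
```

This matches the session: 42575 + 8189 = 50764, the nextOid shown in the second dump of checkPoint, and the epoch stays at 0 because nextXid did not go backwards.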

Leave the critical section.

(gdb) 
8896        END_CRIT_SECTION();

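Get the virtual transaction IDs that are delaying the checkpoint (there are none here: nvxids is 0, so the contents of vxids are not meaningful).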

(gdb) n
8927        vxids = GetVirtualXIDsDelayingChkpt(&nvxids);
(gdb) 
8928        if (nvxids > 0)
(gdb) p vxids
$16 = (VirtualTransactionId *) 0x2f4eb20
(gdb) p *vxids
$17 = {backendId = 2139062143, localTransactionId = 2139062143}
(gdb) p nvxids
$18 = 0
(gdb) 
(gdb) n
8935        pfree(vxids);
(gdb) 
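
In the session the code jumps from line 8928 straight to pfree because nvxids is 0. When the list is not empty, CreateCheckPoint sleeps 10 ms at a time (pg_usleep(10000L)) and re-checks with HaveVirtualXIDsDelayingChkpt until none of the listed transactions still delays the checkpoint. The sketch below models only that control flow; the callback types and function names are hypothetical, and the PostgreSQL calls are replaced by function pointers so that the block compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback: nonzero while a listed transaction still delays the checkpoint. */
typedef int (*still_delaying_fn) (void *state);

/* Hypothetical callback standing in for the 10 ms sleep; may be NULL. */
typedef void (*nap_fn) (void);

/* Poll until no listed transaction delays the checkpoint; returns the number of naps taken. */
static unsigned wait_for_delaying_xacts(int nvxids, still_delaying_fn still_delaying,
                                        void *state, nap_fn nap)
{
    unsigned naps = 0;

    assert(still_delaying != NULL);
    if (nvxids > 0)
    {
        do
        {
            if (nap != NULL)
                nap();
            naps++;
        } while (still_delaying(state));
    }
    return naps;
}

/* Example callback: counts an int down and reports "still delaying" until it reaches zero. */
static int countdown_still_delaying(void *state)
{
    int *remaining = state;

    return --(*remaining) > 0;
}
```

With nvxids equal to 0, as in this session, the function returns immediately without polling.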

Flush the data in shared memory to disk and fsync it.

(gdb) 
8937        CheckPointGuts(checkPoint.redo, flags);
(gdb) p flags
$19 = 44
(gdb) n
8947        if (!shutdown && XLogStandbyInfoActive())
(gdb) 
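
The flags value of 44 is a bit mask. The sketch below decodes it, assuming the bit values as I recall them from src/include/access/xlog.h in PostgreSQL 11; they are copied here under a SKETCH_ prefix and should be checked against your own source tree. describe_checkpoint_flags is a hypothetical helper, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: bit values recalled from PostgreSQL 11's xlog.h, copied for this example. */
#define SKETCH_CHECKPOINT_IS_SHUTDOWN     0x0001
#define SKETCH_CHECKPOINT_END_OF_RECOVERY 0x0002
#define SKETCH_CHECKPOINT_IMMEDIATE       0x0004
#define SKETCH_CHECKPOINT_FORCE           0x0008
#define SKETCH_CHECKPOINT_FLUSH_ALL       0x0010
#define SKETCH_CHECKPOINT_WAIT            0x0020
#define SKETCH_CHECKPOINT_CAUSE_XLOG      0x0040
#define SKETCH_CHECKPOINT_CAUSE_TIME      0x0080

struct sketch_flag_name
{
    int         bit;
    const char *name;
};

/* Write a space-separated list of the flag names set in 'flags' into buf. */
static void describe_checkpoint_flags(int flags, char *buf, size_t buflen)
{
    static const struct sketch_flag_name names[] = {
        {SKETCH_CHECKPOINT_IS_SHUTDOWN, "shutdown"},
        {SKETCH_CHECKPOINT_END_OF_RECOVERY, "end-of-recovery"},
        {SKETCH_CHECKPOINT_IMMEDIATE, "immediate"},
        {SKETCH_CHECKPOINT_FORCE, "force"},
        {SKETCH_CHECKPOINT_FLUSH_ALL, "flush-all"},
        {SKETCH_CHECKPOINT_WAIT, "wait"},
        {SKETCH_CHECKPOINT_CAUSE_XLOG, "xlog"},
        {SKETCH_CHECKPOINT_CAUSE_TIME, "time"},
    };
    size_t      i;

    assert(buf != NULL && buflen > 0);
    buf[0] = '\0';
    for (i = 0; i < sizeof names / sizeof names[0]; i++)
    {
        size_t used;

        if ((flags & names[i].bit) == 0)
            continue;
        used = strlen(buf);     /* append after what is already there */
        snprintf(buf + used, buflen - used, "%s%s",
                 used > 0 ? " " : "", names[i].name);
    }
}
```

Under these assumed values, 44 = 0x04 + 0x08 + 0x20, so the helper yields "immediate force wait", which is consistent with a checkpoint requested through the CHECKPOINT command.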

Enter the critical section.

(gdb) n
8950        START_CRIT_SECTION();
(gdb) 

Now the checkpoint record can be inserted into XLOG.

(gdb) 
8955        XLogBeginInsert();
(gdb) n
8956        XLogRegisterData((char *) (&checkPoint), sizeof(checkPoint));
(gdb) 
8957        recptr = XLogInsert(RM_XLOG_ID,
(gdb) 
8961        XLogFlush(recptr);
(gdb) 
8970        if (shutdown)
(gdb) 

Update the control file (pg_control). First, remember the previous checkpoint's REDO pointer for UpdateCheckPointDistanceEstimate().

(gdb) 
8982        if (shutdown && checkPoint.redo != ProcLastRecPtr)
(gdb) 
8990        PriorRedoPtr = ControlFile->checkPointCopy.redo;
(gdb) 
8995        LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
(gdb) p ControlFile->checkPointCopy.redo
$20 = 5521450856
(gdb) n
8996        if (shutdown)
(gdb) 
8998        ControlFile->checkPoint = ProcLastRecPtr;
(gdb) 
8999        ControlFile->checkPointCopy = checkPoint;
(gdb) 
9000        ControlFile->time = (pg_time_t) time(NULL);
(gdb) 
9002        ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
(gdb) 
9003        ControlFile->minRecoveryPointTLI = 0;
(gdb) 
9010        SpinLockAcquire(&XLogCtl->ulsn_lck);
(gdb) 
9011        ControlFile->unloggedLSN = XLogCtl->unloggedLSN;
(gdb) 
9012        SpinLockRelease(&XLogCtl->ulsn_lck);
(gdb) 
9014        UpdateControlFile();
(gdb) 
9015        LWLockRelease(ControlFileLock);
(gdb) 
9018        SpinLockAcquire(&XLogCtl->info_lck);
(gdb) p *ControlFile
$21 = {system_identifier = 6624362124887945794, pg_control_version = 1100, catalog_version_no = 201809051, 
  state = DB_IN_PRODUCTION, time = 1546934255, checkPoint = 5521451392, checkPointCopy = {redo = 5521451392, 
    ThisTimeLineID = 1, PrevTimeLineID = 1, fullPageWrites = true, nextXidEpoch = 0, nextXid = 2308, nextOid = 50764, 
    nextMulti = 1, nextMultiOffset = 0, oldestXid = 561, oldestXidDB = 16400, oldestMulti = 1, oldestMultiDB = 16402, 
    time = 1546933255, oldestCommitTsXid = 0, newestCommitTsXid = 0, oldestActiveXid = 0}, unloggedLSN = 1, 
  minRecoveryPoint = 0, minRecoveryPointTLI = 0, backupStartPoint = 0, backupEndPoint = 0, backupEndRequired = false, 
  wal_level = 0, wal_log_hints = false, MaxConnections = 100, max_worker_processes = 8, max_prepared_xacts = 0, 
  max_locks_per_xact = 64, track_commit_timestamp = false, maxAlign = 8, floatFormat = 1234567, blcksz = 8192, 
  relseg_size = 131072, xlog_blcksz = 8192, xlog_seg_size = 16777216, nameDataLen = 64, indexMaxKeys = 32, 
  toast_max_chunk_size = 1996, loblksize = 2048, float4ByVal = true, float8ByVal = true, data_checksum_version = 0, 
  mock_authentication_nonce = "\220\277\067Vg\003\205\232U{\177 h\216\271D\266\063[\\=6\365S\tA\353\361ߧw\301", 
  crc = 930305687}
(gdb) 

Update the shared-memory copy of the checkpoint XID/epoch, leave the critical section, and let smgr do its post-checkpoint cleanup (such as deleting old files).

(gdb) n
9019        XLogCtl->ckptXidEpoch = checkPoint.nextXidEpoch;
(gdb) 
9020        XLogCtl->ckptXid = checkPoint.nextXid;
(gdb) 
9021        SpinLockRelease(&XLogCtl->info_lck);
(gdb) 
9027        END_CRIT_SECTION();
(gdb) 
9032        smgrpostckpt();
(gdb) 

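Update the checkpoint distance estimate, then delete old log files that are no longer needed as of the latest checkpoint, so that the disk holding the xlog does not fill up.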

(gdb) n
9038        if (PriorRedoPtr != InvalidXLogRecPtr)
(gdb) p PriorRedoPtr
$23 = 5521450856
(gdb) n
9039            UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
(gdb) 
9045        XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
(gdb) 
9046        KeepLogSeg(recptr, &_logSegNo);
(gdb) p RedoRecPtr
$24 = 5521451392
(gdb) p _logSegNo
$25 = 329
(gdb) p wal_segment_size
$26 = 16777216
(gdb) n
9047        _logSegNo--;
(gdb) 
9048        RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
(gdb) 
9054        if (!shutdown)
(gdb) p recptr
$27 = 5521451504
(gdb) 
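
The numbers in this step can be reproduced by hand. The sketch below computes the distance between the two REDO pointers, the last segment that becomes a candidate for removal or recycling, and the smoothed distance estimate. The helper names are hypothetical. last_removable_segno assumes KeepLogSeg did not lower _logSegNo (the session does not show its effect), and update_distance_estimate follows the rule as I recall it from UpdateCheckPointDistanceEstimate in xlog.c, so treat it as an approximation to verify against the source.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of WAL between the previous REDO pointer and the new one. */
static uint64_t checkpoint_distance(uint64_t redo, uint64_t prior_redo)
{
    assert(redo >= prior_redo);
    return redo - prior_redo;
}

/*
 * Highest segment number older than the segment holding the REDO pointer,
 * i.e. the value of _logSegNo after "_logSegNo--".
 */
static uint64_t last_removable_segno(uint64_t redo, uint64_t seg_size)
{
    uint64_t segno;

    assert(seg_size > 0);
    segno = redo / seg_size;
    assert(segno > 0);          /* nothing older than segment 0 exists */
    return segno - 1;
}

/*
 * Assumed smoothing rule: jump up immediately when more WAL was written
 * than estimated, otherwise decay slowly towards the new value.
 */
static double update_distance_estimate(double estimate, uint64_t nbytes)
{
    if (estimate < (double) nbytes)
        return (double) nbytes;
    return 0.90 * estimate + 0.10 * (double) nbytes;
}
```

For this session the distance is 5521451392 - 5521450856 = 536 bytes, and the REDO pointer lies in segment 329, so segments up to and including 328 are the candidates passed to RemoveOldXlogFiles.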

Carry out the remaining cleanup work.

(gdb) n
9055            PreallocXlogFiles(recptr);
(gdb) 
9064        if (!RecoveryInProgress())
(gdb) 
9065            TruncateSUBTRANS(GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT));
(gdb) 
9068        LogCheckpointEnd(false);
(gdb) 
9070        TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
(gdb) 
9076        LWLockRelease(CheckpointLock);
(gdb) 
9077    }
(gdb) 

The call is complete.

(gdb) 
CheckpointerMain () at checkpointer.c:488
488                 ckpt_performed = true;
(gdb) 

DONE!

4. References

checkpointer.c
