MySQL · Engine Features · The InnoDB Crash Recovery Process

1 Preface

I previously organized notes on the InnoDB redo log and undo log; this article walks through the main flow of InnoDB crash recovery. Chapter 11 of 《MySQL運維內參》 covers this interleaved with its redo log and undo log material. Overall, the taobao.mysql monthly reports give the most complete treatment, so this article mainly follows taobao.mysql.

  The Crash Recovery flow

    innobase_init                                                  source: innobase/handler/ha_innodb.cc
    -->innobase_start_or_create_for_mysql                          source: innobase/srv/srv0start.cc

The code analysis in this article is based on MySQL 5.7.18. The entry point is innobase_start_or_create_for_mysql, a very long function; this article only touches the code related to crash recovery. The call stack above can be compared against the flow in the book to aid understanding, because many feature points are involved and it is easy to get lost.

2 Initializing Crash Recovery

First, the in-memory objects required for crash recovery are initialized:

        recv_sys_create();
        recv_sys_init(buf_pool_get_curr_size());

When InnoDB shuts down normally, after flushing the redo log and dirty pages it performs a fully synchronous checkpoint and writes the checkpoint LSN into the first page of ibdata (fil_write_flushed_lsn).

When the instance restarts, it opens the system tablespace ibdata and reads the LSN stored there:

        err = srv_sys_space.open_or_create(
                false, &sum_of_new_sizes, &flushed_lsn);

The call above stores the LSN read from ibdata into the variable flushed_lsn, which represents the checkpoint at the last shutdown and is used later during crash recovery. This step also loads the pages stored in the doublewrite buffer into memory (buf_dblwr_init_or_load_pages); if the first page of ibdata is corrupt, it is restored from the dblwr.
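
For intuition, here is a minimal standalone sketch (not InnoDB source) of what reading this LSN amounts to, assuming the MySQL 5.7 on-disk layout: an 8-byte big-endian LSN at offset 26 (FIL_PAGE_FILE_FLUSH_LSN) in page 0 of ibdata1, with a 16KB default page size:

#include <cstdint>
#include <cstdio>

int main()
{
	const int FIL_PAGE_FILE_FLUSH_LSN = 26;	/* assumed header offset */
	unsigned char page[16384];		/* assumed 16KB page size */

	FILE* f = std::fopen("ibdata1", "rb");
	if (f == NULL
	    || std::fread(page, 1, sizeof(page), f) != sizeof(page)) {
		std::perror("read ibdata1");
		return(1);
	}
	std::fclose(f);

	/* mach_read_from_8 semantics: 8 bytes, big-endian */
	uint64_t flushed_lsn = 0;
	for (int i = 0; i < 8; i++) {
		flushed_lsn = (flushed_lsn << 8)
			| page[FIL_PAGE_FILE_FLUSH_LSN + i];
	}
	std::printf("flushed_lsn = %llu\n",
		    (unsigned long long) flushed_lsn);
	return(0);
}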

3 Entering the Redo Crash Recovery Logic

Entry function:

err = recv_recovery_from_checkpoint_start(flushed_lsn);

The parameter flushed_lsn passed in is the LSN read from the first page of ibdata. The function mainly performs the following steps:

Step 1: create a red-black tree for each buffer pool instance, pointed to by buffer_pool_t::flush_rbt, used mainly to speed up insertions into the flush list (buf_flush_init_flush_rbt).

Step 2: read the CHECKPOINT LSN stored in the header of the first redo log file, locate the corresponding position in the redo log files by that LSN, and start scanning from that checkpoint. The code below is the call to recv_group_scan_log_recs that follows recv_find_max_checkpoint:

/* Look for MLOG_CHECKPOINT. */
	recv_group_scan_log_recs(group, &contiguous_lsn, false);
	/* The first scan should not have stored or applied any records. */
	ut_ad(recv_sys->n_addrs == 0);
	ut_ad(!recv_sys->found_corrupt_fs);

	if (recv_sys->found_corrupt_log && !srv_force_recovery) {
		log_mutex_exit();
		return(DB_ERROR);
	}

	if (recv_sys->mlog_checkpoint_lsn == 0) {
		if (!srv_read_only_mode
		    && group->scanned_lsn != checkpoint_lsn) {
			ib::error() << "Ignoring the redo log due to missing"
				" MLOG_CHECKPOINT between the checkpoint "
				<< checkpoint_lsn << " and the end "
				<< group->scanned_lsn << ".";
			if (srv_force_recovery < SRV_FORCE_NO_LOG_REDO) {
				log_mutex_exit();
				return(DB_ERROR);
			}
		}

		group->scanned_lsn = checkpoint_lsn;
		rescan = false;
	} else {
		contiguous_lsn = checkpoint_lsn;
		rescan = recv_group_scan_log_recs(
			group, &contiguous_lsn, false);

		if ((recv_sys->found_corrupt_log && !srv_force_recovery)
		    || recv_sys->found_corrupt_fs) {
			log_mutex_exit();
			return(DB_ERROR);
		}
	}

	/* NOTE: we always do a 'recovery' at startup, but only if
	there is something wrong we will print a message to the
	user about recovery: */

	if (checkpoint_lsn != flush_lsn) {

		if (checkpoint_lsn + SIZE_OF_MLOG_CHECKPOINT < flush_lsn) {
			ib::warn() << " Are you sure you are using the"
				" right ib_logfiles to start up the database?"
				" Log sequence number in the ib_logfiles is "
				<< checkpoint_lsn << ", less than the"
				" log sequence number in the first system"
				" tablespace file header, " << flush_lsn << ".";
		}

		if (!recv_needed_recovery) {

			ib::info() << "The log sequence number " << flush_lsn
				<< " in the system tablespace does not match"
				" the log sequence number " << checkpoint_lsn
				<< " in the ib_logfiles!";

			if (srv_read_only_mode) {
				ib::error() << "Can't initiate database"
					" recovery, running in read-only-mode.";
				log_mutex_exit();
				return(DB_READ_ONLY);
			}

			recv_init_crash_recovery();
		}
	}

	log_sys->lsn = recv_sys->recovered_lsn;

	if (recv_needed_recovery) {
		err = recv_init_crash_recovery_spaces();

		if (err != DB_SUCCESS) {
			log_mutex_exit();
			return(err);
		}

		if (rescan) {
			contiguous_lsn = checkpoint_lsn;
			recv_group_scan_log_recs(group, &contiguous_lsn, true);

			if ((recv_sys->found_corrupt_log
			     && !srv_force_recovery)
			    || recv_sys->found_corrupt_fs) {
				log_mutex_exit();
				return(DB_ERROR);
			}
		}
	} else {
		ut_ad(!rescan || recv_sys->n_addrs == 0);
	}

recv_group_scan_log_recs is called up to three times here to scan the redo log files:

1. The first scan looks for the MLOG_CHECKPOINT record

An MLOG_CHECKPOINT record stores a checkpoint LSN. When the LSN recorded in this record matches the CHECKPOINT LSN in the log header, the matching MLOG_CHECKPOINT has been found, and the scanned LSN is stored in recv_sys->mlog_checkpoint_lsn. (Version 5.6 does not perform this scan.)

MLOG_CHECKPOINT was introduced in WL#7142 to simplify InnoDB's crash recovery logic. According to WL#7142, it brings several improvements:

  1. Crash recovery no longer needs to read the first page of every ibd file to confirm its space id;
  2. There is no need to check $datadir/*.isl; the new log record types carry the full file path, eliminating problems caused by inconsistencies between isl files and the actual ibd directories;
  3. ibd files not yet imported into InnoDB are ignored automatically (for example, after a crash during IMPORT TABLESPACE);
  4. A new log type, MLOG_FILE_DELETE, tracks deletions of ibd files.

2. The second scan starts again from the checkpoint and stores the parsed log records

A parsed log record is stored in an object of type recv_t, which holds the log type, length, data, and start and end LSN. These objects are kept in a hash table whose key is computed from the space id and page no; changes to the same page are chained together as list nodes. The structure looks roughly like this:
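
A simplified sketch of the structures involved; the field names follow log0recv.h, but these are illustrative excerpts rather than the full definitions:

struct recv_data_t;			/* one chunk of a record body */

struct recv_t {				/* one parsed log record */
	mlog_id_t	type;		/* log record type */
	ulint		len;		/* body length */
	recv_data_t*	data;		/* chunked record body */
	lsn_t		start_lsn;	/* start LSN of the mtr */
	lsn_t		end_lsn;	/* end LSN of the mtr */
	UT_LIST_NODE_T(recv_t) rec_list;/* links records of the same page */
};

struct recv_addr_t {			/* one (space id, page no) entry */
	ulint		space;		/* tablespace id */
	ulint		page_no;	/* page number */
	ulint		state;		/* e.g. RECV_NOT_PROCESSED */
	UT_LIST_BASE_NODE_T(recv_t) rec_list;
					/* records for this page, in LSN order */
	hash_node_t	addr_hash;	/* chain in recv_sys->addr_hash,
					keyed by recv_fold(space, page_no) */
};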

While scanning, recv_spaces is built from MLOG_FILE_NAME and MLOG_FILE_DELETE redo records, mapping each space id to its file information (fil_name_parse –> fil_name_process); these files may need crash recovery. (The first scan actually also inserts into recv_spaces, but only up to the MLOG_CHECKPOINT record.)

Tips: the first time a table's data is modified after a checkpoint, an MLOG_FILE_NAME record is always written first; this record type makes it possible to track the tablespaces modified since the last checkpoint and avoid opening every table. During the second scan, whenever a tablespace is about to be modified, InnoDB checks that it is present in recv_spaces; if it is not, a severe error is assumed and startup is refused (recv_parse_or_apply_log_rec_body).

By default, the redo log is read into log_sys->buf in 64KB batches (RECV_SCAN_SIZE), and recv_scan_log_recs is called to process the log blocks. It validates each block: whether it was completely written and whether the block checksum is correct. It also checks some flag bits:

  • Each redo write sets the flush bit in the header of the first block written, marking the start of one write; when rescanning the log at restart, the flush bit is used to advance the contiguous LSN point;
  • Each redo write also stamps every block with the current checkpoint no (incremented on every checkpoint); since the log files are used cyclically, the checkpoint no is used to detect stale redo blocks.
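
For reference, one 512-byte log block is laid out as follows (offsets as defined in log0log.h; the flush bit mentioned above is the top bit of the block-number field):

/* Layout of one OS_FILE_LOG_BLOCK_SIZE (512 byte) redo log block:

   offset  size  field
   0       4     LOG_BLOCK_HDR_NO          block number; the top bit
                                           (LOG_BLOCK_FLUSH_BIT_MASK,
                                           0x80000000UL) is the flush bit
   4       2     LOG_BLOCK_HDR_DATA_LEN    bytes of log data in this block
   6       2     LOG_BLOCK_FIRST_REC_GROUP offset of the first mtr starting
                                           in this block, 0 if none
   8       4     LOG_BLOCK_CHECKPOINT_NO   low 32 bits of the checkpoint no
   12      496   log record data           (LOG_BLOCK_HDR_SIZE = 12)
   508     4     LOG_BLOCK_CHECKSUM        checksum of the block contents */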

Valid log blocks are copied into the buffer recv_sys->buf, and recv_parse_log_recs is called to parse the log records. Each record is handled according to its log type, and apply may be attempted. The call stack is:

recv_parse_log_recs
    --> recv_parse_log_rec
        --> recv_parse_or_apply_log_rec_body

To understand how InnoDB performs crash recovery for the different log types, it is well worth reading recv_parse_or_apply_log_rec_body closely; it is the entry point for redo log apply.

For example, if the parsed record type is MLOG_UNDO_HDR_CREATE, the transaction ID is extracted from the record and the undo log header is rebuilt for it (trx_undo_parse_page_header). For an insert record (MLOG_REC_INSERT or MLOG_COMP_REC_INSERT), the index information (mlog_parse_index) and record information (page_cur_parse_insert_rec) are parsed out. For an in-place update record (MLOG_REC_UPDATE_IN_PLACE), btr_cur_parse_update_in_place is called. During the second scan, only MLOG_FILE_* records are applied, being recorded into recv_spaces; all other record types are stored into the hash object after parsing.

3. If the hash space is insufficient during the second scan and not everything can be stored, a third scan is launched: the hash is emptied and scanning restarts from the checkpoint

The maximum space for the hash objects is generally the buffer pool size minus 512 pages' worth of frames (in the code below: UNIV_PAGE_SIZE * (buf_pool_get_n_pages() - recv_n_pool_free_frames * srv_buf_pool_instances)).
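
As a worked example of the available_mem formula that appears in the code below (all concrete values here are assumptions for illustration):

/* Assume UNIV_PAGE_SIZE = 16KB, a 1GB buffer pool (65536 pages),
   srv_buf_pool_instances = 2 and recv_n_pool_free_frames = 256:

   available_mem = UNIV_PAGE_SIZE
		   * (buf_pool_get_n_pages()
		      - recv_n_pool_free_frames * srv_buf_pool_instances)
		 = 16384 * (65536 - 256 * 2)
		 = 16384 * 65024	(about 0.99 GB) */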

The third scan does not try to store everything in the hash at once; as soon as the hash is about to run out of space, the accumulated redo records are applied immediately. However, if the total hash space needed is only slightly larger than the available maximum, the cost of one extra full scan is still very noticeable.

In short: the first scan finds the correct MLOG_CHECKPOINT position; the second scan parses the redo records and stores them into the hash; if the hash runs out of space, a new round starts from scratch, parsing a batch and applying a batch.

Let's look at the corresponding source:

recv_group_scan_log_recs(
	log_group_t*	group,
	lsn_t*		contiguous_lsn,
	bool		last_phase)
{
	DBUG_ENTER("recv_group_scan_log_recs");
	DBUG_ASSERT(!last_phase || recv_sys->mlog_checkpoint_lsn > 0);

	mutex_enter(&recv_sys->mutex);
	recv_sys->len = 0;
	recv_sys->recovered_offset = 0;
	recv_sys->n_addrs = 0;
	recv_sys_empty_hash();
	srv_start_lsn = *contiguous_lsn;
	recv_sys->parse_start_lsn = *contiguous_lsn;
	recv_sys->scanned_lsn = *contiguous_lsn;
	recv_sys->recovered_lsn = *contiguous_lsn;
	recv_sys->scanned_checkpoint_no = 0;
	recv_previous_parsed_rec_type = MLOG_SINGLE_REC_FLAG;
	recv_previous_parsed_rec_offset	= 0;
	recv_previous_parsed_rec_is_multi = 0;
	ut_ad(recv_max_page_lsn == 0);
	ut_ad(last_phase || !recv_writer_thread_active);
	mutex_exit(&recv_sys->mutex);

	lsn_t	checkpoint_lsn	= *contiguous_lsn;
	lsn_t	start_lsn;
	lsn_t	end_lsn;
    // store_to_hash takes a different value in each of the three phases:
    // 1. STORE_NO while MLOG_CHECKPOINT has not been found yet
    // 2. STORE_YES during the second scan
    // 3. STORE_IF_EXISTS during the third scan
	store_t	store_to_hash	= recv_sys->mlog_checkpoint_lsn == 0
		? STORE_NO : (last_phase ? STORE_IF_EXISTS : STORE_YES);
	ulint	available_mem	= UNIV_PAGE_SIZE
		* (buf_pool_get_n_pages()
		   - (recv_n_pool_free_frames * srv_buf_pool_instances));

	end_lsn = *contiguous_lsn = ut_uint64_align_down(
		*contiguous_lsn, OS_FILE_LOG_BLOCK_SIZE);

	do {
		if (last_phase && store_to_hash == STORE_NO) {
			store_to_hash = STORE_IF_EXISTS;
			/* We must not allow change buffer
			merge here, because it would generate
			redo log records before we have
			finished the redo log scan. */
			recv_apply_hashed_log_recs(FALSE);
		}

		start_lsn = end_lsn; // the next chunk starts where the previous one ended
		end_lsn += RECV_SCAN_SIZE; // RECV_SCAN_SIZE is 64KB (4 pages by default)
         // read RECV_SCAN_SIZE bytes of log from disk, starting at start_lsn
		log_group_read_log_seg(
			log_sys->buf, group, start_lsn, end_lsn);
        // parse the buffered log; once the hash table is full, apply directly.
        // recv_scan_log_recs returns true when the whole log has been scanned,
        // at which point this pass over the redo is essentially complete
	} while (!recv_scan_log_recs(
			 available_mem, &store_to_hash, log_sys->buf,
			 RECV_SCAN_SIZE,
			 checkpoint_lsn,
			 start_lsn, contiguous_lsn, &group->scanned_lsn));

	if (recv_sys->found_corrupt_log || recv_sys->found_corrupt_fs) {
		DBUG_RETURN(false);
	}

	DBUG_PRINT("ib_log", ("%s " LSN_PF
			      " completed for log group " ULINTPF,
			      last_phase ? "rescan" : "scan",
			      group->scanned_lsn, group->id));

	DBUG_RETURN(store_to_hash == STORE_NO);
}

As the code above shows, recv_group_scan_log_recs starts from the given LSN and processes the log in chunks of RECV_SCAN_SIZE (64KB) each. How InnoDB handles each chunk is implemented by recv_scan_log_recs.

recv_scan_log_recs(
/*===============*/
	ulint		available_memory,/*!< in: we let the hash table of recs
					to grow to this size, at the maximum */
	store_t*	store_to_hash,	/*!< in,out: whether the records should be
					stored to the hash table; this is reset
					if just debug checking is needed, or
					when the available_memory runs out */
	const byte*	buf,		/*!< in: buffer containing a log
					segment or garbage */
	ulint		len,		/*!< in: buffer length */
	lsn_t		checkpoint_lsn,	/*!< in: latest checkpoint LSN */
	lsn_t		start_lsn,	/*!< in: buffer start lsn */
	lsn_t*		contiguous_lsn,	/*!< in/out: it is known that all log
					groups contain contiguous log data up
					to this lsn */
	lsn_t*		group_scanned_lsn)/*!< out: scanning succeeded up to
					this lsn */
{
	const byte*	log_block	= buf;
	ulint		no;
	lsn_t		scanned_lsn	= start_lsn;
	bool		finished	= false;
	ulint		data_len;
	bool		more_data	= false;
	bool		apply		= recv_sys->mlog_checkpoint_lsn != 0;
	ulint		recv_parsing_buf_size = RECV_PARSING_BUF_SIZE;

	ut_ad(start_lsn % OS_FILE_LOG_BLOCK_SIZE == 0);
	ut_ad(len % OS_FILE_LOG_BLOCK_SIZE == 0);
	ut_ad(len >= OS_FILE_LOG_BLOCK_SIZE);

	do {
		ut_ad(!finished);
		no = log_block_get_hdr_no(log_block);
		ulint expected_no = log_block_convert_lsn_to_no(scanned_lsn);
		if (no != expected_no) {
			/* Garbage or an incompletely written log block.
			We will not report any error, because this can
			happen when InnoDB was killed while it was
			writing redo log. We simply treat this as an
			abrupt end of the redo log. */
			finished = true;
			break;
		}

		if (!log_block_checksum_is_ok(log_block)) {
			ib::error() << "Log block " << no <<
				" at lsn " << scanned_lsn << " has valid"
				" header, but checksum field contains "
				<< log_block_get_checksum(log_block)
				<< ", should be "
				<< log_block_calc_checksum(log_block);
			/* Garbage or an incompletely written log block.
			This could be the result of killing the server
			while it was writing this log block. We treat
			this as an abrupt end of the redo log. */
			finished = true;
			break;
		}

		if (log_block_get_flush_bit(log_block)) {
			/* This block was a start of a log flush operation:
			we know that the previous flush operation must have
			been completed for all log groups before this block
			can have been flushed to any of the groups. Therefore,
			we know that log data is contiguous up to scanned_lsn
			in all non-corrupt log groups. */

			if (scanned_lsn > *contiguous_lsn) {
				*contiguous_lsn = scanned_lsn;
			}
		}
         // read how much data this block holds; a log block is 512 bytes by default
		data_len = log_block_get_data_len(log_block);
        // check whether the scan has reached the end
		if (scanned_lsn + data_len > recv_sys->scanned_lsn
		    && log_block_get_checkpoint_no(log_block)
		    < recv_sys->scanned_checkpoint_no
		    && (recv_sys->scanned_checkpoint_no
			- log_block_get_checkpoint_no(log_block)
			> 0x80000000UL)) {

			/* Garbage from a log buffer flush which was made
			before the most recent database recovery */
			finished = true;
			break;
		}
         
		if (!recv_sys->parse_start_lsn
		    && (log_block_get_first_rec_group(log_block) > 0)) {

			/* We found a point from which to start the parsing
			of log records */

			recv_sys->parse_start_lsn = scanned_lsn
				+ log_block_get_first_rec_group(log_block);
			recv_sys->scanned_lsn = recv_sys->parse_start_lsn;
			recv_sys->recovered_lsn = recv_sys->parse_start_lsn;
		}

		scanned_lsn += data_len;
        // the block contains new data: process it
		if (scanned_lsn > recv_sys->scanned_lsn) {

			/* We have found more entries. If this scan is
			of startup type, we must initiate crash recovery
			environment before parsing these log records. */

#ifndef UNIV_HOTBACKUP
			if (!recv_needed_recovery) {

				if (!srv_read_only_mode) {
					ib::info() << "Log scan progressed"
						" past the checkpoint lsn "
						<< recv_sys->scanned_lsn;

					recv_init_crash_recovery();
				} else {

					ib::warn() << "Recovery skipped,"
						" --innodb-read-only set!";

					return(true);
				}
			}
#endif /* !UNIV_HOTBACKUP */

			/* We were able to find more log data: add it to the
			parsing buffer if parse_start_lsn is already
			non-zero */

			DBUG_EXECUTE_IF(
				"reduce_recv_parsing_buf",
				recv_parsing_buf_size
					= (70 * 1024);
				);

			if (recv_sys->len + 4 * OS_FILE_LOG_BLOCK_SIZE
			    >= recv_parsing_buf_size) {
				ib::error() << "Log parsing buffer overflow."
					" Recovery may have failed!";

				recv_sys->found_corrupt_log = true;

#ifndef UNIV_HOTBACKUP
				if (!srv_force_recovery) {
					ib::error()
						<< "Set innodb_force_recovery"
						" to ignore this error.";
					return(true);
				}
#endif /* !UNIV_HOTBACKUP */

			} else if (!recv_sys->found_corrupt_log) {
                // extract the real log payload of this block into the recv_sys parsing buffer
				more_data = recv_sys_add_to_parsing_buf(
					log_block, scanned_lsn);
			}
            // advance scanned_lsn to record how far we have scanned
			recv_sys->scanned_lsn = scanned_lsn;
			recv_sys->scanned_checkpoint_no
				= log_block_get_checkpoint_no(log_block);
		}
        // a block holding fewer than 512 bytes of data marks the end of the redo log
		if (data_len < OS_FILE_LOG_BLOCK_SIZE) {
			/* Log data for this group ends here */
			finished = true;
			break;
		} else {
            // not finished: move on to the next block
			log_block += OS_FILE_LOG_BLOCK_SIZE;
		}
	} while (log_block < buf + len);

	*group_scanned_lsn = scanned_lsn;

	if (recv_needed_recovery
	    || (recv_is_from_backup && !recv_is_making_a_backup)) {
		recv_scan_print_counter++;

		if (finished || (recv_scan_print_counter % 80 == 0)) {

			ib::info() << "Doing recovery: scanned up to"
				" log sequence number " << scanned_lsn;
		}
	}
    // the payload of this and earlier blocks is now in the recv_sys buffer; process it by calling recv_parse_log_recs
	if (more_data && !recv_sys->found_corrupt_log) {
		/* Try to parse more log records */

		if (recv_parse_log_recs(checkpoint_lsn,
					*store_to_hash, apply)) {
			ut_ad(recv_sys->found_corrupt_log
			      || recv_sys->found_corrupt_fs
			      || recv_sys->mlog_checkpoint_lsn
			      == recv_sys->recovered_lsn);
			return(true);
		}
         // compare used space against available memory to decide whether to keep storing into the hash table
		if (*store_to_hash != STORE_NO
		    && mem_heap_get_size(recv_sys->heap) > available_memory) {
			*store_to_hash = STORE_NO;
		}
        // after consuming some records, the buffer usually keeps an incomplete
        // trailing record; more log must be read and appended before it can be parsed
		if (recv_sys->recovered_offset > recv_parsing_buf_size / 4) {
			/* Move parsing buffer data to the buffer start */

			recv_sys_justify_left_parsing_buf();
		}
	}

	return(finished);
}

As the code above shows, InnoDB stores the log as a continuous stream chopped into blocks, each with its own header and trailer. When the log is used, it is read back block by block; the headers and trailers are stripped and the payloads are concatenated for further parsing. Next, let's see how recv_parse_log_recs processes this data.

recv_parse_log_recs(
	lsn_t		checkpoint_lsn,
	store_t		store,
	bool		apply)
{
	byte*		ptr;
	byte*		end_ptr;
	bool		single_rec;
	ulint		len;
	lsn_t		new_recovered_lsn;
	lsn_t		old_lsn;
	mlog_id_t	type;
	ulint		space;
	ulint		page_no;
	byte*		body;

	ut_ad(log_mutex_own());
	ut_ad(recv_sys->parse_start_lsn != 0);
loop:
	ptr = recv_sys->buf + recv_sys->recovered_offset;

	end_ptr = recv_sys->buf + recv_sys->len;

	if (ptr == end_ptr) {

		return(false);
	}

	switch (*ptr) {
	case MLOG_CHECKPOINT:
#ifdef UNIV_LOG_LSN_DEBUG
	case MLOG_LSN:
#endif /* UNIV_LOG_LSN_DEBUG */
	case MLOG_DUMMY_RECORD:
		single_rec = true;
		break;
	default:
		single_rec = !!(*ptr & MLOG_SINGLE_REC_FLAG);
	}

	if (single_rec) {
		/* The mtr did not modify multiple pages */

		old_lsn = recv_sys->recovered_lsn;

		/* Try to parse a log record, fetching its type, space id,
		page no, and a pointer to the body of the log record */

		len = recv_parse_log_rec(&type, ptr, end_ptr, &space,
					 &page_no, apply, &body);

		if (len == 0) {
			return(false);
		}

		if (recv_sys->found_corrupt_log) {
			recv_report_corrupt_log(
				ptr, type, space, page_no);
			return(true);
		}

		if (recv_sys->found_corrupt_fs) {
			return(true);
		}

		new_recovered_lsn = recv_calc_lsn_on_data_add(old_lsn, len);

		if (new_recovered_lsn > recv_sys->scanned_lsn) {
			/* The log record filled a log block, and we require
			that also the next log block should have been scanned
			in */

			return(false);
		}

		recv_previous_parsed_rec_type = type;
		recv_previous_parsed_rec_offset = recv_sys->recovered_offset;
		recv_previous_parsed_rec_is_multi = 0;

		recv_sys->recovered_offset += len;
		recv_sys->recovered_lsn = new_recovered_lsn;

		switch (type) {
			lsn_t	lsn;
		case MLOG_DUMMY_RECORD:
			/* Do nothing */
			break;
		case MLOG_CHECKPOINT:
#if SIZE_OF_MLOG_CHECKPOINT != 1 + 8
# error SIZE_OF_MLOG_CHECKPOINT != 1 + 8
#endif
			lsn = mach_read_from_8(ptr + 1);

			DBUG_PRINT("ib_log",
				   ("MLOG_CHECKPOINT(" LSN_PF ") %s at "
				    LSN_PF,
				    lsn,
				    lsn != checkpoint_lsn ? "ignored"
				    : recv_sys->mlog_checkpoint_lsn
				    ? "reread" : "read",
				    recv_sys->recovered_lsn));

			if (lsn == checkpoint_lsn) {
				if (recv_sys->mlog_checkpoint_lsn) {
					/* At recv_reset_logs() we may
					write a duplicate MLOG_CHECKPOINT
					for the same checkpoint LSN. Thus
					recv_sys->mlog_checkpoint_lsn
					can differ from the current LSN. */
					ut_ad(recv_sys->mlog_checkpoint_lsn
					      <= recv_sys->recovered_lsn);
					break;
				}
				recv_sys->mlog_checkpoint_lsn
					= recv_sys->recovered_lsn;
#ifndef UNIV_HOTBACKUP
				return(true);
#endif /* !UNIV_HOTBACKUP */
			}
			break;
		case MLOG_FILE_NAME:
		case MLOG_FILE_DELETE:
		case MLOG_FILE_CREATE2:
		case MLOG_FILE_RENAME2:
		case MLOG_TRUNCATE:
			/* These were already handled by
			recv_parse_log_rec() and
			recv_parse_or_apply_log_rec_body(). */
			break;
#ifdef UNIV_LOG_LSN_DEBUG
		case MLOG_LSN:
			/* Do not add these records to the hash table.
			The page number and space id fields are misused
			for something else. */
			break;
#endif /* UNIV_LOG_LSN_DEBUG */
		default:
			switch (store) {
			case STORE_NO:
				break;
			case STORE_IF_EXISTS:
				if (fil_space_get_flags(space)
				    == ULINT_UNDEFINED) {
					break;
				}
				/* fall through */
			case STORE_YES:
				recv_add_to_hash_table(
					type, space, page_no, body,
					ptr + len, old_lsn,
					recv_sys->recovered_lsn);
			}
			/* fall through */
		case MLOG_INDEX_LOAD:
			DBUG_PRINT("ib_log",
				("scan " LSN_PF ": log rec %s"
				" len " ULINTPF
				" page " ULINTPF ":" ULINTPF,
				old_lsn, get_mlog_string(type),
				len, space, page_no));
		}
	} else {
		/* Check that all the records associated with the single mtr
		are included within the buffer */

		ulint	total_len	= 0;
		ulint	n_recs		= 0;
		bool	only_mlog_file	= true;
		ulint	mlog_rec_len	= 0;

		for (;;) {
			len = recv_parse_log_rec(
				&type, ptr, end_ptr, &space, &page_no,
				false, &body);

			if (len == 0) {
				return(false);
			}

			if (recv_sys->found_corrupt_log
			    || type == MLOG_CHECKPOINT
			    || (*ptr & MLOG_SINGLE_REC_FLAG)) {
				recv_sys->found_corrupt_log = true;
				recv_report_corrupt_log(
					ptr, type, space, page_no);
				return(true);
			}

			if (recv_sys->found_corrupt_fs) {
				return(true);
			}

			recv_previous_parsed_rec_type = type;
			recv_previous_parsed_rec_offset
				= recv_sys->recovered_offset + total_len;
			recv_previous_parsed_rec_is_multi = 1;

			/* MLOG_FILE_NAME redo log records doesn't make changes
			to persistent data. If only MLOG_FILE_NAME redo
			log record exists then reset the parsing buffer pointer
			by changing recovered_lsn and recovered_offset. */
			if (type != MLOG_FILE_NAME && only_mlog_file == true) {
				only_mlog_file = false;
			}

			if (only_mlog_file) {
				new_recovered_lsn = recv_calc_lsn_on_data_add(
					recv_sys->recovered_lsn, len);
				mlog_rec_len += len;
				recv_sys->recovered_offset += len;
				recv_sys->recovered_lsn = new_recovered_lsn;
			}

			total_len += len;
			n_recs++;

			ptr += len;

			if (type == MLOG_MULTI_REC_END) {
				DBUG_PRINT("ib_log",
					   ("scan " LSN_PF
					    ": multi-log end"
					    " total_len " ULINTPF
					    " n=" ULINTPF,
					    recv_sys->recovered_lsn,
					    total_len, n_recs));
				total_len -= mlog_rec_len;
				break;
			}

			DBUG_PRINT("ib_log",
				   ("scan " LSN_PF ": multi-log rec %s"
				    " len " ULINTPF
				    " page " ULINTPF ":" ULINTPF,
				    recv_sys->recovered_lsn,
				    get_mlog_string(type), len, space, page_no));
		}

		new_recovered_lsn = recv_calc_lsn_on_data_add(
			recv_sys->recovered_lsn, total_len);

		if (new_recovered_lsn > recv_sys->scanned_lsn) {
			/* The log record filled a log block, and we require
			that also the next log block should have been scanned
			in */

			return(false);
		}

		/* Add all the records to the hash table */

		ptr = recv_sys->buf + recv_sys->recovered_offset;

		for (;;) {
			old_lsn = recv_sys->recovered_lsn;
			/* This will apply MLOG_FILE_ records. We
			had to skip them in the first scan, because we
			did not know if the mini-transaction was
			completely recovered (until MLOG_MULTI_REC_END). */
			len = recv_parse_log_rec(
				&type, ptr, end_ptr, &space, &page_no,
				apply, &body);

			if (recv_sys->found_corrupt_log
			    && !recv_report_corrupt_log(
				    ptr, type, space, page_no)) {
				return(true);
			}

			if (recv_sys->found_corrupt_fs) {
				return(true);
			}

			ut_a(len != 0);
			ut_a(!(*ptr & MLOG_SINGLE_REC_FLAG));

			recv_sys->recovered_offset += len;
			recv_sys->recovered_lsn
				= recv_calc_lsn_on_data_add(old_lsn, len);

			switch (type) {
			case MLOG_MULTI_REC_END:
				/* Found the end mark for the records */
				goto loop;
#ifdef UNIV_LOG_LSN_DEBUG
			case MLOG_LSN:
				/* Do not add these records to the hash table.
				The page number and space id fields are misused
				for something else. */
				break;
#endif /* UNIV_LOG_LSN_DEBUG */
			case MLOG_FILE_NAME:
			case MLOG_FILE_DELETE:
			case MLOG_FILE_CREATE2:
			case MLOG_FILE_RENAME2:
			case MLOG_INDEX_LOAD:
			case MLOG_TRUNCATE:
				/* These were already handled by
				recv_parse_log_rec() and
				recv_parse_or_apply_log_rec_body(). */
				break;
			default:
				switch (store) {
				case STORE_NO:
					break;
				case STORE_IF_EXISTS:
					if (fil_space_get_flags(space)
					    == ULINT_UNDEFINED) {
						break;
					}
					/* fall through */
				case STORE_YES:
					recv_add_to_hash_table(
						type, space, page_no,
						body, ptr + len,
						old_lsn,
						new_recovered_lsn);
				}
			}

			ptr += len;
		}
	}

	goto loop;
}

The main logic here distinguishes the single_rec case (special: the mtr contains only one record, e.g. page initialization or page creation) from the common case where one mtr contains multiple records that must be parsed one by one. Each record is parsed with recv_parse_log_rec(&type, ptr, end_ptr, &space, &page_no, false, &body), yielding its type, tablespace id, page number, and record body; the record is then added to the hash table via recv_add_to_hash_table.

recv_add_to_hash_table(
/*===================*/
	mlog_id_t	type,		/*!< in: log record type */
	ulint		space,		/*!< in: space id */
	ulint		page_no,	/*!< in: page number */
	byte*		body,		/*!< in: log record body */
	byte*		rec_end,	/*!< in: log record end */
	lsn_t		start_lsn,	/*!< in: start lsn of the mtr */
	lsn_t		end_lsn)	/*!< in: end lsn of the mtr */
{
	recv_t*		recv;
	ulint		len;
	recv_data_t*	recv_data;
	recv_data_t**	prev_field;
	recv_addr_t*	recv_addr;

	ut_ad(type != MLOG_FILE_DELETE);
	ut_ad(type != MLOG_FILE_CREATE2);
	ut_ad(type != MLOG_FILE_RENAME2);
	ut_ad(type != MLOG_FILE_NAME);
	ut_ad(type != MLOG_DUMMY_RECORD);
	ut_ad(type != MLOG_CHECKPOINT);
	ut_ad(type != MLOG_INDEX_LOAD);
	ut_ad(type != MLOG_TRUNCATE);

	len = rec_end - body;

	recv = static_cast<recv_t*>(
		mem_heap_alloc(recv_sys->heap, sizeof(recv_t)));

	recv->type = type;
	recv->len = rec_end - body;
	recv->start_lsn = start_lsn;
	recv->end_lsn = end_lsn;

	recv_addr = recv_get_fil_addr_struct(space, page_no);

	if (recv_addr == NULL) {
		recv_addr = static_cast<recv_addr_t*>(
			mem_heap_alloc(recv_sys->heap, sizeof(recv_addr_t)));

		recv_addr->space = space;
		recv_addr->page_no = page_no;
		recv_addr->state = RECV_NOT_PROCESSED;

		UT_LIST_INIT(recv_addr->rec_list, &recv_t::rec_list);

		HASH_INSERT(recv_addr_t, addr_hash, recv_sys->addr_hash,
			    recv_fold(space, page_no), recv_addr);
		recv_sys->n_addrs++;
#if 0
		fprintf(stderr, "Inserting log rec for space %lu, page %lu\n",
			space, page_no);
#endif
	}

	UT_LIST_ADD_LAST(recv_addr->rec_list, recv);

	prev_field = &(recv->data);

	/* Store the log record body in chunks of less than UNIV_PAGE_SIZE:
	recv_sys->heap grows into the buffer pool, and bigger chunks could not
	be allocated */

	while (rec_end > body) {

		len = rec_end - body;

		if (len > RECV_DATA_BLOCK_SIZE) {
			len = RECV_DATA_BLOCK_SIZE;
		}

		recv_data = static_cast<recv_data_t*>(
			mem_heap_alloc(recv_sys->heap,
				       sizeof(recv_data_t) + len));

		*prev_field = recv_data;

		memcpy(recv_data + 1, body, len);

		prev_field = &(recv_data->next);

		body += len;
	}

	*prev_field = NULL;
}

As the code above shows, each record is stored in a structure of type recv_t; its members and their assignments are visible in the code. A key point is recv_fold(space, page_no): the hash key is the combination of space and page_no. The code at the end copies the record body into the structure's data chunks. In other words, after splitting up the log records, InnoDB stores them in a hash table keyed by (tablespace, page number), so records for the same page always end up together, and the records for one page are chained onto the corresponding hash node in order, which guarantees that redo is applied in order.

At this point recv_group_scan_log_recs is done: the redo log has been read and parsed into the hash table. The apply function, recv_apply_hashed_log_recs, is covered below. Next, recv_init_crash_recovery_spaces is called to initialize the tablespaces involved:

  • First, it prints two log messages we know very well:

      [Note] InnoDB: Database was not shutdown normally!
      [Note] InnoDB: Starting crash recovery.
    
  • If a tablespace in recv_spaces has not been deleted and its ibd file exists, this is an ordinary file operation; the tablespace is added to the fil_system->named_spaces list (fil_names_dirty), and redo apply may later be performed on these tables;

  • For tablespaces that have already been deleted, log apply can be skipped; the entries for that space id in recv_sys->addr_hash are set to RECV_DISCARDED;

  • buf_dblwr_process() is called: for every page recorded in the doublewrite buffer, it checks whether the corresponding data file page is intact, and restores it directly from the dblwr copy if it is corrupt (see the sketch after this list);

  • Finally, a temporary background thread is created running recv_writer_thread. It works with the page cleaner threads, asking them to flush the dirty pages produced by crash recovery, until all redo records stored in recv_sys have been applied and fully released (recv_sys->heap == NULL).
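
The dblwr check can be sketched as follows; every type and helper here is a hypothetical stand-in for InnoDB's real I/O and checksum routines, not its actual API:

struct page_copy_t {
	ulint		space_id;
	ulint		page_no;
	const byte*	frame;		/* page image saved in the dblwr area */
};

/* hypothetical helpers standing in for the real routines */
bool page_is_corrupt(const byte* frame);
const byte* read_data_page(ulint space_id, ulint page_no);
void write_data_page(ulint space_id, ulint page_no, const byte* frame);

void dblwr_process_sketch(const page_copy_t* copies, ulint n_copies)
{
	for (ulint i = 0; i < n_copies; i++) {	/* up to 2 * 64 page slots */
		const byte* disk = read_data_page(
			copies[i].space_id, copies[i].page_no);

		if (page_is_corrupt(disk)
		    && !page_is_corrupt(copies[i].frame)) {
			/* the data file page was torn mid-write:
			restore it from the intact dblwr copy */
			write_data_page(copies[i].space_id,
					copies[i].page_no,
					copies[i].frame);
		}
	}
}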

With that, recv_recovery_from_checkpoint_start is finished. Next, let's look at trx_sys_init_at_db_start.

4 Initializing the Transaction Subsystem (trx_sys_init_at_db_start)


This step reads in the undo-related system pages. In crash recovery state, every page must have its redo applied before callers may use it.

When an instance recovers from a crash, the active transactions must be reconstructed from the undo logs: transactions in ACTIVE state are rolled back directly; for transactions in Prepare state, if the corresponding binlog was written, the transaction is committed, otherwise it is rolled back.

The flow is fairly simple: redo runs first (recv_recovery_from_checkpoint_start). Undo is protected by redo, so it can be recovered from redo (except temporary-table undo, which is not redo-logged).

After redo apply is complete, the data dictionary subsystem is initialized (dict_boot), and then the transaction subsystem (trx_sys_init_at_db_start); the undo segments are initialized in this step.

When initializing the undo segments (trx_sys_init_at_db_start -> trx_rseg_array_init -> ... -> trx_undo_lists_init), each slot in every rollback segment page is examined; for slots in use, the corresponding undo log is recovered, its state and type are read, in-memory structures are created, and they are placed on each rollback segment's undo list.

Once the undo in-memory objects are initialized, the pre-crash transaction lists are rebuilt from them (trx_lists_init_at_db_start): transactions doing inserts are resurrected from each rollback segment's insert_undo_list (trx_resurrect_insert), and update transactions from the update_undo_list (trx_resurrect_update); if a transaction has both insert and update undo, only one transaction object is resurrected. Besides the transaction objects, table locks and the read-write transaction list are also restored, reconstructing the transaction scene as it was before the crash.

After the transactions active before the crash have been resurrected from undo, a background thread is started to roll back and clean them up (recv_recovery_rollback_active -> trx_rollback_or_clean_all_recovered): transactions in ACTIVE state are rolled back directly; transactions that are neither ACTIVE nor PREPARED are considered committed and their objects are simply released. After this step, in principle only PREPARED transactions remain on the transaction list.
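
The decision rule can be sketched like this (simplified from trx0roll.cc; trx_rollback_active and trx_cleanup_at_db_startup are the real function names, but this is not the verbatim loop):

void rollback_or_clean_one(trx_t* trx)
{
	switch (trx->state) {
	case TRX_STATE_ACTIVE:
		/* crashed mid-transaction: roll it back */
		trx_rollback_active(trx);
		break;
	case TRX_STATE_PREPARED:
		/* leave it for binlog/XA recovery, described below */
		break;
	case TRX_STATE_COMMITTED_IN_MEMORY:
		/* already committed: just release the object */
		trx_cleanup_at_db_startup(trx);
		break;
	default:
		break;
	}
}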

Soon after, we enter the XA Recover phase. MySQL uses internal XA: binlog and InnoDB jointly perform XA recovery. After the engines are initialized, the server layer scans the last binlog file, collects the XIDs recorded in it (MYSQL_BIN_LOG::recover), and compares them with the XIDs of InnoDB's prepared transactions. If an XID is present in the binlog, the corresponding transaction is committed; otherwise it is rolled back.

The source code for this part remains to be added.

5 Applying the Redo Log (recv_apply_hashed_log_recs)

Using the records previously collected in recv_sys->addr_hash, pages are read into memory one by one, and crash recovery is performed on each page (recv_recover_page_func):

  • Records for tablespaces that have already been deleted are simply skipped;

  • When reading in the pages to be recovered, InnoDB proactively uses read-ahead (recv_read_in_area): it collects up to 32 consecutive page numbers needing recovery (RECV_READ_AHEAD_AREA) and issues asynchronous read requests. As the pages enter the buffer pool, crash recovery logic is run on them;

  • Only records whose LSN is greater than the LSN on the data page are applied; redo records for truncated tables are ignored (see the sketch after this list);

  • No new redo log is generated while recovering data pages;

  • After a page is repaired, the dirty page must be added to the buffer pool's flush list. InnoDB requires the flush list to be ordered, but during crash recovery pages are modified with LSNs taken from the redo records rather than the global LSN, so ordering cannot be guaranteed directly; InnoDB therefore maintains a separate red-black tree to preserve order, and before each insertion into the flush list it searches the tree for the proper position (buf_flush_recv_note_modification).
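
A condensed sketch of the per-page apply loop in recv_recover_page(); apply_one_rec() here is a hypothetical stand-in for recv_parse_or_apply_log_rec_body(), and this is not the verbatim function:

void apply_recs_for_page(recv_addr_t* recv_addr, buf_block_t* block)
{
	/* the LSN already reflected in the page image (FIL_PAGE_LSN = 16) */
	lsn_t	page_lsn = mach_read_from_8(block->frame + FIL_PAGE_LSN);

	for (recv_t* recv = UT_LIST_GET_FIRST(recv_addr->rec_list);
	     recv != NULL;
	     recv = UT_LIST_GET_NEXT(rec_list, recv)) {

		if (recv->start_lsn < page_lsn) {
			/* the page image is already newer: skip this rec */
			continue;
		}
		/* replay the change; no new redo is generated here */
		apply_one_rec(recv, block);
		page_lsn = recv->end_lsn;
	}

	recv_addr->state = RECV_PROCESSED;
	/* the dirty page then enters the flush list in LSN order,
	positioned via the flush_rbt (buf_flush_recv_note_modification) */
}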

6 Completing Crash Recovery (recv_recovery_from_checkpoint_finish)

Once all redo records have been applied, basic crash recovery is complete. Resources can now be released: wait for the recv writer thread to exit (the dirty pages produced by crash recovery have been flushed), free the red-black tree, and roll back all non-prepared transactions created by data dictionary operations (trx_rollback_or_clean_recovered).

7 Cleaning Up Invalid Data and Rolling Back Transactions

recv_recovery_rollback_active is called to do the following:

  • Drop temporarily created indexes, e.g. temporary indexes left behind by a crash during index-creation DDL (row_merge_drop_temp_indexes());
  • Clean up InnoDB temporary tables (row_mysql_drop_temp_tables());
  • Drop invalid auxiliary tables of full-text indexes (fts_drop_orphaned_tables());
  • Create a background thread running trx_rollback_or_clean_all_recovered; unlike the call in recv_recovery_from_checkpoint_finish, this background thread rolls back all transactions not in prepare state.

With this, InnoDB-level crash recovery is essentially done; only the transactions in prepare state remain, and handling them requires joint crash recovery with the server layer's binlog.


8 Binlog/InnoDB XA Recover

Back at the server layer, after all storage engines are initialized, if binlog is enabled we can perform XA recovery from the binlog:

  • First, the last binlog file is scanned to find all XID events, and the XIDs are recorded into a hash structure (MYSQL_BIN_LOG::recover);
  • Then, the interface function xarecover_handlerton is called for every engine to obtain the XIDs of the transactions in prepare state inside that engine; if an XID exists in the binlog, the transaction is committed, otherwise it is rolled back.
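
In pseudo-C++ the decision rule looks roughly like this; xid_in_binlog stands in for the hash built by MYSQL_BIN_LOG::recover, and the handlerton callbacks are used in the spirit of xarecover_handlerton (a hedged sketch, not the server's actual code):

#include <set>

void xa_recover_one_engine(handlerton* ht,
			   const std::set<my_xid>& xid_in_binlog)
{
	XID	xid_list[128];
	/* ask the engine for transactions still in PREPARED state */
	int	n = ht->recover(ht, xid_list, 128);

	for (int i = 0; i < n; i++) {
		my_xid	x = xid_list[i].get_my_xid();
		if (xid_in_binlog.count(x) != 0) {
			/* the binlog contains this XID: commit */
			ht->commit_by_xid(ht, &xid_list[i]);
		} else {
			/* the binlog lost this XID: roll back */
			ht->rollback_by_xid(ht, &xid_list[i]);
		}
	}
}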

Clearly, if we weaken the durability settings (innodb_flush_log_at_trx_commit != 1 or sync_binlog != 1), a crash can lose data in two ways:

  1. The engine committed but the binlog was not written: the replica loses the transaction;
  2. The engine did not prepare but the binlog was written: the primary loses the transaction.

Even with innodb_flush_log_at_trx_commit = 1 and sync_binlog = 1, one situation remains: when the primary crashes, some binlog may not yet have reached the replica. If we promote the replica directly, the primary and replica still diverge, and the old primary must be rebuilt from the new one to return to a consistent state. For this scenario, enabling semisync can help; one feasible plan is as follows:

  1. Use the fully durable double-1 settings;
  2. Set the semisync timeout to a very large value and use semisync AFTER_SYNC mode, i.e. the user thread waits for the replica's ACK after writing the binlog but before the engine commits;
  3. With the configuration of steps 1 and 2, all transactions that the crashed primary has but the replica lacks are guaranteed to be in prepare state;
  4. After the replica has fully applied its logs, record the relay log position it reached, then promote it to the new primary;
  5. Truncate the old primary's last binlog at the position recorded in step 4;
  6. Restart the old primary: transactions already shipped to the replica will be committed, and binlog events never shipped will be rolled back.

There is a lot of material here; it deserves further study.

References:

http://mysql.taobao.org/monthly/2015/06/01/

http://mysql.taobao.org/monthly/2015/04/01/
