How flashcache uses the device mapper mechanism

Device Mapper (DM) is the block-device framework introduced across the board in Linux 2.6. Through DM, all real or virtual block devices in the system can be managed flexibly.
DM registers itself with the Linux kernel as a block device. Any block device attached to (or, rather, "mapped" under) the DM framework, no matter how it is organized internally or how it communicates, appears to Linux as one complete DM block device. DM therefore gives differently organized block devices, or groups of block devices, a single, uniform DM representation in front of the Linux kernel.

1. Two terms to keep apart: DM and MD
In the Linux kernel sources (this article uses the 2.6.32 kernel as its reference), DM refers to the overall Device Mapper framework, while MD (Mapped Device) refers to the various devices that the framework virtualizes. In short, DM is the larger architecture in which MDs of different kinds are wired, through specific relationships, into the block-device layer.

The relevant code lives in the drivers/md/ directory of the kernel source tree. Its files fall into two groups: those that implement the core device mapper framework in the kernel, and the target driver plug-in files that implement the actual mapping work.


2. A few important concepts
Device mapper is registered in the kernel as a block device driver. It involves three important object concepts: the mapped device, the mapping table, and the target device.
A mapped device is a logical abstraction. It can be thought of as the logical device that the kernel exposes to the outside; it is mapped onto target devices through the relationships described in its mapping table.
The mapping from a mapped device to one target device is represented by a tuple consisting of the starting address and length of the region in the mapped device's logical address space, the target type, and the offset into the physical device backing the target device (all addresses and offsets are in units of disk sectors, i.e. 512 bytes). For example, a linear table entry such as "0 204800 linear /dev/sdb 100" maps sectors 0 through 204799 of the logical device onto /dev/sdb starting at sector 100.
A target device represents the physical span that the mapped device maps onto; from the point of view of the logical device represented by the mapped device, it is one of the physical devices that the logical device is mapped to.

In device mapper these three objects, together with the target driver plug-ins, form a device tree that can be stacked recursively. The root of the tree is the mapped device that is ultimately exposed as the logical device; the leaves are the underlying physical devices represented by target devices. The smallest tree consists of a single mapped device and a single target device. Each target device is owned exclusively by one mapped device and cannot be shared. A mapped device can map onto one or more target devices, and a mapped device can in turn serve as the target device of a higher-level mapped device, so in theory the stacking can be nested indefinitely within the device mapper framework, as shown in the figure below:


In the figure above, mapped device 1 is mapped, through its mapping table, onto the three target devices a, b and c; target device a is itself derived from mapped device 2, which is in turn mapped through its own table onto target device d, and target device d may again be derived from further mappings.

Let us look at how these three objects are actually implemented in the code. The mapped_device structure defined in dm.c represents a mapped device; its main fields include the locks associated with the mapped device, the registered request queue, several memory pools, and a pointer to the mapping table it is bound to.

struct mapped_device{
	struct rw_semaphore io_lock;
	struct mutex suspend_lock;
	rwlock_t map_lock;
	atomic_t holders;
	atomic_t open_count;

	unsigned long flags;

	struct request_queue *queue;
	struct gendisk *disk;
	char name[16];

	void *interface_ptr;

	/*
	 * A list of ios that arrived while we were suspended.
	 */
	atomic_t pending[2];
	wait_queue_head_t wait;
	struct work_struct work;
	struct bio_list deferred;
	spinlock_t deferred_lock;

	/*
	 * An error from the barrier request currently being processed.
	 */
	int barrier_error;

	/*
	 * Processing queue (flush/barriers)
	 */
	struct workqueue_struct *wq;

	/*
	 * The current mapping.
	 */
	struct dm_table *map;

	/*
	 * io objects are allocated from here.
	 */
	mempool_t *io_pool;
	mempool_t *tio_pool;

	struct bio_set *bs;

	/*
	 * Event handling.
	 */
	atomic_t event_nr;
	wait_queue_head_t eventq;
	atomic_t uevent_seq;
	struct list_head uevent_list;
	spinlock_t uevent_lock; /* Protect access to uevent_list */

	/*
	 * freeze/thaw support require holding onto a super block
	 */
	struct super_block *frozen_sb;
	struct block_device *bdev;

	/* forced geometry settings */
	struct hd_geometry geometry;

	/* marker of flush suspend for request-based dm */
	struct request suspend_rq;

	/* For saving the address of __make_request for request based dm */
	make_request_fn *saved_make_request_fn;

	/* sysfs handle */
	struct kobject kobj;

	/* zero-length barrier that will be cloned and submitted to targets */
	struct bio barrier_bio;
};

The mapping table of a mapped device is represented by the dm_table structure defined in dm-table.c. It contains an array of dm_target structures, and within dm_table these dm_targets are organized as a B-tree so that lookups during I/O request mapping are fast.

struct dm_table {
	struct mapped_device *md;
	atomic_t holders;
	unsigned type;

	/* btree table */
	unsigned int depth;
	unsigned int counts[MAX_DEPTH];	/* in nodes */
	sector_t *index[MAX_DEPTH];

	unsigned int num_targets;
	unsigned int num_allocated;
	sector_t *highs;
	struct dm_target *targets;

	/*
	 * Indicates the rw permissions for the new logical
	 * device.  This should be a combination of FMODE_READ
	 * and FMODE_WRITE.
	 */
	fmode_t mode;

	/* a list of devices used by this table */
	struct list_head devices;

	/* events get handed up using this callback */
	void (*event_fn)(void *);
	void *event_context;

	struct dm_md_mempools *mempools;
};
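The depth, counts, index and highs fields above implement that B-tree: highs holds the last sector covered by each target, and index points at the key arrays of each tree level, so that at I/O time the target covering a given sector is found in a handful of comparisons. The following is a lightly simplified sketch of how the lookup in dm-table.c works (the node-size constants and helpers are paraphrased from memory of the 2.6.32 source, so treat them as approximate rather than authoritative):

/* Sketch of the sector-to-target lookup over the dm_table B-tree.
 * Each node holds KEYS_PER_NODE sorted "high" sector keys; index[l]
 * points at the array of nodes for level l, and the leaf level maps
 * straight onto the targets[] array. */
#define NODE_SIZE		L1_CACHE_BYTES
#define KEYS_PER_NODE		(NODE_SIZE / sizeof(sector_t))
#define CHILDREN_PER_NODE	(KEYS_PER_NODE + 1)

static inline unsigned int get_child(unsigned int n, unsigned int k)
{
	return (n * CHILDREN_PER_NODE) + k;	/* index of the k-th child of node n */
}

static inline sector_t *get_node(struct dm_table *t,
				 unsigned int l, unsigned int n)
{
	return t->index[l] + (n * KEYS_PER_NODE);	/* keys of node n on level l */
}

/* Find the target that the given sector of the mapped device falls into. */
struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector)
{
	unsigned int l, n = 0, k = 0;
	sector_t *node;

	for (l = 0; l < t->depth; l++) {
		n = get_child(n, k);
		node = get_node(t, l, n);

		for (k = 0; k < KEYS_PER_NODE; k++)
			if (node[k] >= sector)
				break;	/* descend towards the child covering 'sector' */
	}

	return &t->targets[(KEYS_PER_NODE * n) + k];
}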

The dm_target structure describes the mapping from a mapped_device to one of its target devices. It records the start address and length of the region of the mapped device's logical space that this target device covers, and it holds a pointer to the target_type structure that provides the operations specific to this kind of target device.

struct dm_target {
	struct dm_table *table;
	struct target_type *type;

	/* target limits */
	sector_t begin;
	sector_t len;

	/* Always a power of 2 */
	sector_t split_io;

	/*
	 * A number of zero-length barrier requests that will be submitted
	 * to the target for the purpose of flushing cache.
	 *
	 * The request number will be placed in union map_info->flush_request.
	 * It is a responsibility of the target driver to remap these requests
	 * to the real underlying devices.
	 */
	unsigned num_flush_requests;

	/* target specific data */
	void *private;

	/* Used to provide an error string from the ctr */
	char *error;
};

The target_type structure mainly holds the name of the target driver plug-in for this kind of target device, the constructor and destructor for targets of this type, and the methods for remapping I/O requests and completing I/O on such targets. The field that represents the concrete target device is the private field of dm_target: that pointer refers to the structure describing the specific target device the mapped device is mapped onto.
The concrete structure representing a target device differs from one target type to another. For the flashcache target type, the target device is described by struct cache_c, shown below:

struct cache_c {
	struct dm_target	*tgt;
	/*
	dm_target describes one device: a block device that is mapped in as
	one segment of the mapped_device. It is the basic building block of
	a mapped device.
	*/
	struct dm_dev 		*disk_dev;   /* Source device */
	struct dm_dev 		*cache_dev; /* Cache device */

	int 			on_ssd_version;
	
	spinlock_t		cache_spin_lock;	//lock protecting shared state, keeps data consistent under concurrent access

	struct cacheblock	*cache;	
	/* 
	Hash table for cache blocks 
	cacheblock is the in-memory record of the cache state; every block on the SSD has a corresponding cacheblock
	*/
	struct cache_set	*cache_sets;	//one cache_set per set on the SSD
	struct cache_md_block_head *md_blocks_buf;	//needed when updating the metadata on the SSD

	unsigned int md_block_size;	
	/* 
	Metadata block size in sectors 
	i.e. how many sectors one metadata block occupies
	*/
	
	sector_t size;			/* Cache size: number of blocks in the cache */
	unsigned int assoc;		/* Cache associativity: blocks per set, 512 by default */
	unsigned int block_size;	/* Cache block size: number of sectors per block */
	unsigned int block_shift;	/* Cache block size in bits */
	unsigned int block_mask;	/* Cache block mask */
	unsigned int assoc_shift;	/* Consecutive blocks size in bits */
	unsigned int num_sets;		/* Number of cache sets */
	
	int	cache_mode;	//write back, write through or write around

	wait_queue_head_t destroyq;	/* Wait queue for I/O completion */

	/*
	wait_queue_head_t puts a process to sleep;
	it comes into play when user space needs to read or write a large amount of data.

	1. Declare:	wait_queue_head_t my_queue;
	2. Initialize:	init_waitqueue_head(&my_queue);
	3. Wait in one function:	wait_event(queue, condition);  (do not do this in interrupt context)
	4. Wake up from another function:	wake_up(wait_queue_head_t *queue);
		(this one may be called from interrupt context, e.g. to wake another process after a DMA-type operation)
	*/
	
	/* 
	XXX - Updates of nr_jobs should happen inside the lock. But doing it outside
	   is OK since the filesystem is unmounted at this point 
	*/
	atomic_t nr_jobs;		/* Number of I/O jobs */

#define SLOW_REMOVE    1
#define FAST_REMOVE    2
	atomic_t remove_in_prog;	/* whether this logical device is currently being removed, and in which mode */

	int	dirty_thresh_set;	/* Per set dirty threshold to start cleaning, in blocks */
	int	max_clean_ios_set;	/* Max cleaning IOs per set, in blocks */
	int	max_clean_ios_total;	/* Total max cleaning IOs, in blocks */
	int	clean_inprog;	//number of blocks queued for writeback
	int	sync_index;	//block number from which the search starts during a sync
	int	nr_dirty;	//number of blocks in the DIRTY state
	unsigned long cached_blocks;	/* Number of cached blocks */
	unsigned long pending_jobs_count;	//total number of pending jobs over all blocks of the logical device
	int	md_blocks;		/* Numbers of metadata blocks, including header */

	/* Stats */
	struct flashcache_stats flashcache_stats;	//statistics of the virtual flashcache device

	/* Errors */
	struct flashcache_errors flashcache_errors;	//error counters of the virtual flashcache device

#define IO_LATENCY_GRAN_USECS	250	/* granularity of the IO latency histogram: 250 us */
#define IO_LATENCY_MAX_US_TRACK	10000	/* 10 ms: largest IO latency tracked */
#define IO_LATENCY_BUCKETS	(IO_LATENCY_MAX_US_TRACK / IO_LATENCY_GRAN_USECS)
	unsigned long	latency_hist[IO_LATENCY_BUCKETS];	//requests up to 10 ms, counted per bucket
	unsigned long	latency_hist_10ms;	//requests over 10 ms, counted together
	

#if LINUX_VERSION_CODE < KERNEL_VERSION(2,6,20)
	struct work_struct delayed_clean;//Every pending function is represented by a work_struct
#else
	struct delayed_work delayed_clean;
/*
	To ensure that work queued will be executed after a specified time interval has passed 
	since submission,the work_struct needs to be extended with a timer. 
	The solution is as obvious as can be:
	struct delayed_work 
	{
		struct work_struct work;
		struct timer_list timer;
	};
*/
#endif

	unsigned long pid_expire_check;

/*
In "cache everything" mode:

1.If the pid of the process issuing the IO is in the blacklist, do
	not cache the IO. ELSE,
2.If the tgid is in the blacklist, don't cache this IO. UNLESS
	The particular pid is marked as an exception (and entered in the
	whitelist, which makes the IO cacheable).
3.Finally, even if IO is cacheable up to this point, skip sequential IO 
	if configured by the sysctl.


Conversely, in "cache nothing" mode:

1.If the pid of the process issuing the IO is in the whitelist,
cache the IO. ELSE,
2.If the tgid is in the whitelist, cache this IO. UNLESS
  The particular pid is marked as an exception (and entered in the
	blacklist, which makes the IO non-cacheable).
3.Anything whitelisted is cached, regardless of sequential or random IO.
*/
	struct flashcache_cachectl_pid *blacklist_head, *blacklist_tail;
	struct flashcache_cachectl_pid *whitelist_head, *whitelist_tail;
	int num_blacklist_pids, num_whitelist_pids;
	unsigned long blacklist_expire_check, whitelist_expire_check;	/* the fields above manage the process blacklist and whitelist */

#define PENDING_JOB_HASH_SIZE		32
	struct pending_job *pending_job_hashbuckets[PENDING_JOB_HASH_SIZE];
	
	struct cache_c	*next_cache;

	void *sysctl_handle;

	// DM virtual device name, stored in superblock and restored on load
	char dm_vdevname[DEV_PATHLEN];
	// real device names are now stored as UUIDs
	char cache_devname[DEV_PATHLEN];
	char disk_devname[DEV_PATHLEN];

	/* 
	 * If the SSD returns errors, in WRITETHRU and WRITEAROUND modes, 
	 * bypass the cache completely. If the SSD dies or is removed, 
	 * we want to continue sending requests to the device.
	 * (the "device" here presumably means the whole virtual flashcache device)
	 */
	int bypass_cache;

	/* Per device sysctls */
	int sysctl_io_latency_hist;	//the IO latency histogram is only collected when this is set to 1
	/*
	Compute IO latencies and plot these out on a histogram.
	The scale is 250 usecs. This is disabled by default since 
	internally flashcache uses gettimeofday() to compute latency
	and this can get expensive depending on the clock source used.

	Setting this to 1 enables computation of IO latencies.
	The IO latency histogram is appended to 'dmsetup status'.
	*/
	int sysctl_do_sync;
	/*
	it is for write back
	dev.flashcache.<cachedev>.do_sync = 0
	Schedule cleaning of all dirty blocks in the cache. 
	*/
	int sysctl_stop_sync;
	/*
	it is for write back
	dev.flashcache.<cachedev>.stop_sync = 0
	Stop the sync in progress.
	*/
	int sysctl_dirty_thresh;	//dirty block threshold, as a percentage
	/*
	it is for write back
	dev.flashcache.<cachedev>.dirty_thresh_pct = 20
	Flashcache will attempt to keep the dirty blocks in each set 
	under this %. A lower dirty threshold increases disk writes, 
	and reduces block overwrites, but increases the blocks
	available for read caching.

	(A lower dirty threshold increases disk writes,
	but why does it reduce block overwrites?)
	*/
	int sysctl_pid_do_expiry;//Enable expiry on the list of pids in the white/black lists.
	int sysctl_max_pids;//Maximum number of pids in the white/black lists.
	int sysctl_pid_expiry_secs;//Set the expiry on the pid white/black lists.
	int sysctl_reclaim_policy;
	/*
	Defaults to FIFO. Can be switched at runtime.
	FIFO (0) vs LRU (1) vs LFU(2)
	*/
	int sysctl_zerostats;//Zero stats (once).
	int sysctl_error_inject;
	int sysctl_fast_remove;
	/*
	it is for write back
	Don't sync dirty blocks when removing cache. On a reload
	both DIRTY and CLEAN blocks persist in the cache. This 
	option can be used to do a quick cache remove. 
	CAUTION: The cache still has uncommitted (to disk) dirty
	blocks after a fast_remove.
	*/
	int sysctl_cache_all;
	/*
	Global caching mode to cache everything or cache nothing.
	See section on Caching Controls. Defaults to "cache everything".
	*/
	int sysctl_fallow_clean_speed;
	/*
	By default fallow blocks are cleaned every 15 minutes. This has a downside:
	it raises the likelihood of writebacks and hence the amount flushed to disk,
	putting more load on the slow backing disk. Hence the extra parameter
	fallow_clean_speed, which limits how aggressively each cleaning pass writes back.
	it is for write back
	The maximum number of "fallow clean" disk writes per set 
	per second. Defaults to 2.
	*/
	int sysctl_fallow_delay;
	/*
	it is for write back
	In seconds. Clean dirty blocks that have been "idle" (not 
	read or written) for fallow_delay seconds. Default is 15
	minutes. 
	Setting this to 0 disables idle cleaning completely.
	*/
	int sysctl_skip_seq_thresh_kb;
	/*
	Skip (don't cache) sequential IO larger than this number (in kb).
	0 (default) means cache all IO, both sequential and random.
	Sequential IO can only be determined 'after the fact', so
	this much of each sequential I/O will be cached before we skip 
	the rest.  Does not affect searching for IO in an existing cache.
	*/
	/* Sequential I/O spotter */
	struct sequential_io	seq_recent_ios[SEQUENTIAL_TRACKER_QUEUE_DEPTH];
	struct sequential_io	*seq_io_head;
	struct sequential_io 	*seq_io_tail;
};
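As an aside on how the geometry fields above are used: block_shift turns a sector number into a cache block number, assoc_shift then selects the set that block belongs to, and num_sets bounds the result; the assoc entries of that set are the candidate cache blocks. The following is only an illustrative sketch of that arithmetic (the function name and the exact hashing are assumptions for illustration, not the verbatim flashcache code):

/* Illustrative sketch only: how block_shift/assoc_shift/num_sets style
 * fields are typically combined to locate the cache set for a disk
 * block number (dbn, in sectors), assuming the field meanings commented
 * in struct cache_c above. */
static unsigned long example_hash_block(struct cache_c *dmc, sector_t dbn)
{
	unsigned long set_number;

	/* drop the sector-within-block bits, then the block-within-set bits */
	set_number = (unsigned long)(dbn >> (dmc->block_shift + dmc->assoc_shift));
	/* fold the result onto the available sets */
	set_number %= dmc->num_sets;
	return set_number;
}

/* The candidate cache blocks for dbn are then the 'assoc' entries
 * dmc->cache[set_number * dmc->assoc] ..
 * dmc->cache[set_number * dmc->assoc + dmc->assoc - 1]. */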



3. A closer look at the target driver
Every target device appears in the kernel as a corresponding driver; these drivers must conform to the DM framework and are managed by DM. One might ask why the drivers in the DM framework are target drivers rather than MD drivers. The reason is that in the DM design the MD is only a uniform external interface: every target driver exposes the same interface outward, so there is no need to write a different MD for each virtualization scheme; it is enough to provide different target drivers. (Arguably calling them "mapped drivers" would avoid some confusion, because instances of MD and of a target driver, hereafter simply "driver", stand in a one-to-one relationship, whereas a target driver relates to its target devices, hereafter simply "targets", one-to-many. Folding the driver concept into the MD turns this into a one-to-many relationship between MD and targets.)

Let us fix the abbreviations used from here on: mapped device is shortened to md and target device to target, which roughly matches the naming conventions in the kernel code. The target driver is shortened to driver (it does not appear in the source as such, since the DM framework manages targets, not drivers), and the source device is shortened to device (in the source these devices only show up as variables whose names contain dev).

Each driver registers itself with DM through a struct target_type, and this structure is shared by all instances of the driver; in other words, every driver instance can be viewed as belonging to this type, so target_type is really better understood as a driver type. flashcache's struct target_type looks like this:

static struct target_type flashcache_target = {
	.name   = "flashcache",
	.version= {1, 0, 4},
	.module = THIS_MODULE,
	.ctr    = flashcache_ctr,	//constructs a target device
	.dtr    = flashcache_dtr,	//destroys a target device
	.map    = flashcache_map,	//maps an IO request onto the target
	.status = flashcache_status,	//reports the current status of the target device
	.ioctl 	= flashcache_ioctl,	//lets users change flashcache parameters while the device is running
};
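The target_type above only takes effect once it has been registered with the DM core. In flashcache this happens in the module init path; what follows is a minimal sketch of just the registration part (the real flashcache_init also sets up job caches, workqueues and sysctl state, all omitted here):

/* Minimal sketch of registering/unregistering the flashcache target type
 * with the DM core (error handling and the rest of module init omitted). */
static int __init flashcache_init(void)
{
	int r;

	r = dm_register_target(&flashcache_target);	/* make "flashcache" known to DM */
	if (r < 0)
		DMERR("flashcache: register failed %d", r);
	return r;
}

static void __exit flashcache_exit(void)
{
	dm_unregister_target(&flashcache_target);	/* remove the target type */
}

module_init(flashcache_init);
module_exit(flashcache_exit);

Once registered, any mapped device whose table names the "flashcache" target has its constructor, map and other methods routed to the functions listed in flashcache_target.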

Here is a brief walk through the implementation of flashcache_map:

int flashcache_map(struct dm_target *ti, struct bio *bio,
	       		union map_info *map_context){
	struct cache_c *dmc = (struct cache_c *) ti->private;
	int sectors = to_sector(bio->bi_size);
	int queued;
	
	if (sectors <= 32)
		size_hist[sectors]++;	//histogram of bio request sizes; apparently only sizes up to 16KB (32 sectors) are recorded

	if (bio_barrier(bio))	//barriers are not supported; they would force all previously submitted IO to complete before this one
		/*
		Insert a serialization point in the IO queue, forcing previously
 		submitted IO to be completed before this one is issued.
		*/
		return -EOPNOTSUPP;/* Operation not supported on transport endpoint */

	VERIFY(to_sector(bio->bi_size) <= dmc->block_size);	//bi_size is in bytes and must be converted to sectors

	if (bio_data_dir(bio) == READ)
		dmc->flashcache_stats.reads++;	//flashcache_stats covers the whole logical device
	else
		dmc->flashcache_stats.writes++;

	spin_lock_irq(&dmc->cache_spin_lock);	//disable local interrupts and take the protecting spinlock
	if (unlikely(dmc->sysctl_pid_do_expiry && 	//pid expiry on the white/black lists is enabled
		     (dmc->whitelist_head || dmc->blacklist_head)))	//and at least one of those lists is non-empty
		flashcache_pid_expiry_all_locked(dmc);	//so walk both lists and remove the expired pids
	if (unlikely(dmc->bypass_cache) ||	//the uncacheable cases, one per condition below
	    (to_sector(bio->bi_size) != dmc->block_size) ||
	    (bio_data_dir(bio) == WRITE && 
	    	/*
	    	Only writes are checked here. For a read, even an uncacheable one,
	    	the request is still served from the SSD or the disk depending on
	    	whether it hits the cache; only the follow-up differs:
	    	if cacheable, the block just accessed is added to the cache,
	    	if not cacheable, it is not added.

	    	Writes are different: cacheability decides how the request is
	    	handled from the very start. A cacheable write is served directly
	    	by the SSD, an uncacheable one goes straight to the disk.
	    	*/
	     (dmc->cache_mode == FLASHCACHE_WRITE_AROUND || flashcache_uncacheable(dmc, bio)))) {
	     /*
	     The uncacheable cases:
	     1. bypass_cache was explicitly requested
	     2. the bio size differs from the logical device's block size
	        (mentioned in the design documents, though the exact reason is unclear)
	     3. the bio is a write AND (the cache mode is write-around, or the
	        process issuing the bio is marked uncacheable)
	     */
		queued = flashcache_inval_blocks(dmc, bio);
		/*
		Even in the uncacheable cases, further conditions must hold before
		we can go straight to uncached disk IO.

		flashcache_inval_blocks()

		returns 1 when a cache block overlapping the bio is found and that
		block is dirty, or is a valid block that is busy, or is a valid
		block with pending requests;

		returns 0 when no overlapping cache block is found, or when an
		overlapping block is found but it is neither dirty, nor a busy
		valid block, nor a valid block with pending requests;

		returns -12 (ENOMEM) when there is not enough memory to allocate a job.

		We invalidate any overlapping cache blocks (cleaning them first if necessary).
		*/
		spin_unlock_irq(&dmc->cache_spin_lock);
		if (queued) {	//an overlapping block as described above may have been found
			if (unlikely(queued < 0))	//or the check itself failed because memory ran out
				flashcache_bio_endio(bio, -EIO, dmc, NULL);
				/*
				classify the bio by its service time for the latency
				histogram, then signal completion of the bio and return
				the result of handling it
				*/
		/*
		Once such blocks are found we cannot simply do uncached IO,
		because serving this bio on the disk would supersede what the
		cache block is caching: if the block is dirty, its contents have
		not yet been written back to disk and must not be overwritten;
		if it is busy or has pending requests, its contents still need
		to be accessed and must not be overwritten either.
		*/
		} else {
		/*
		no such blocks were found,
		so uncached IO handling can start
		*/
			/* Start uncached IO */
			flashcache_start_uncached_io(dmc, bio);
		}
	} else {
		/*
		the cacheable case:
		handle the bio according to whether it is a read or a write
		*/
		spin_unlock_irq(&dmc->cache_spin_lock);		
		if (bio_data_dir(bio) == READ)
			flashcache_read(dmc, bio);
		else
			flashcache_write(dmc, bio);
	}
	return DM_MAPIO_SUBMITTED;
}
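flashcache_map returns DM_MAPIO_SUBMITTED, which tells the DM core that the target has taken ownership of the bio and will submit it (to the SSD or the disk) itself. Simpler targets instead just rewrite the bio and hand it back with DM_MAPIO_REMAPPED, in which case DM resubmits it. For contrast, a hypothetical pass-through map function in the style of dm-linear (the struct and field names here are illustrative assumptions, not taken from flashcache) would look roughly like this:

/* Hypothetical per-target state for a simple pass-through target. */
struct example_linear_c {
	struct dm_dev *dev;	/* underlying device */
	sector_t start;		/* offset into that device */
};

static int example_linear_map(struct dm_target *ti, struct bio *bio,
			      union map_info *map_context)
{
	struct example_linear_c *lc = ti->private;

	/* redirect the bio to the backing device, shifted by our offset */
	bio->bi_bdev = lc->dev->bdev;
	bio->bi_sector = lc->start + (bio->bi_sector - ti->begin);

	/* let the DM core resubmit the remapped bio */
	return DM_MAPIO_REMAPPED;
}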


4. How a mapped device is created in the kernel:
1. Based on the parameters passed in through the ioctl interface that the kernel exposes to user space, the dev_create function in dm-ioctl.c creates the corresponding mapped device structure. This step is straightforward: it mainly allocates the necessary memory, including the mapped device itself and the memory pools pre-allocated for IO, registers dm_request as the request function of the mapped device's request queue via the kernel's blk_queue_make_request, and registers the mapped device with the kernel as a disk block device.
2. dm_hash_insert inserts the newly created mapped device into a global hash table inside device mapper, which holds all mapped devices currently created in the kernel.
3. A user-space command invokes table_load through ioctl. This function builds the mapping table of the given mapped device and the target devices it maps to, based on the parameters passed from user space. It first constructs the dm_table and dm_target structures, then calls dm_table_add_target in dm-table.c to initialize them from the user-supplied parameters; according to the target type named in the parameters, it calls that target type's constructor ctr to build the in-memory structure for the target device, and then updates the B-tree maintained in dm_table with the newly built dm_target. Once this is done, the finished dm_table is attached to the hash_cell of the mapped device in the global hash table.
4. Finally, do_resume is called through ioctl to bind the mapped device to its mapping table. In practice this means dm_swap_table stores the pointer to the current dm_table into the map field of the mapped_device and then updates the fields of mapped_device that describe its current state.
After these four main steps, device mapper has created in the kernel a mapped device: a logical block device ready for use. From user space the whole sequence is normally driven through libdevmapper or the dmsetup tool, as sketched below.
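For completeness, here is a sketch of how user space can drive those steps through libdevmapper, which wraps the DM ioctl interface (dmsetup is a thin front end over the same calls). The device name, size and the use of the standard "linear" target are illustrative assumptions; a flashcache table would instead carry the parameter string expected by flashcache_ctr, normally assembled by the flashcache_create utility.

/* Sketch: create a mapped device from user space with libdevmapper.
 * Build with -ldevmapper. DM_DEVICE_CREATE with targets attached
 * covers dev_create, table_load and resume in one call to dm_task_run(). */
#include <stdio.h>
#include <libdevmapper.h>

int main(void)
{
	struct dm_task *dmt;
	int ret = 1;

	dmt = dm_task_create(DM_DEVICE_CREATE);
	if (!dmt)
		return 1;

	if (!dm_task_set_name(dmt, "example_md"))
		goto out;

	/* one target: sectors 0..204799 of the logical device -> /dev/sdb offset 0 */
	if (!dm_task_add_target(dmt, 0, 204800, "linear", "/dev/sdb 0"))
		goto out;

	if (!dm_task_run(dmt))		/* issues the DM ioctls */
		goto out;

	printf("created /dev/mapper/example_md\n");
	ret = 0;
out:
	dm_task_destroy(dmt);
	return ret;
}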


5. Two figures to deepen the understanding
1. The relationships among the main device mapper data structures

2. The layering of the various devices in flashcache

