File System Study Notes 8: The MQ (Multi-Queue) Mechanism in Detail

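The previous post described the multi-queue (MQ) mechanism: it pairs multiple CPU cores with multiple queues so that IO requests are processed concurrently, which improves efficiency.
This post walks through, step by step, how the MQ mechanism carries a bio from submission down to the IO scheduler.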

1. MQ processing flow diagram

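The overall flow diagram has three parts: initializing the target parameters of the hardware device, initializing the request queue (request_queue), and processing bio requests. The first two parts register the underlying storage device with the file system and set up the mapping between software queues and hardware queues. The last part is how, under the MQ mechanism, a bio is finally turned into the corresponding request and placed on a hardware queue.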

2. SCSI device initialization

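Every underlying storage device that speaks the SCSI protocol goes through this initialization. The kernel gives each SCSI device a set of target device parameters, which mainly include the hardware queue depth, the timeout, and the hardware interrupt handler.
These target parameters are later passed into the request_queue structure, where they are used to initialize the hardware-queue parameters.
The detailed process is as follows: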

3. Initializing the request_queue

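As mentioned before, the Linux kernel supports both the single-queue and the multi-queue mechanism, and in both cases IO enters through the same hook, make_request_fn. So how does the kernel know whether to take the multi-queue path or the single-queue path?
The answer lies in which function is registered as make_request_fn.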

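The initialization steps are as follows:

1. When the underlying storage device has a single channel, blk_sq_make_request is registered as the queue's make_request_fn; when the device has multiple channels, blk_mq_make_request is registered instead.
2. Based on the number of queues on each side, the kernel's mapping rule is used to build the map array from CPU number (software queue number) to hardware queue number.
3. Software and hardware queue contexts are allocated and initialized according to the number of CPUs and the number of hardware queues, respectively.
4. The contexts of software and hardware queues that map to each other are linked together.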
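
To make the map array from step 2 more concrete, here is a minimal user-space sketch. It assumes a plain round-robin spread of CPUs over hardware queues; the kernel's own mapping rule also looks at CPU topology, so treat this only as a model. The names build_queue_map and map_queue are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model (an assumption, not the kernel's algorithm): spread the
 * CPUs over the hardware queues round-robin. Index = CPU number, which is
 * also the software queue number; value = hardware queue number.
 */
void build_queue_map(unsigned int *map, unsigned int nr_cpus,
                     unsigned int nr_hw_queues)
{
    assert(map != NULL && nr_hw_queues > 0);
    for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
        map[cpu] = cpu % nr_hw_queues;
}

/* Look up the hardware queue that serves the software queue of 'cpu'. */
unsigned int map_queue(const unsigned int *map, unsigned int cpu)
{
    assert(map != NULL);
    return map[cpu];
}
```

With 8 CPUs and 2 hardware queues, this model sends the even-numbered CPUs to hardware queue 0 and the odd-numbered ones to hardware queue 1. With as many hardware queues as CPUs, every software queue gets a hardware queue of its own.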

The function-level flow is shown in the following figure:

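Notes from reading the queue-initialization source:
1. In scsi_scan.c under drivers/scsi, the function scsi_alloc_sdev checks whether multiqueue is enabled for the SCSI host that the block device belongs to: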

if (shost_use_blk_mq(shost))    // multiqueue is enabled
     sdev->request_queue = scsi_mq_alloc_queue(sdev);    // allocate the MQ request queue
 else
     sdev->request_queue = scsi_alloc_queue(sdev);

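2. If the device supports MQ, scsi_mq_alloc_queue is called. It initializes the device queue through blk_mq_init_queue, which sets up the queue's parameters from the information in set. In the kernel version these listings appear to come from, that work is done by blk_mq_init_allocated_queue, which blk_mq_init_queue calls:

struct request_queue *blk_mq_init_allocated_queue(struct blk_mq_tag_set *set,
                                                  struct request_queue *q)
{
   /* mark the queue as mq asap */
   q->mq_ops = set->ops;   // mark this queue as an MQ queue

   q->queue_ctx = alloc_percpu(struct blk_mq_ctx); // allocate the per-CPU contexts, i.e. the software queues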
   if (!q->queue_ctx)
       goto err_exit;

   q->queue_hw_ctx = kzalloc_node(nr_cpu_ids * sizeof(*(q->queue_hw_ctx)),
                       GFP_KERNEL, set->numa_node);    // allocate the array of hardware queue contexts
   if (!q->queue_hw_ctx)
       goto err_percpu;

   q->mq_map = blk_mq_make_queue_map(set);   // build the mapping between software queues and hardware queues
   if (!q->mq_map)
       goto err_map;

   blk_mq_realloc_hw_ctxs(set, q);
   if (!q->nr_hw_queues)
       goto err_hctxs;

   INIT_WORK(&q->timeout_work, blk_mq_timeout_work);
   blk_queue_rq_timeout(q, set->timeout ? set->timeout : 30 * HZ); // set the request timeout of the device queue (30 seconds if the driver gives none)

   q->nr_queues = nr_cpu_ids;

   q->queue_flags |= QUEUE_FLAG_MQ_DEFAULT;

   if (!(set->flags & BLK_MQ_F_SG_MERGE))
       q->queue_flags |= 1 << QUEUE_FLAG_NO_SG_MERGE;

   q->sg_reserved_size = INT_MAX;

   INIT_WORK(&q->requeue_work, blk_mq_requeue_work);
   INIT_LIST_HEAD(&q->requeue_list);
   spin_lock_init(&q->requeue_lock);

   if (q->nr_hw_queues > 1)
       blk_queue_make_request(q, blk_mq_make_request); // register q->make_request_fn
   else
       blk_queue_make_request(q, blk_sq_make_request);

   /*
    * Do this after blk_queue_make_request() overrides it...
    */
   q->nr_requests = set->queue_depth;

   if (set->ops->complete)
       blk_queue_softirq_done(q, set->ops->complete);

   blk_mq_init_cpu_queues(q, set->nr_hw_queues);

   get_online_cpus();
   mutex_lock(&all_q_mutex);

   list_add_tail(&q->all_q_node, &all_q_list);
   blk_mq_add_queue_tag_set(set, q);
   blk_mq_map_swqueue(q, cpu_online_mask);

   mutex_unlock(&all_q_mutex);
   put_online_cpus();

   return q;

err_hctxs:
   kfree(q->mq_map);
err_map:
   kfree(q->queue_hw_ctx);
err_percpu:
   free_percpu(q->queue_ctx);
err_exit:
   q->mq_ops = NULL;
   return ERR_PTR(-ENOMEM);
}

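3. Once the initialization above is done, blk_queue_make_request registers blk_mq_make_request as q->make_request_fn (or blk_sq_make_request when there is only one hardware queue).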
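
The registration is ordinary function-pointer dispatch. The sketch below models it in user space: the queue stores one handler, chosen once from the number of hardware queues, and every later submission goes through that pointer. All toy_* names and the return codes are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum toy_path { TOY_PATH_SQ = 1, TOY_PATH_MQ = 2 };

struct toy_queue;
typedef enum toy_path (*toy_make_request_fn)(struct toy_queue *q, int bio_id);

struct toy_queue {
    unsigned int nr_hw_queues;
    toy_make_request_fn make_request;   /* plays the role of q->make_request_fn */
};

/* Stand-ins for blk_sq_make_request / blk_mq_make_request. */
enum toy_path toy_sq_make_request(struct toy_queue *q, int bio_id)
{
    (void)q; (void)bio_id;
    return TOY_PATH_SQ;
}

enum toy_path toy_mq_make_request(struct toy_queue *q, int bio_id)
{
    (void)q; (void)bio_id;
    return TOY_PATH_MQ;
}

/* Choose the handler once, at initialization time. */
void toy_register(struct toy_queue *q, unsigned int nr_hw_queues)
{
    assert(q != NULL && nr_hw_queues > 0);
    q->nr_hw_queues = nr_hw_queues;
    q->make_request = nr_hw_queues > 1 ? toy_mq_make_request
                                       : toy_sq_make_request;
}

/* Callers never test the queue type; they just call through the pointer. */
enum toy_path toy_submit(struct toy_queue *q, int bio_id)
{
    assert(q != NULL && q->make_request != NULL);
    return q->make_request(q, bio_id);
}
```

The point of the design is that the submission path stays identical for both kinds of device. The single-queue versus multi-queue decision is taken once and is invisible to callers afterwards.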

4. bio request processing

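bio request processing is what happens after an upper layer such as device mapper submits bios one by one toward the IO scheduler. The overall flow is shown in the following figure: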

First, the bio is submitted (submit_bio) and enters generic_make_request, which is where the block layer starts processing it.
generic_make_request is as follows:

blk_qc_t generic_make_request(struct bio *bio)
{
    struct bio_list bio_list_on_stack;
    blk_qc_t ret = BLK_QC_T_NONE;

    if (!generic_make_request_checks(bio))  // check whether this bio is valid
        goto out;
    if (current->bio_list) {
        bio_list_add(current->bio_list, bio);
        goto out;
    }
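    // make_request_fn may be active only once per task at a time. A non-NULL
    // current->bio_list means one is already running, so the bio is appended
    // to the tail of that list and handled later; otherwise it is processed here.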

    BUG_ON(bio->bi_next);
    bio_list_init(&bio_list_on_stack);      // initialize the on-stack bio list
    current->bio_list = &bio_list_on_stack; // was NULL until now; marks make_request_fn as active
    do {
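        struct request_queue *q = bdev_get_queue(bio->bi_bdev); // get the request queue of the bio's device

        if (likely(blk_queue_enter(q, false) == 0)) {   // is the queue alive and able to accept this request?
            ret = q->make_request_fn(q, bio);   // hand the bio to the registered handler for further block-layer processing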
            blk_queue_exit(q);

            bio = bio_list_pop(current->bio_list);  
        } else {
            struct bio *bio_next = bio_list_pop(current->bio_list);

            bio_io_error(bio);
            bio = bio_next;
        }
    } while (bio);
    current->bio_list = NULL; /* deactivate */  // clear the bio list so that make_request_fn is available again

out:
    return ret;
}

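Indeed, the code matches what we described earlier: bios are managed through a queue. To process a bio, the kernel first obtains the request queue of the bio's device (that is, its software and hardware queues), then checks whether that queue is alive and can accept the request, and if so calls make_request_fn, which is the function registered when the device was initialized.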
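
The current->bio_list trick in generic_make_request turns recursion into iteration: a handler that submits further bios (for example after a split) only appends them to a list, and the outermost caller drains that list in a loop. Below is a small user-space model of the pattern. Everything named toy_* is hypothetical, and the "split" rule (a bio with id 10 or more spawns two children) exists only to trigger nested submissions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 16

struct toy_task {
    int active;              /* like current->bio_list != NULL */
    int pending[TOY_MAX];    /* FIFO of bios queued by nested submissions */
    size_t head, tail;
    int depth, max_depth;    /* nesting depth of the handler, for inspection */
    int order[TOY_MAX];      /* order in which bios were handled */
    size_t handled;
};

void toy_submit(struct toy_task *t, int bio);

/* Hypothetical handler: a bio with id >= 10 submits two child bios. */
void toy_make_request(struct toy_task *t, int bio)
{
    assert(t->handled < TOY_MAX);
    t->depth++;
    if (t->depth > t->max_depth)
        t->max_depth = t->depth;
    t->order[t->handled++] = bio;
    if (bio >= 10) {
        toy_submit(t, bio / 10);    /* nested submission: only queued */
        toy_submit(t, bio % 10);
    }
    t->depth--;
}

void toy_submit(struct toy_task *t, int bio)
{
    assert(t != NULL);
    if (t->active) {                /* a handler is already running */
        assert(t->tail < TOY_MAX);
        t->pending[t->tail++] = bio;
        return;
    }
    t->active = 1;
    for (;;) {                      /* outermost caller drains the list */
        toy_make_request(t, bio);
        if (t->head == t->tail)
            break;
        bio = t->pending[t->head++];
    }
    t->active = 0;                  /* deactivate */
    t->head = t->tail = 0;
}
```

Submitting bio 42 on a zero-initialized toy_task handles 42, then 4, then 2, and the handler is never nested more than one level deep. Keeping stack usage bounded when stacked devices resubmit bios is what the real code is after.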

Next, under the MQ mechanism, make_request_fn points to blk_mq_make_request, shown below:

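static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
{
    const int is_sync = rw_is_sync(bio->bi_rw); // is this a synchronous request?
    const int is_flush_fua = bio->bi_rw & (REQ_FLUSH | REQ_FUA);// is this barrier IO (flush/FUA)?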
    struct blk_map_ctx data;
    struct request *rq;
    unsigned int request_count = 0;
    struct blk_plug *plug;
    struct request *same_queue_rq = NULL;
    blk_qc_t cookie;

    blk_queue_bounce(q, &bio);      // DMA addressing limits: if the device can only reach low memory, bio data in high memory is copied to low memory

    if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {    // data-integrity check and preparation for the bio
        bio_io_error(bio);
        return BLK_QC_T_NONE;
    }

    blk_queue_split(q, &bio, q->bio_split); // if the bio exceeds the preset maximum size, split it; the split-off part goes through generic_make_request again

    if (!is_flush_fua && !blk_queue_nomerges(q) &&
        blk_attempt_plug_merge(q, bio, &request_count, &same_queue_rq)) // not barrier IO, the queue allows merging, and the bio merged into a request on the plug list
        return BLK_QC_T_NONE;

    rq = blk_mq_map_request(q, bio, &data); // allocate a request for this bio in MQ; data records its software and hardware queue
    if (unlikely(!rq))
        return BLK_QC_T_NONE;

    cookie = blk_tag_to_qc_t(rq->tag, data.hctx->queue_num);

    if (unlikely(is_flush_fua)) {
        blk_mq_bio_to_request(rq, bio); // turn the bio into a request
        blk_insert_flush(rq);   // barrier IO goes to the flush queue, which is sent on to the driver directly
        goto run_queue;
    }

    plug = current->plug;
    /*
     * If the driver supports defer issued based on 'last', then
     * queue it up like normal since we can potentially save some
     * CPU this way.
     */
    if (((plug && !blk_queue_nomerges(q)) || is_sync) &&    // a plug list exists and the queue allows merging, or the request is synchronous...
        !(data.hctx->flags & BLK_MQ_F_DEFER_ISSUE)) {   // ...and the driver does not ask for deferred issue
        struct request *old_rq = NULL;

        blk_mq_bio_to_request(rq, bio); // turn the bio into a request

        /*
         * We do limited pluging. If the bio can be merged, do that.
         * Otherwise the existing request in the plug list will be
         * issued. So the plug list will have one request at most
         */
        if (plug) {
            /*
             * The plug list might get flushed before this. If that
             * happens, same_queue_rq is invalid and plug list is
             * empty
             */
            if (same_queue_rq && !list_empty(&plug->mq_list)) {
                old_rq = same_queue_rq;
                list_del_init(&old_rq->queuelist);  // an earlier request for the same queue is on the plug list: take it off so it can be issued now
            }
            list_add_tail(&rq->queuelist, &plug->mq_list);  // add the new request to the plug list
        } else /* is_sync */
            old_rq = rq;
        blk_mq_put_ctx(data.ctx);   
        if (!old_rq)    // no earlier request left to issue
            goto done;
        if (!blk_mq_direct_issue_request(old_rq, &cookie))  // try to hand the request straight to the driver
            goto done;
        blk_mq_insert_request(old_rq, false, true, true);   // the driver could not take it now: add it to the software queue
        goto done;
    }

    if (!blk_mq_merge_queue_io(data.hctx, data.ctx, rq, bio)) { // reached when the driver supports deferred issue or the request is async
        // merge if possible, otherwise add the request to the software queue
        /*
         * For a SYNC request, send it to the hardware immediately. For
         * an ASYNC request, just ensure that we run it later on. The
         * latter allows for merging opportunities and more efficient
         * dispatching.
         */
run_queue:
        blk_mq_run_hw_queue(data.hctx, !is_sync || is_flush_fua);   // run the hardware queue
    }
    blk_mq_put_ctx(data.ctx);
done:
    return cookie;
}

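This function turns a bio into a request and dispatches it. There are four main cases:
1. Flush and other barrier IO bypasses the software queue and goes straight to the flush queue. Barrier IO must not be delayed: it has to reach the driver and cannot be held up behind other queued requests.
2. For non-barrier IO, the plug list is checked first for a possible merge. On a hit the bio is merged and the function returns immediately; on a miss a request is allocated for the bio.
3. On a plug miss, the request is added to the plug list. Note that under MQ the plug list does not really accumulate and release requests in bulk: its depth never exceeds 1. A request that leaves the plug list is issued directly to the driver, and if the driver cannot take it, it still ends up in the software queue.
4. A request headed for the software queue is checked for merging once more. If it merges, the function returns; if not, it is scheduled by the software queue's scheduler and finally moved to the hardware queue.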
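
The branch conditions of blk_mq_make_request can be condensed into one small decision function. The sketch below mirrors the if-conditions of the listing above in user space. The struct, the enum, and the idea of passing the outcome of blk_attempt_plug_merge in as a boolean are simplifications made for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_path {
    TOY_FLUSH_QUEUE,      /* case 1: flush/FUA goes to the flush queue */
    TOY_PLUG_MERGED,      /* case 2: merged into a request on the plug list */
    TOY_PLUG_OR_DIRECT,   /* case 3: limited plugging and direct issue */
    TOY_SW_QUEUE          /* case 4: merge into or insert into the software queue */
};

struct toy_bio_state {
    bool is_flush_fua;    /* REQ_FLUSH or REQ_FUA is set */
    bool is_sync;         /* synchronous request */
    bool has_plug;        /* current->plug != NULL */
    bool nomerges;        /* the queue forbids merging */
    bool plug_merge_hit;  /* blk_attempt_plug_merge() would succeed */
    bool defer_issue;     /* BLK_MQ_F_DEFER_ISSUE is set on the hctx */
};

enum toy_path toy_classify(const struct toy_bio_state *s)
{
    assert(s != NULL);
    /* Same order as the listing: plug merge first, then flush/FUA. */
    if (!s->is_flush_fua && !s->nomerges && s->plug_merge_hit)
        return TOY_PLUG_MERGED;
    if (s->is_flush_fua)
        return TOY_FLUSH_QUEUE;
    if (((s->has_plug && !s->nomerges) || s->is_sync) && !s->defer_issue)
        return TOY_PLUG_OR_DIRECT;
    return TOY_SW_QUEUE;
}
```

For example, a synchronous bio without a plug takes the direct-issue branch. The same bio on a queue whose driver sets BLK_MQ_F_DEFER_ISSUE falls through to the software-queue branch.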

The source code corresponding to these four cases is shown in the following figure:

blk_mq_direct_issue_request hands the request directly to the underlying driver and reports whether the SCSI device could accept it:

static int blk_mq_direct_issue_request(struct request *rq, blk_qc_t *cookie)
{
    int ret;
    struct request_queue *q = rq->q;
    struct blk_mq_hw_ctx *hctx = q->mq_ops->map_queue(q,
            rq->mq_ctx->cpu);
    struct blk_mq_queue_data bd = {
        .rq = rq,
        .list = NULL,
        .last = 1
    };
    blk_qc_t new_cookie = blk_tag_to_qc_t(rq->tag, hctx->queue_num);

    /*
     * For OK queue, we are done. For error, kill it. Any other
     * error (busy), just add it to our list as we previously
     * would have done
     */
    ret = q->mq_ops->queue_rq(hctx, &bd);   // pass the request straight to the SCSI queue; the return value says whether it was accepted
    if (ret == BLK_MQ_RQ_QUEUE_OK) {
        *cookie = new_cookie;
        return 0;
    }

    __blk_mq_requeue_request(rq);   // not accepted: undo the request's "started" state, which may decrement its nr_phys_segments

    if (ret == BLK_MQ_RQ_QUEUE_ERROR) {
        *cookie = BLK_QC_T_NONE;
        rq->errors = -EIO;
        blk_mq_end_request(rq, rq->errors);
        return 0;
    }

    return -1;
}
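
The three outcomes handled by blk_mq_direct_issue_request can be modelled in a few lines of user-space C. In this sketch the driver is a plain callback and the request is a toy struct; the toy_* names are hypothetical, and only the control flow is meant to match the listing above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum toy_status { TOY_QUEUE_OK, TOY_QUEUE_BUSY, TOY_QUEUE_ERROR };

struct toy_request {
    int started;     /* set while the driver owns the request */
    int completed;   /* set once the request has been ended */
    int error;       /* 0, or a negative errno value */
};

typedef enum toy_status (*toy_queue_rq_fn)(struct toy_request *rq);

/*
 * Returns 0 when nothing more needs to be done with the request (accepted,
 * or ended with an error), and -1 when the caller must queue it itself.
 */
int toy_direct_issue(struct toy_request *rq, toy_queue_rq_fn queue_rq)
{
    assert(rq != NULL && queue_rq != NULL);
    rq->started = 1;
    enum toy_status ret = queue_rq(rq);
    if (ret == TOY_QUEUE_OK)
        return 0;

    rq->started = 0;                 /* undo the "started" state */
    if (ret == TOY_QUEUE_ERROR) {
        rq->error = -EIO;            /* hard failure: end the request */
        rq->completed = 1;
        return 0;
    }
    return -1;                       /* busy: caller falls back to the software queue */
}

/* Three trivial stand-in drivers, one per outcome. */
enum toy_status toy_driver_ok(struct toy_request *rq)    { (void)rq; return TOY_QUEUE_OK; }
enum toy_status toy_driver_busy(struct toy_request *rq)  { (void)rq; return TOY_QUEUE_BUSY; }
enum toy_status toy_driver_error(struct toy_request *rq) { (void)rq; return TOY_QUEUE_ERROR; }
```

Only the busy outcome returns -1. That is the case in which blk_mq_make_request goes on to call blk_mq_insert_request.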

blk_mq_insert_request adds the request to the software queue:

void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
        bool async)
{
    struct request_queue *q = rq->q;
    struct blk_mq_hw_ctx *hctx;
    struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;

    current_ctx = blk_mq_get_ctx(q);    // get the software queue context of the current CPU
    if (!cpu_online(ctx->cpu))
        rq->mq_ctx = ctx = current_ctx;

    hctx = q->mq_ops->map_queue(q, ctx->cpu);   // find the hardware queue context mapped to this software queue

    spin_lock(&ctx->lock);
    __blk_mq_insert_request(hctx, rq, at_head); // find the ctx through rq and add rq to that software queue
    spin_unlock(&ctx->lock);

    if (run_queue)
        blk_mq_run_hw_queue(hctx, async);   // run the hardware queue; async selects deferred execution

    blk_mq_put_ctx(current_ctx);
}

blk_mq_merge_queue_io checks whether the bio can be merged with a request already in the current software queue:

static inline bool blk_mq_merge_queue_io(struct blk_mq_hw_ctx *hctx,
                     struct blk_mq_ctx *ctx,
                     struct request *rq, struct bio *bio)
{
    if (!hctx_allow_merges(hctx) || !bio_mergeable(bio)) {  // merging is not allowed
        blk_mq_bio_to_request(rq, bio);
        spin_lock(&ctx->lock);
insert_rq:
        __blk_mq_insert_request(hctx, rq, false);   // add the request to the software queue
        spin_unlock(&ctx->lock);
        return false;
    } else {
        struct request_queue *q = hctx->queue;

        spin_lock(&ctx->lock);
        if (!blk_mq_attempt_merge(q, ctx, bio)) {   // try to merge
            blk_mq_bio_to_request(rq, bio); // cannot merge: go and add the request to the software queue
            goto insert_rq;
        }

        spin_unlock(&ctx->lock);
        __blk_mq_free_request(hctx, ctx, rq);   // the bio was merged into a request already in the software queue, so release the request just allocated for it
        return true;
    }
}
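
What "merging" means here can be shown with the simplest case, a back merge: the new bio starts exactly where an existing request ends, so the request is simply extended. The sketch below is a user-space illustration under that assumption. The struct and function names are made up, and the real blk_mq_attempt_merge checks more than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A contiguous range of sectors, used for both a request and a bio. */
struct toy_extent {
    uint64_t sector;       /* first sector */
    uint32_t nr_sectors;   /* length in sectors */
};

/*
 * Back merge: extend 'rq' by 'bio' if the bio starts right after the
 * request ends and the result stays within 'max_sectors'.
 */
bool toy_try_back_merge(struct toy_extent *rq, const struct toy_extent *bio,
                        uint32_t max_sectors)
{
    assert(rq != NULL && bio != NULL);
    if (rq->sector + rq->nr_sectors != bio->sector)
        return false;                      /* not contiguous */
    if ((uint64_t)rq->nr_sectors + bio->nr_sectors > max_sectors)
        return false;                      /* merged request would be too large */
    rq->nr_sectors += bio->nr_sectors;     /* the request absorbs the bio */
    return true;
}
```

When a merge like this succeeds, no new request is needed for the bio. This matches the path in the listing where the freshly allocated request is freed again and the function returns true.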

blk_mq_run_hw_queue runs the hardware queue and dispatches its requests to the SCSI driver layer:

void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
{
    if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state) ||
        !blk_mq_hw_queue_mapped(hctx)))
        return;

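    if (!async) {   // async is false: try to run right away
    // async == false means the call is synchronous and must be handled immediately; asynchronous work is handed to kblockd
        int cpu = get_cpu();
        if (cpumask_test_cpu(cpu, hctx->cpumask)) {  // the current CPU is in this hardware queue's cpumask
            __blk_mq_run_hw_queue(hctx);    // run the hardware queue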
            put_cpu();
            return;
        }

        put_cpu();
    }

    kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
            &hctx->run_work, 0);
}
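
The decision made by blk_mq_run_hw_queue boils down to three outcomes: do nothing, run the queue inline on the current CPU, or defer the work to kblockd. The following user-space sketch models only that decision. The toy_* names and the 64-bit cpumask are assumptions of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_run { TOY_RUN_SKIPPED, TOY_RUN_INLINE, TOY_RUN_DEFERRED };

struct toy_hctx {
    bool stopped;        /* like BLK_MQ_S_STOPPED */
    bool mapped;         /* at least one software queue maps to this queue */
    uint64_t cpumask;    /* bit n set: CPU n is served by this hardware queue */
};

enum toy_run toy_run_hw_queue(const struct toy_hctx *h, bool async,
                              unsigned int this_cpu)
{
    assert(h != NULL && this_cpu < 64);
    if (h->stopped || !h->mapped)
        return TOY_RUN_SKIPPED;
    /* A synchronous run happens inline only on a CPU that belongs to the queue. */
    if (!async && (h->cpumask & (UINT64_C(1) << this_cpu)))
        return TOY_RUN_INLINE;
    return TOY_RUN_DEFERRED;    /* everything else goes to the worker */
}
```

Note that a synchronous call issued from a CPU outside the queue's cpumask is deferred as well, which is what the fall-through to kblockd_schedule_delayed_work_on in the listing does.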

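That is the execution path of the MQ queues.

For a more detailed walkthrough, see the blog post "Linux Block Layer塊設備層基於MultiQueue的部分源碼分析" (a partial source-code analysis of the MultiQueue-based Linux block layer).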

5. MQ queue scheduler

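The walkthrough above covers how MQ executes requests but does not touch the scheduler. After asking around, I learned that MQ has a scheduler module as well. It mainly schedules within a single task, that is, it schedules the requests on one or more software queues. The goal is similar to that of the traditional CFQ: take the characteristics of the underlying disk into account so that the storage medium's IO reads and writes are used more efficiently.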

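Note: the number of software queues matches the number of CPU cores. While an application runs, CPU scheduling determines which cores it executes on. An application can run on several cores, so the IO it issues lands in several software queues and is scheduled across them. This is my own guess and has not been verified.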

6. Questions and answers

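Now that the kernel supports MQ, suppose one hardware channel is very busy and IO is badly blocked on it. How does the upper-layer application find out, and can the mapping between software and hardware queues be changed at that point?
Answer: the mapping between software and hardware queues is fixed during hardware initialization and cannot be changed afterwards. When a hardware channel is IO-intensive, the software queues mapped to that hardware queue identify the CPUs with heavy IO. Knowing the state of each core, the CPU scheduler can then reschedule when application tasks are dispatched, so that heavy IO is not bound to an already busy core.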

Reference blogs:

https://blog.csdn.net/g382112762/article/details/79606485
https://blog.csdn.net/notbaron/article/details/81147591
https://blog.csdn.net/yedushu/article/details/82050933
https://hyunyoung2.github.io/2016/09/14/Multi_Queue/
