Now, I'm interested in the multi-queue block layer and open source.

So, this article is my study of the multi-queue block layer.

This article is based on two sources: The Multi-Queue Interface article and the NVM Express 1.2 specification.

Also, I recommend the site "Linux Kernel Newbies" for kernel newbies.

And if you want to browse the Linux kernel source, just visit this site, Linux Kernel's Version.

!!! So, if you want to know why multi-queue is good for SSDs, just try reading this.

Multi-queue Block Layer

This summary of the multi-queue block layer comes from The Multi-Queue Interface article.

The Linux kernel git history shows the conversion to blk-mq.

The blk_mq API implements a two-level block layer design which uses two separate sets of request queues:

software staging queues, allocated per-CPU

hardware dispatch queues, whose number typically matches the number of actual hardware queues supported by the block device.

As you can see in the picture above, the number of software staging queues and the number of hardware dispatch queues can differ.

Now, let's think about how the two sets of queues are mapped onto each other.

There are three cases to consider (a toy sketch after this list makes them concrete):

  1. software staging queues > hardware dispatch queues

In this case, two or more software staging queues are allocated to one hardware context. When a dispatch is performed, the hardware context pulls requests in from all of its associated software queues.

  2. software staging queues < hardware dispatch queues

In this case, the mapping between software staging queues and hardware dispatch queues is sequential.

  3. software staging queues == hardware dispatch queues

This is the simplest case: a direct 1:1 mapping is performed.
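
To make the three cases concrete, here is a toy user-space sketch of a software-to-hardware queue mapping. It is not the kernel's actual algorithm (that lives in block/blk-mq-cpumap.c); the simple modulo rule is an assumption chosen only for illustration.

#include <stdio.h>

/*
 * Toy mapping from per-CPU software staging queues to hardware dispatch
 * queues. NOT the kernel's real algorithm (see block/blk-mq-cpumap.c);
 * it only illustrates the three cases listed above.
 */
static void build_map(unsigned int *map, unsigned int nr_sw, unsigned int nr_hw)
{
        unsigned int cpu;

        for (cpu = 0; cpu < nr_sw; cpu++)
                map[cpu] = cpu % nr_hw; /* cases 1 and 3: share a hw queue, or map 1:1 */
        /* case 2 (nr_hw > nr_sw): hw queues with index >= nr_sw simply stay unused */
}

int main(void)
{
        unsigned int map[8], cpu;
        unsigned int nr_sw = 8, nr_hw = 2;      /* case 1: four CPUs share each hw queue */

        build_map(map, nr_sw, nr_hw);
        for (cpu = 0; cpu < nr_sw; cpu++)
                printf("sw queue (cpu) %u -> hw queue %u\n", cpu, map[cpu]);
        return 0;
}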

Main data structures in the multi-queue block layer

Basic Architecture

1. blk_mq_reg (in kernel 3.15, removed by 4.5)

According to The Multi-Queue Interface article,

the blk_mq_reg structure contains all the important information needed while registering a new block device with the block layer.

This data structure includes the pointer to a blk_mq_ops data structure, used to keep track of the specific routines to be used by the multi-queue block layer to interact with the device’s driver.

The blk_mq_reg structure also keeps the number of hardware queues to be initialized, and so on.

** However, blk_mq_reg has since disappeared.

I think I need to look at blk_mq_ops to understand the interaction between the block layer and the block device.

So you can check the data structure in kernel 3.15:

struct blk_mq_reg {
        struct blk_mq_ops       *ops;
        unsigned int            nr_hw_queues;
        unsigned int            queue_depth;
        unsigned int            reserved_tags;
        unsigned int            cmd_size;       /* per-request extra data */
        int                     numa_node;
        unsigned int            timeout;
        unsigned int            flags;          /* BLK_MQ_F_* */
};

But in kernel 4.5, you can no longer find this data structure.

I think this data structure was replaced by struct blk_mq_tag_set in kernel 4.5:

struct blk_mq_tag_set {
        struct blk_mq_ops       *ops;
        unsigned int            nr_hw_queues;
        unsigned int            queue_depth;    /* max hw supported */
        unsigned int            reserved_tags;
        unsigned int            cmd_size;       /* per-request extra data */
        int                     numa_node;
        unsigned int            timeout;
        unsigned int            flags;          /* BLK_MQ_F_* */
        void                    *driver_data;

        struct blk_mq_tags      **tags;

        struct mutex            tag_list_lock;
        struct list_head        tag_list;
};

This is because the kernel 3.15 function struct request_queue *blk_mq_init_queue(struct blk_mq_reg *reg, void *driver_data) was changed to struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set) in kernel 4.5.

2. blk_mq_ops structure (in kernel 4.5)

As you can read above, this data structure is used by the multi-queue block layer to communicate with the block device's driver.

In this data structure, the function that performs the mapping of contexts between blk_mq_hw_ctx and blk_mq_ctx is stored in the map_queue field.

struct blk_mq_ops {
        /*
         * Queue request
         */
        queue_rq_fn             *queue_rq; // this part

        /*
         * Map to specific hardware queue
         */
        map_queue_fn            *map_queue; // this part

        /*
         * Called on request timeout
         */
        timeout_fn              *timeout;

        /*
         * Called to poll for completion of a specific tag.
         */
        poll_fn                 *poll;

        softirq_done_fn         *complete;

        /*
         * Called when the block layer side of a hardware queue has been
         * set up, allowing the driver to allocate/init matching structures.
         * Ditto for exit/teardown.
         */
        init_hctx_fn            *init_hctx;
        exit_hctx_fn            *exit_hctx;

        /*
         * Called for every command allocated by the block layer to allow
         * the driver to set up driver specific data.
         *
         * Tag greater than or equal to queue_depth is for setting up
         * flush request.
         *
         * Ditto for exit/teardown.
         */
        init_request_fn         *init_request;
        exit_request_fn         *exit_request;
};
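
To see how a driver fills this in, here is a minimal sketch of a hypothetical driver's blk_mq_ops table for a 4.5-era kernel. The mydrv_* names are assumptions standing in for driver code; blk_mq_map_queue is the generic mapping helper that drivers of that era commonly pass as map_queue. Treat it as a sketch, not a definitive implementation.

#include <linux/blk-mq.h>

/* Hypothetical driver callbacks; bodies omitted here. */
static int mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
                          const struct blk_mq_queue_data *bd);
static int mydrv_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
                           unsigned int index);

static struct blk_mq_ops mydrv_mq_ops = {
        .queue_rq       = mydrv_queue_rq,       /* hand a request to the device */
        .map_queue      = blk_mq_map_queue,     /* generic sw-to-hw queue mapping */
        .init_hctx      = mydrv_init_hctx,      /* per-hardware-queue setup */
};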

3. blk_mq_hw_ctx structure (in kernel 4.5)

The blk_mq_hw_ctx structure represents a hardware context that a request_queue is associated with.

Its software-side counterpart is the blk_mq_ctx structure (in kernel 4.5).

struct blk_mq_hw_ctx {
        struct {
                spinlock_t              lock;
                struct list_head        dispatch;
        } ____cacheline_aligned_in_smp;

        unsigned long           state;          /* BLK_MQ_S_* flags */
        struct delayed_work     run_work;
        struct delayed_work     delay_work;
        cpumask_var_t           cpumask;
        int                     next_cpu;
        int                     next_cpu_batch;

        unsigned long           flags;          /* BLK_MQ_F_* flags */

        struct request_queue    *queue;
        struct blk_flush_queue  *fq;

        void                    *driver_data;

        struct blk_mq_ctxmap    ctx_map;

        unsigned int            nr_ctx;
        struct blk_mq_ctx       **ctxs;

        atomic_t                wait_index;

        struct blk_mq_tags      *tags;

        unsigned long           queued;
        unsigned long           run;
#define BLK_MQ_MAX_DISPATCH_ORDER       10
        unsigned long           dispatched[BLK_MQ_MAX_DISPATCH_ORDER];

        unsigned int            numa_node;
        unsigned int            queue_num;

        atomic_t                nr_active;

        struct blk_mq_cpu_notifier      cpu_notifier;
        struct kobject          kobj;

        unsigned long           poll_invoked;
        unsigned long           poll_success;
};

4. blk_mq_ctx structure (in kernel 4.5)

As you can read above, blk_mq_ctx represents the software staging queue and is allocated per CPU.

struct blk_mq_ctx {
        struct {
                spinlock_t              lock;
                struct list_head        rq_list;
        }  ____cacheline_aligned_in_smp;

        unsigned int            cpu;
        unsigned int            index_hw;

        unsigned int            last_tag ____cacheline_aligned_in_smp;

        /* incremented at dispatch time */
        unsigned long           rq_dispatched[2];
        unsigned long           rq_merged;

        /* incremented at completion time */
        unsigned long           ____cacheline_aligned_in_smp rq_completed[2];

        struct request_queue    *queue;
        struct kobject          kobj;
} ____cacheline_aligned_in_smp;
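
The per-CPU aspect is the kernel's ordinary per-CPU allocation pattern: the request_queue's queue_ctx area is allocated with alloc_percpu(), and the staging queue of the submitting CPU is picked with per_cpu_ptr(). The sketch below is my paraphrase of how the lookup works (the function name example_get_ctx is mine; the real lookup lives in blk_mq_get_ctx() in block/blk-mq.h, so this snippet only compiles inside the block layer).

/*
 * Rough paraphrase of how the software staging queue for the current CPU
 * is found in 4.x kernels; the function name is illustrative only.
 */
static struct blk_mq_ctx *example_get_ctx(struct request_queue *q)
{
        unsigned int cpu = get_cpu();   /* disables preemption, returns this CPU's id */
        struct blk_mq_ctx *ctx = per_cpu_ptr(q->queue_ctx, cpu);

        /* ... caller uses ctx and later releases the CPU with put_cpu() ... */
        return ctx;
}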

5. the request_queue structure (in kernel 4.5)

The mapping of contexts between blk_mq_hw_ctx and blk_mq_ctx is built by the function stored in the map_queue field of the blk_mq_ops structure. In kernel 4.5 the mapping is kept in the mq_map field of the request_queue data structure related to the block device.

struct request_queue {
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        struct request          *last_merge;
        struct elevator_queue   *elevator;
        int                     nr_rqs[2];      /* # allocated [a]sync rqs */
        int                     nr_rqs_elvpriv; /* # allocated rqs w/ elvpriv */

        /*
         * If blkcg is not used, @q->root_rl serves all requests.  If blkcg
         * is used, root blkg allocates from @q->root_rl and all other
         * blkgs from their own blkg->rl.  Which one to use should be
         * determined using bio_request_list().
         */
        struct request_list     root_rl;

        request_fn_proc         *request_fn;
        make_request_fn         *make_request_fn;
        prep_rq_fn              *prep_rq_fn;
        unprep_rq_fn            *unprep_rq_fn;
        softirq_done_fn         *softirq_done_fn;
        rq_timed_out_fn         *rq_timed_out_fn;
        dma_drain_needed_fn     *dma_drain_needed;
        lld_busy_fn             *lld_busy_fn;

        struct blk_mq_ops       *mq_ops;

        unsigned int            *mq_map;

        /* sw queues */
        struct blk_mq_ctx __percpu      *queue_ctx;
        unsigned int            nr_queues;

        /* hw dispatch queues */
        struct blk_mq_hw_ctx    **queue_hw_ctx;
        unsigned int            nr_hw_queues;

        /*
         * Dispatch queue sorting
         */
        sector_t                end_sector;
        struct request          *boundary_rq;

        /*
         * Delayed queue handling
         */
        struct delayed_work     delay_work;

        struct backing_dev_info backing_dev_info;

        /*
         * The queue owner gets to use this for whatever they like.
         * ll_rw_blk doesn't touch it.
         */
        void                    *queuedata;

        /*
         * various queue flags, see QUEUE_* below
         */
        unsigned long           queue_flags;

        /*
         * ida allocated id for this queue.  Used to index queues from
         * ioctx.
         */
        int                     id;

        /*
         * queue needs bounce pages for pages above this limit
         */
        gfp_t                   bounce_gfp;

        /*
         * protects queue structures from reentrancy. ->__queue_lock should
         * _never_ be used directly, it is queue private. always use
         * ->queue_lock.
         */
        spinlock_t              __queue_lock;
        spinlock_t              *queue_lock;

        /*
         * queue kobject
         */
        struct kobject kobj;

        /*
         * mq queue kobject
         */
        struct kobject mq_kobj;

#ifdef  CONFIG_BLK_DEV_INTEGRITY
        struct blk_integrity integrity;
#endif  /* CONFIG_BLK_DEV_INTEGRITY */

#ifdef CONFIG_PM
        struct device           *dev;
        int                     rpm_status;
        unsigned int            nr_pending;
#endif

        /*
         * queue settings
         */
        unsigned long           nr_requests;    /* Max # of requests */
        unsigned int            nr_congestion_on;
        unsigned int            nr_congestion_off;
        unsigned int            nr_batching;

        unsigned int            dma_drain_size;
        void                    *dma_drain_buffer;
        unsigned int            dma_pad_mask;
        unsigned int            dma_alignment;

        struct blk_queue_tag    *queue_tags;
        struct list_head        tag_busy_list;

        unsigned int            nr_sorted;
        unsigned int            in_flight[2];
        /*
         * Number of active block driver functions for which blk_drain_queue()
         * must wait. Must be incremented around functions that unlock the
         * queue_lock internally, e.g. scsi_request_fn().
         */
        unsigned int            request_fn_active;

        unsigned int            rq_timeout;
        struct timer_list       timeout;
        struct work_struct      timeout_work;
        struct list_head        timeout_list;

        struct list_head        icq_list;
#ifdef CONFIG_BLK_CGROUP
        DECLARE_BITMAP          (blkcg_pols, BLKCG_MAX_POLS);
        struct blkcg_gq         *root_blkg;
        struct list_head        blkg_list;
#endif

        struct queue_limits     limits;

        /*
         * sg stuff
         */
        unsigned int            sg_timeout;
        unsigned int            sg_reserved_size;
        int                     node;
#ifdef CONFIG_BLK_DEV_IO_TRACE
        struct blk_trace        *blk_trace;
#endif
        /*
         * for flush operations
         */
        unsigned int            flush_flags;
        unsigned int            flush_not_queueable:1;
        struct blk_flush_queue  *fq;

        struct list_head        requeue_list;
        spinlock_t              requeue_lock;
        struct work_struct      requeue_work;

        struct mutex            sysfs_lock;

        int                     bypass_depth;
        atomic_t                mq_freeze_depth;

#if defined(CONFIG_BLK_DEV_BSG)
        bsg_job_fn              *bsg_job_fn;
        int                     bsg_job_size;
        struct bsg_class_device bsg_dev;
#endif

#ifdef CONFIG_BLK_DEV_THROTTLING
        /* Throttle data */
        struct throtl_data *td;
#endif
        struct rcu_head         rcu_head;
        wait_queue_head_t       mq_freeze_wq;
        struct percpu_ref       q_usage_counter;
        struct list_head        all_q_node;

        struct blk_mq_tag_set   *tag_set;
        struct list_head        tag_set_list;
        struct bio_set          *bio_split;

        bool                    mq_sysfs_init_done;
};
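
The mq_map array is what makes the lookup from a software context to its hardware context cheap: it is indexed by CPU number and stores the index of the hardware queue serving that CPU. Roughly, the 4.x helper blk_mq_map_queue() boils down to the following paraphrase (check block/blk-mq.c for the exact code).

/* Paraphrase of blk_mq_map_queue() in block/blk-mq.c (4.x kernels):
 * mq_map[cpu] holds the index of the hardware dispatch queue that the
 * software staging queue of this CPU is mapped to. */
struct blk_mq_hw_ctx *blk_mq_map_queue(struct request_queue *q, const int cpu)
{
        return q->queue_hw_ctx[q->mq_map[cpu]];
}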

Initialization of Queue

When a new device driver using the multi-queue API is loaded, it creates and initializes a new blk_mq_ops structure and sets the corresponding pointer of a new blk_mq_reg to its address.

More in detail, apart from the operations listed below, the other operations are not strictly required,

but they can be specified in order to perform specific actions on allocation of contexts or on completion of an I/O request.

As for required data, the driver must initialize the number of submission queues it supports, along with their size.

Other data are also required to determine the size of the commands supported by the driver and the specific flags that must be exposed to the block layer.

In kernel 4.5, however, struct blk_mq_tag_set is the important structure: all of the above information is carried in it.

queue_rq

This must be set to a function in charge of handling the command, for example by passing the command on to the low-level driver.

map_queue

This performs the mapping between hardware and software contexts.

blk_mq_init_queue function (in kernel 4.5 source)

After getting the gendisk and the request_queue related to the device ready, the driver invokes the blk_mq_init_queue function (in kernel 4.5 source).

This function initializes the hardware and software contexts and performs the mapping between them.

This initialization routine also sets an alternate make_request function, substituting for the conventional request submission path (which would include the function blk_queue_bio() in kernel 4.5 code) the multi-queue submission path (which includes, instead, blk_mq_make_request() in kernel 4.5 code).

In other words, the alternate make_request function is set with the blk_queue_make_request() helper.
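
Putting the pieces together, the registration flow in a 4.5-era driver looks roughly like the sketch below. The mydrv_* names, queue depth, and command size are assumptions chosen for illustration (mydrv_mq_ops is the ops table sketched earlier); blk_mq_alloc_tag_set(), blk_mq_init_queue(), and blk_mq_free_tag_set() are the real API entry points.

#include <linux/blk-mq.h>
#include <linux/blkdev.h>
#include <linux/err.h>
#include <linux/string.h>

struct mydrv_cmd { int dummy; };                /* per-request extra data (assumed) */

static struct blk_mq_tag_set mydrv_tag_set;
static struct request_queue *mydrv_queue;

static int mydrv_init_queue(void)
{
        int ret;

        memset(&mydrv_tag_set, 0, sizeof(mydrv_tag_set));
        mydrv_tag_set.ops          = &mydrv_mq_ops;      /* blk_mq_ops shown earlier */
        mydrv_tag_set.nr_hw_queues = 1;                   /* hardware dispatch queues */
        mydrv_tag_set.queue_depth  = 64;                  /* requests per hw queue (arbitrary) */
        mydrv_tag_set.numa_node    = NUMA_NO_NODE;
        mydrv_tag_set.cmd_size     = sizeof(struct mydrv_cmd);
        mydrv_tag_set.flags        = BLK_MQ_F_SHOULD_MERGE;

        ret = blk_mq_alloc_tag_set(&mydrv_tag_set);       /* allocate tags for each hw queue */
        if (ret)
                return ret;

        mydrv_queue = blk_mq_init_queue(&mydrv_tag_set);  /* build sw/hw contexts and their mapping */
        if (IS_ERR(mydrv_queue)) {
                blk_mq_free_tag_set(&mydrv_tag_set);
                return PTR_ERR(mydrv_queue);
        }
        return 0;
}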

Submission of Request

Device initialization substituted the conventional block I/O submission function with blk_mq_make_request() (in kernel 4.5 code), letting the multi-queue structures be used from the perspective of the upper layers.

The make_request function used by the multi-queue block layer (the name does not exist as such in kernel 4.5 code) includes the possibility to benefit from per-process plugging, but only for drivers supporting a single hardware queue or for async requests.

If the request is sync and the driver actively uses the multi-queue interface, no plugging is performed.

If plugging is allowed, the make_request function also performs request merging, searching for a candidate first inside the task's plug list

and finally in the software queue mapped to the current CPU. The submission path does not involve any I/O scheduling-related callbacks.

Finally, make_request immediately sends any sync request to the corresponding hardware queue, while it delays this transition in case of async or flush requests, to allow for subsequent merging and more efficient dispatching.

Request dispatch

In case an I/O request is synchronous (and therefore no plugging is allowed for it by the multi-queue block layer),

its dispatch to the device driver is performed in the context of the same request submission.

If the request is instead async or flush, and task plugging is present,

the dispatch is performed at one of the following times:

1 - in the context of the submission of another I/O request to a software queue associated to the same hardware queue;

2 - when the delayed work scheduled during request submission is executed.

The main run-of-queue function of the multi-queue block layer is blk_mq_run_hw_queue() (in kernel 4.5 code), which basically depends on another driver-specific routine, pointed to by the queue_rq field (in kernel 4.5) of its blk_mq_ops structure.

!!I have to check relationship between blk_mq_run_hw_queue and request_qu!!

!!This function delays any run of the queue for an async request, while it dispatches a sync request immediately to the driver.

The inner function __blk_mq_run_hw_queue() (in kernel 4.5 code), called by blk_mq_run_hw_queue() (in kernel 4.5 code) in case the request is sync, first joins any software queues associated to the currently-in-service hardware queue, then it joins the resulting list with any entries already on the dispatch list.

After collecting all to-be-served entries, __blk_mq_run_hw_queue() (in kernel 4.5 code) processes them, starting each request and passing it on to the driver with its queue_rq function.

The function finally handles possible errors by requeueing or deleting the associated requests.
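
For completeness, here is a sketch of the driver side of that hand-off: a queue_rq callback as it would look on a 4.5-era kernel. The mydrv_ prefix marks assumed names; a real driver would program its hardware where this toy completes the request immediately (similar in spirit to null_blk).

#include <linux/blk-mq.h>

static int mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
                          const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        /* blk_mq_rq_to_pdu(rq) would give the per-request mydrv_cmd area
         * reserved through cmd_size in the tag set. */

        blk_mq_start_request(rq);       /* mark the request as started (enables timeout handling) */

        /*
         * A real driver would build a device command here and submit it to
         * its hardware queue; this toy completes the request immediately.
         */
        blk_mq_end_request(rq, 0);

        return BLK_MQ_RQ_QUEUE_OK;      /* or BLK_MQ_RQ_QUEUE_BUSY to retry later */
}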

Reference