Re: [PATCH 4/8] blk-mq: Facilitate a shared sbitmap per tagset

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 29 Nov 2019 08:25:40 +0800

On Wed, Nov 27, 2019 at 06:02:54PM +0100, Hannes Reinecke wrote:
> On 11/26/19 4:54 PM, Ming Lei wrote:
> > On Tue, Nov 26, 2019 at 12:27:50PM +0100, Hannes Reinecke wrote:
> > > On 11/26/19 12:05 PM, Ming Lei wrote:
> [ .. ]
> > > >  From performance viewpoint, all hctx belonging to this request queue should
> > > > share one scheduler tagset in case of BLK_MQ_F_TAG_HCTX_SHARED, cause
> > > > driver tag queue depth isn't changed.
> > > > 
> > > Hmm. Now you get me confused.
> > > In an earlier mail you said:
> > > 
> > > > This kind of sharing is wrong, sched tags should be request
> > > > queue wide instead of tagset wide, and each request queue has
> > > > its own & independent scheduler queue.
> > > 
> > > as in v2 we _had_ shared scheduler tags, too.
> > > Did I misread your comment above?
> > 
> > Yes, what I meant is that we can't share sched tags in tagset wide.
> > 
> > Now I mean we should share sched tags among all hctxs in same request
> > queue, and I believe I have described it clearly.
> > 
> I wonder if this makes a big difference; in the end, scheduler tags are
> primarily there to allow the scheduler to queue more requests, and
> potentially merge them. These tags are later converted into 'real' ones via
> blk_mq_get_driver_tag(), and only then the resource limitation takes hold.
> Wouldn't it be sufficient to look at the number of outstanding commands per
> queue when getting a scheduler tag, and not having to implement yet another
> bitmap?

Firstly, too much memory is wasted ((nr_hw_queues - 1) times what is
needed). Secondly, IO latency could be increased by a scheduler queue that
is too deep. Finally, CPU could be wasted on retrying to run a busy hw
queue.
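
To put the first point in numbers, here is a rough userspace sketch (not
kernel code). It assumes every hctx preallocates a full set of nr_requests
scheduler requests, while one set per request queue would be enough. The
function names and the rq_bytes parameter are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes for one scheduler tag set: nr_requests preallocated requests. */
size_t sched_set_bytes(size_t nr_requests, size_t rq_bytes)
{
	return nr_requests * rq_bytes;
}

/*
 * Extra bytes spent when each of nr_hw_queues hctxs owns a full set
 * instead of all hctxs in the request queue sharing a single one.
 */
size_t sched_bytes_wasted(size_t nr_hw_queues, size_t nr_requests,
			  size_t rq_bytes)
{
	assert(nr_hw_queues >= 1);
	return (nr_hw_queues - 1) * sched_set_bytes(nr_requests, rq_bytes);
}
```

With 16 hw queues, 256 requests and a hypothetical 512 bytes per request,
15 of the 16 sets are overhead.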

Wrt. driver tags, this patch may be worse, given that the average limit
for each LUN is reduced by a factor of nr_hw_queues; see hctx_may_queue().
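
A simplified userspace model of that fair-share check, written from memory
of hctx_may_queue() (rounded-up division of the depth by the number of
active users, with a floor of 4), so please check it against your tree.
The function names are made up, and it assumes, as said above, that the
user count grows by a factor of nr_hw_queues with this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Share of the tag space granted to one active user, never below 4. */
unsigned int fair_share_depth(unsigned int total_depth, unsigned int users)
{
	unsigned int share;

	assert(users > 0);
	share = (total_depth + users - 1) / users;	/* round up */
	return share < 4 ? 4 : share;
}

/* A user may take another tag only while it is below its share. */
bool may_queue(unsigned int nr_active, unsigned int total_depth,
	       unsigned int users)
{
	return nr_active < fair_share_depth(total_depth, users);
}
```

For a depth of 1024 and 8 active LUNs the share is 128. If each of 16 hw
queues counts as a separate user, there are 128 users and the share drops
to 8.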

Another change is in bt_wait_ptr(). Before your patches there was a single
.wait_index; now there are nr_hw_queues of them.
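
A toy userspace model of the difference (not kernel code). It assumes the
index is read and then advanced round-robin over 8 wait queues, which is
how I understand sbq_wait_ptr() and SBQ_WAIT_QUEUES. The function names,
and the choice to start every per-hctx index at 0, are made up for
illustration.

```c
#include <assert.h>

#define NR_WAIT_QUEUES 8	/* assumed to match SBQ_WAIT_QUEUES */

/* Return the current wait-queue slot and advance the index round-robin. */
unsigned int pick_wait_slot(unsigned int *wait_index)
{
	unsigned int slot = *wait_index;

	*wait_index = (slot + 1) % NR_WAIT_QUEUES;
	return slot;
}

/*
 * One waiter arrives from each of nr_hctx hw queues on the same bitmap.
 * Count the distinct wait queues used, with one shared index or with a
 * private index per hctx (each starting at 0 in this model).
 */
unsigned int slots_used(unsigned int nr_hctx, int per_hctx_index)
{
	unsigned int shared = 0, seen = 0, used = 0, i;

	for (i = 0; i < nr_hctx; i++) {
		unsigned int own = 0;
		unsigned int slot;

		slot = pick_wait_slot(per_hctx_index ? &own : &shared);
		if (!(seen & (1u << slot))) {
			seen |= 1u << slot;
			used++;
		}
	}
	return used;
}
```

In this model a single index spreads four waiters over four wait queues,
while four private indexes put all of them on the same one. Whether that
matters in practice needs measuring.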

Also, the number of queue runs is increased a lot in SCSI's IO completion
path; see scsi_end_request().

Kashyap Desai has a performance benchmark on fast megaraid SSDs, and you
can ask him to provide performance data for these patches.

Thanks,
Ming