Re: [PATCH 4/8] blk-mq: Facilitate a shared sbitmap per tagset

On 29/11/2019 00:25, Ming Lei wrote:
On Wed, Nov 27, 2019 at 06:02:54PM +0100, Hannes Reinecke wrote:
On 11/26/19 4:54 PM, Ming Lei wrote:
On Tue, Nov 26, 2019 at 12:27:50PM +0100, Hannes Reinecke wrote:
On 11/26/19 12:05 PM, Ming Lei wrote:
[ .. ]
From a performance viewpoint, all hctxs belonging to this request queue
should share one scheduler tagset in the BLK_MQ_F_TAG_HCTX_SHARED case,
because the driver tag queue depth isn't changed.

Hmm. Now you get me confused.
In an earlier mail you said:

This kind of sharing is wrong, sched tags should be request
queue wide instead of tagset wide, and each request queue has
its own & independent scheduler queue.

as in v2 we _had_ shared scheduler tags, too.
Did I misread your comment above?

Yes, what I meant is that we can't share sched tags tagset-wide.

Now I mean we should share sched tags among all hctxs in the same
request queue, and I believe I have described it clearly.

I wonder if this makes a big difference; in the end, scheduler tags are
primarily there to allow the scheduler to queue more requests, and
potentially merge them. These tags are later converted into 'real' ones
via blk_mq_get_driver_tag(), and only then does the resource limitation
take hold. Wouldn't it be sufficient to look at the number of
outstanding commands per queue when getting a scheduler tag, rather
than having to implement yet another bitmap?
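
To make that suggestion concrete, here is a minimal userspace-only
sketch of gating scheduler-tag allocation on an outstanding-request
counter instead of a second sbitmap; the depth value and all names are
made up for illustration and are not from the series:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define SCHED_QUEUE_DEPTH 256           /* illustrative per-queue sched depth */

static atomic_uint nr_sched_inflight;   /* requests holding a sched "slot" */

static bool sched_slot_get(void)
{
        unsigned int cur = atomic_load(&nr_sched_inflight);

        /* claim a slot only while we are under the sched depth */
        do {
                if (cur >= SCHED_QUEUE_DEPTH)
                        return false;
        } while (!atomic_compare_exchange_weak(&nr_sched_inflight,
                                               &cur, cur + 1));
        return true;
}

static void sched_slot_put(void)
{
        atomic_fetch_sub(&nr_sched_inflight, 1);
}

int main(void)
{
        unsigned int got = 0;

        /* try to take more slots than the depth allows */
        for (int i = 0; i < SCHED_QUEUE_DEPTH + 10; i++)
                if (sched_slot_get())
                        got++;

        printf("claimed %u of %d attempted slots\n",
               got, SCHED_QUEUE_DEPTH + 10);
        sched_slot_put();
        return 0;
}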

Firstly, too much memory ((nr_hw_queues - 1) times as much) is wasted.
Secondly, IO latency could be increased by an overly deep scheduler
queue depth. Finally, CPU could be wasted on retrying the run of a busy
hw queue.
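
A back-of-the-envelope example of the first point, with made-up depths,
just to show where the (nr_hw_queues - 1) factor comes from:

#include <stdio.h>

int main(void)
{
        unsigned int nr_hw_queues = 16;   /* e.g. one hctx per irq vector */
        unsigned int sched_depth = 256;   /* per-queue scheduler depth */

        /* per-hctx sched tags duplicate the tag space per hctx */
        unsigned int per_hctx_total = nr_hw_queues * sched_depth;
        /* a request-queue-wide set needs only one copy */
        unsigned int per_queue_total = sched_depth;

        printf("per-hctx sched tags:  %u slots\n", per_hctx_total);
        printf("per-queue sched tags: %u slots\n", per_queue_total);
        printf("allocated %ux what one queue-wide set needs; wasted copies: %u\n",
               per_hctx_total / per_queue_total, nr_hw_queues - 1);
        return 0;
}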

Wrt. driver tags, this patch may be worse, given that the average limit
for each LUN is reduced by a factor of nr_hw_queues; see
hctx_may_queue().
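
For readers following along, a rough userspace model of the fair-share
calculation in hctx_may_queue() (simplified; the real function reads the
depth and user count from the sbitmap_queue and tag set, and the details
differ):

#include <stdio.h>

/*
 * Simplified model: a shared tag depth is divided among the active
 * users, with a small floor so every user can still make progress.
 * How "depth" and "users" are accounted (per hctx vs. per request
 * queue) is exactly what decides the per-LUN limit discussed above.
 */
static unsigned int fair_share(unsigned int depth, unsigned int users)
{
        unsigned int share = (depth + users - 1) / users;  /* DIV_ROUND_UP */

        return share > 4 ? share : 4;   /* allow at least some tags */
}

int main(void)
{
        /* example numbers only */
        printf("1024 tags, 8 active users -> limit %u per user\n",
               fair_share(1024, 8));
        printf("  64 tags, 8 active users -> limit %u per user\n",
               fair_share(64, 8));
        return 0;
}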

Another change is bt_wait_ptr(). Before your patches there was a single
.wait_index; now the number of .wait_index instances grows to
nr_hw_queues.
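
For context, a toy model of what that .wait_index bookkeeping does
(sizes are arbitrary; the real code is bt_wait_ptr()/sbq_wait_ptr()
walking the sbitmap_queue's wait queues):

#include <stdio.h>

#define NR_WAIT_QUEUES 8        /* stands in for the kernel's SBQ_WAIT_QUEUES */

struct toy_hctx {
        unsigned int wait_index;        /* per-hctx round-robin cursor */
};

/* return the current wait queue for this hctx, then advance the cursor */
static unsigned int toy_wait_ptr(struct toy_hctx *hctx)
{
        unsigned int ws = hctx->wait_index;

        hctx->wait_index = (hctx->wait_index + 1) % NR_WAIT_QUEUES;
        return ws;
}

int main(void)
{
        struct toy_hctx hctxs[4] = { { 0 } };   /* e.g. nr_hw_queues = 4 */

        /* each hctx now walks the shared wait queues with its own cursor */
        for (int i = 0; i < 4; i++)
                printf("hctx%d first picks wait queue %u\n",
                       i, toy_wait_ptr(&hctxs[i]));
        return 0;
}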

Also, the number of hw queue runs in SCSI's IO completion path is
increased a lot; see scsi_end_request().
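
To put a rough number on that, a trivial model with made-up rates: if
each completion kicks all hw queues of the request queue (the
scsi_end_request() -> blk_mq_run_hw_queues() path referenced above),
the queue-run count per completed IO scales with nr_hw_queues:

#include <stdio.h>

int main(void)
{
        unsigned long completions = 1000000;    /* e.g. 1M completions/sec */
        unsigned int nr_hw_queues_before = 1;   /* single hctx + reply_map */
        unsigned int nr_hw_queues_after = 16;   /* one hctx per irq vector */

        printf("queue runs/sec before: %lu\n",
               completions * nr_hw_queues_before);
        printf("queue runs/sec after:  %lu\n",
               completions * nr_hw_queues_after);
        return 0;
}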

Kashyap Desai has run performance benchmarks on fast megaraid SSDs, and
you can ask him to provide performance data for these patches.

On the v2 series (which is effectively the same as this one [it would
be nice if we had per-patch versioning]), for hisi_sas_v3_hw we get
about the same performance as when we use the reply_map: about 3.0M
IOPS vs 3.1M IOPS, respectively.

Without this, we get 700-800K IOPS. I don't know why the performance is
so poor in that case. Only CPU0 serves the completion interrupts, which
could explain it, but v2 hw can get > 800K IOPS with only 6x SSDs.

Thanks,
John


