On 5/18/21 1:16 PM, Ming Lei wrote:
> On Tue, May 18, 2021 at 10:44:43AM +0100, John Garry wrote:
>> On 14/05/2021 03:20, Ming Lei wrote:
>>> In case of a shared sbitmap, requests are no longer held in the plug
>>> list since commit 32bc15afed04 ("blk-mq: Facilitate a shared sbitmap
>>> per tagset"). This makes request merging from the flushed plug list
>>> and batched submission impossible, causing a performance regression.
>>>
>>> Yanhui reports a performance regression when running a sequential IO
>>> test (libaio, 16 jobs, 8 depth for each job) in a VM whose disk is
>>> emulated with an image stored on xfs/megaraid_sas.
>>>
>>> Fix the issue by restoring the original behavior of holding requests
>>> in the plug list.
>>
>> Hi Ming,
>>
>> Since testing v5.13-rc2, I noticed that this patch made the hang I was
>> seeing disappear:
>> https://lore.kernel.org/linux-scsi/3d72d64d-314f-9d34-e039-7e508b2abe1b@xxxxxxxxxx/
>>
>> I don't think that problem is solved, though.
>
> This kind of hang or lockup is usually related to CPU utilization, and
> this patch may reduce CPU utilization in the submission context.
>
>>
>> So I wonder about throughput performance (I had hoped to test before it
>> was merged). I only have 6x SAS SSDs at hand, but I see some significant
>> changes (good and bad) for mq-deadline on hisi_sas:
>> Before: 620K IOPS (read), 300K IOPS (randread)
>> After:  430K IOPS (read), 460-490K IOPS (randread)
>
> The 'Before 620K' could be caused by a busy queue when batching
> submission isn't applied, which increases the chance of merging. This
> patch applies batching submission, so the queue doesn't become busy
> enough.
>
> BTW, what is the queue depth of the sdev and the can_queue of the shost
> for your HiSilicon SAS?
>
>>
>> The none IO scheduler is always about 450K (read) and 500K (randread).
>>
>> Do you guys have any figures? Are my results as expected?
>
> In Yanhui's virt workload (qemu, libaio, dio, high queue depth, single
> job), the patch improves throughput considerably (>50%) when running
> sequential writes (dio, libaio, 16 jobs) to XFS, and IO merging is
> observed to recover to the level seen with host tags disabled.
>
I've found a good test case for this using null_blk:

modprobe null_blk submit_queues=32 queue_mode=2 irqmode=0 bs=4096 \
        hw_queue_depth=2048 completion_nsec=1 nr_devices=4 \
        shared_tags=1 shared_tag_bitmap=1

and a simple fio job with libaio, rw=read, and numjobs=32 will do the
trick:

[nullb]
rw=read
size=16g
ioengine=libaio
direct=1
buffered=0
group_reporting=1
bs=4096
numjobs=32

(Of course, the 'numjobs' and 'submit_queues' parameters would need to be
adjusted to your machine.)

Omitting the 'shared_tag_bitmap' module parameter yields around 12M IOPS;
adding it causes a rather dramatic drop to 172K IOPS. With this patchset
it's back to the expected IOPS value.

Cheers,

Hannes
--
Dr. Hannes Reinecke                Kernel Storage Architect
hare@xxxxxxx                              +49 911 74053 688
SUSE Software Solutions Germany GmbH, 90409 Nürnberg
GF: F. Imendörffer, HRB 36809 (AG Nürnberg)
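
For reference, a minimal sketch combining the module load and the fio job
above into one reproduction script. The /dev/nullb0 target path and the
use of $(nproc) for the queue/job count are assumptions, not taken from
the thread, so adjust them to the machine under test:

#!/bin/sh
# Reproduction sketch for the null_blk test case described above.
# Assumptions (not from the thread): /dev/nullb0 is the first null_blk
# device, and $(nproc) is a reasonable value for submit_queues/numjobs.

JOBS=$(nproc)

# Reload null_blk with a shared tag bitmap across the 4 devices.
modprobe -r null_blk 2>/dev/null
modprobe null_blk submit_queues=$JOBS queue_mode=2 irqmode=0 bs=4096 \
        hw_queue_depth=2048 completion_nsec=1 nr_devices=4 \
        shared_tags=1 shared_tag_bitmap=1

# Sequential read job equivalent to the [nullb] job file above.
fio --name=nullb --filename=/dev/nullb0 --rw=read --size=16g \
    --ioengine=libaio --direct=1 --buffered=0 --group_reporting=1 \
    --bs=4096 --numjobs=$JOBS

Dropping shared_tag_bitmap=1 from the modprobe line gives the ~12M IOPS
baseline to compare against.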