On 1/23/24 1:03 PM, Oleksandr Natalenko wrote:
> Hello.
> 
> On Tuesday, January 23, 2024 18:34:12 CET Jens Axboe wrote:
>> Hi,
>>
>> It's no secret that mq-deadline doesn't scale very well - it was
>> originally done as a proof-of-concept conversion from deadline, when
>> the blk-mq multiqueue layer was written. In the single-queue world,
>> the queue lock protected the IO scheduler as well, and mq-deadline
>> simply adopted an internal dd->lock to fill that role.
>>
>> While mq-deadline works under blk-mq and doesn't suffer any scaling
>> issues on that side, as soon as request insertion or dispatch is
>> done, we're hitting the per-queue dd->lock quite intensely. On a
>> basic test box with 16 cores / 32 threads, running a number of
>> IO-intensive threads on either null_blk (single hw queue) or nvme0n1
>> (many hw queues) shows this quite easily.
>>
>> The test case looks like this:
>>
>> fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
>> 	--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
>> 	--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
>> 	--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
>> 	--name=scaletest --filename=/dev/$DEV
>>
>> which is 32 threads each doing 4 IOs, for a total queue depth of 128,
>> and is being run on a desktop 7950X box.
>>
>> Results before the patches:
>>
>> Device     IOPS   sys    contention   diff
>> ====================================================
>> null_blk   879K   89%    93.6%
>> nvme0n1    901K   86%    94.5%
>>
>> which looks pretty miserable; most of the time is spent contending
>> on the queue lock.
>>
>> This RFC patchset attempts to address that by:
>>
>> 1) Serializing dispatch of requests. If we fail dispatching, rely on
>>    the next completion to dispatch the next one. This could
>>    potentially reduce the overall depth achieved on the device side,
>>    however even for the heavily contended test I'm running here, no
>>    observable change is seen. This is patch 2; a rough sketch of the
>>    idea follows this list.
>>
>> 2) Serializing request insertion, using internal per-cpu lists to
>>    temporarily store requests until insertion can proceed. This is
>>    patch 3.
>>
>> 3) Skipping expensive merges if the queue is already contended.
>>    Reasoning is provided in that patch, patch 4; this is also
>>    sketched below.
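>>
>> To make 1) and 3) concrete, here's roughly what the two changes look
>> like. This is a simplified sketch rather than the patches verbatim -
>> treat DD_DISPATCHING, dd->run_state, and __dd_dispatch_request() as
>> illustrative names. Dispatch claims an atomic "busy" bit with the
>> locking bitops before touching dd->lock:
>>
>> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> {
>> 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> 	struct request *rq;
>>
>> 	/*
>> 	 * If someone else is already dispatching, don't contend on
>> 	 * dd->lock - the next completion will attempt another dispatch.
>> 	 */
>> 	if (test_and_set_bit_lock(DD_DISPATCHING, &dd->run_state))
>> 		return NULL;
>>
>> 	spin_lock(&dd->lock);
>> 	rq = __dd_dispatch_request(dd);
>> 	spin_unlock(&dd->lock);
>>
>> 	clear_bit_unlock(DD_DISPATCHING, &dd->run_state);
>> 	return rq;
>> }
>>
>> and merging simply gives up if the lock is contended, since a skipped
>> merge lookup is cheap while lock contention is not - related IO will
>> usually have been merged in the plug already anyway:
>>
>> static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
>> 			 unsigned int nr_segs)
>> {
>> 	struct deadline_data *dd = q->elevator->elevator_data;
>> 	struct request *free = NULL;
>> 	bool ret;
>>
>> 	/* Lock is busy, skip this opportunistic merge attempt */
>> 	if (!spin_trylock(&dd->lock))
>> 		return false;
>>
>> 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
>> 	spin_unlock(&dd->lock);
>>
>> 	if (free)
>> 		blk_mq_free_request(free);
>> 	return ret;
>> }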
>>
>> With that in place, the same test case now does:
>>
>> Device     IOPS    sys     contention   diff
>> ====================================================
>> null_blk   2867K   11.1%   ~6.0%        +226%
>> nvme0n1    3162K    9.9%   ~5.0%        +250%
>>
>> and while that doesn't completely eliminate the lock contention,
>> it's oodles better than what it was before. The throughput increase
>> shows that nicely, with more than a 200% improvement for both cases.
>>
>> Since the above is very high IOPS testing to show the scalability
>> limitations, I also ran this on a more normal drive on a Dell R7525
>> test box. It doesn't change the performance there (around 66K IOPS),
>> but it does reduce the system time required to do the IO from 12.6%
>> to 10.7%, or about 20% less time spent in the kernel.
>>
>>  block/mq-deadline.c | 178 +++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 161 insertions(+), 17 deletions(-)
>>
>> Since v2:
>> - Update the mq-deadline insertion locking optimization patch to use
>>   Bart's variant instead. This also drops the per-cpu buckets, and
>>   hence resolves the need to potentially make the number of buckets
>>   dependent on the host.
>> - Use locking bitops
>> - Add a similar series for BFQ, with good results as well
>> - Rebase on 6.8-rc1
> 
> I've been running this for a couple of days with no issues, hence for
> the series:
> 
> Tested-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>

That's great to know, thanks for testing!

-- 
Jens Axboe