Hello.

On Tuesday, January 23, 2024 18:34:12 CET Jens Axboe wrote:
> Hi,
>
> It's no secret that mq-deadline doesn't scale very well - it was
> originally done as a proof-of-concept conversion from deadline, when the
> blk-mq multiqueue layer was written. In the single queue world, the
> queue lock protected the IO scheduler as well, and mq-deadline simply
> adopted an internal dd->lock to fill that role.
>
> While mq-deadline works under blk-mq and doesn't suffer any scaling issues
> on that side, as soon as request insertion or dispatch is done, we're
> hitting the per-queue dd->lock quite intensely. On a basic test box
> with 16 cores / 32 threads, running a number of IO intensive threads
> on either null_blk (single hw queue) or nvme0n1 (many hw queues) shows
> this quite easily.
>
> The test case looks like this:
>
> fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
>	--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
>	--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
>	--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
>	--name=scaletest --filename=/dev/$DEV
>
> which is 32 threads each doing 4 IOs, for a total queue depth of 128,
> and is being run on a desktop 7950X box.
>
> Results before the patches:
>
> Device		IOPS	sys	contention	diff
> ====================================================
> null_blk	879K	89%	93.6%
> nvme0n1	901K	86%	94.5%
>
> which looks pretty miserable; most of the time is spent contending on
> the queue lock.
>
> This RFC patchset attempts to address that by:
>
> 1) Serializing dispatch of requests. If we fail dispatching, rely on
>    the next completion to dispatch the next one. This could potentially
>    reduce the overall depth achieved on the device side, however even
>    for the heavily contended test I'm running here, no observable
>    change is seen. This is patch 2.
>
> 2) Serializing request insertion, using internal per-cpu lists to
>    temporarily store requests until insertion can proceed. This is
>    patch 3.
>
> 3) Skipping expensive merges if the queue is already contended.
>    Reasoning is provided in that patch, patch 4.
>
> With that in place, the same test case now does:
>
> Device		IOPS	sys	contention	diff
> ====================================================
> null_blk	2867K	11.1%	~6.0%		+226%
> nvme0n1	3162K	9.9%	~5.0%		+250%
>
> and while that doesn't completely eliminate the lock contention, it's
> oodles better than what it was before. The throughput increase shows
> that nicely, with more than a 200% improvement for both cases.
>
> Since the above is very high IOPS testing to show the scalability
> limitations, I also ran this on a more normal drive on a Dell R7525 test
> box. It doesn't change the performance there (around 66K IOPS), but
> it does reduce the system time required to do the IO from 12.6% to
> 10.7%, or about 20% less time spent in the kernel.
>
>  block/mq-deadline.c | 178 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 161 insertions(+), 17 deletions(-)
>
> Since v2:
> - Update mq-deadline insertion locking optimization patch to
>   use Bart's variant instead. This also drops the per-cpu
>   buckets and hence resolves the need to potentially make
>   the number of buckets dependent on the host.
> - Use locking bitops
> - Add similar series for BFQ, with good results as well
> - Rebase on 6.8-rc1

I've been running this for a couple of days with no issues, hence for the
series:

Tested-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>

Thank you.
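P.S. For anyone skimming the series, the dispatch serialization from patch 2
boils down to putting a try-lock-style gate in front of dd->lock, so that
only one context at a time bothers fighting for it. Below is my own minimal
sketch of that pattern, not the actual patch; the names DD_DISPATCHING,
run_state and dd_dispatch_one() are illustrative:

	/*
	 * Only one context dispatches at a time. If the lock bit is
	 * already held, bail out; the active dispatcher (or the next
	 * completion) will pick up the pending work, so we never pile
	 * up waiters on dd->lock from the dispatch path.
	 */
	static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
	{
		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
		struct request *rq = NULL;

		if (test_and_set_bit_lock(DD_DISPATCHING, &dd->run_state))
			return NULL;

		spin_lock(&dd->lock);
		rq = dd_dispatch_one(dd);	/* next rq by deadline rules */
		spin_unlock(&dd->lock);

		clear_bit_unlock(DD_DISPATCHING, &dd->run_state);
		return rq;
	}

The locking bitops mentioned in the v2 notes (test_and_set_bit_lock() /
clear_bit_unlock()) give the acquire/release semantics a plain bitop lacks.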
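My rough understanding of the insertion side (patch 3, in Bart's variant) is
that producers only take a cheap, dedicated lock to park requests on a
staging list, and the dispatcher later drains that list into the real
sort/FIFO structures while it already holds dd->lock. Again a sketch under
assumed names (insert_lock, insert_list, dd_insert_one()), not the patch
itself:

	static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
				       struct list_head *list,
				       blk_insert_t flags)
	{
		struct deadline_data *dd = hctx->queue->elevator->elevator_data;

		/* Cheap staging; no contention on the hot dd->lock here. */
		spin_lock(&dd->insert_lock);
		list_splice_tail_init(list, &dd->insert_list);
		spin_unlock(&dd->insert_lock);
	}

	/* Called by the dispatcher while it holds dd->lock. */
	static void dd_drain_inserts(struct deadline_data *dd)
	{
		LIST_HEAD(pending);

		spin_lock(&dd->insert_lock);
		list_splice_init(&dd->insert_list, &pending);
		spin_unlock(&dd->insert_lock);

		while (!list_empty(&pending)) {
			struct request *rq = list_first_entry(&pending,
					struct request, queuelist);

			list_del_init(&rq->queuelist);
			dd_insert_one(dd, rq);	/* sort into RB-tree + FIFO */
		}
	}

This is also why the per-cpu buckets from the earlier revision could go
away: a single staging list behind its own lock is enough once the
dispatcher is the only consumer.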
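And the merge shortcut from patch 4 reads to me as "merging is only an
optimization, so don't queue on a contended lock just to look for a merge
candidate". A sketch, assuming the stock shape of dd_bio_merge() with
spin_lock() swapped for a trylock:

	static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
				 unsigned int nr_segs)
	{
		struct deadline_data *dd = q->elevator->elevator_data;
		struct request *free = NULL;
		bool ret;

		/* Contended? Skip the merge lookup, insert the bio as-is. */
		if (!spin_trylock(&dd->lock))
			return false;

		ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
		spin_unlock(&dd->lock);

		if (free)
			blk_mq_free_request(free);

		return ret;
	}

A missed merge just means a slightly less efficient request, whereas a
missed dispatch would stall IO, hence merging is the safe thing to skip.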
-- 
Oleksandr Natalenko (post-factum)