[PATCHSET RFC v2 0/4] mq-deadline scalability improvements

Jens Axboe <axboe@xxxxxxxxx> · Fri, 19 Jan 2024 09:02:05 -0700

Hi,

It's no secret that mq-deadline doesn't scale very well - it was
originally done as a proof-of-concept conversion from deadline, when the
blk-mq multiqueue layer was written. In the single queue world, the
queue lock protected the IO scheduler as well, and mq-deadline simply
adopted an internal dd->lock to fill the place of that.

While mq-deadline works under blk-mq and doesn't suffer any scaling on
that side, as soon as request insertion or dispatch is done, we're
hitting the per-queue dd->lock quite intensely. On a basic test box
with 16 cores / 32 threads, running a number of IO intensive threads
on either null_blk (single hw queue) or nvme0n1 (many hw queues) shows
this quite easily:

The test case looks like this:

fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
	--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
	--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
	--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
	--name=scaletest --filename=/dev/$DEV

and is being run on a desktop 7950X box.

which is 32 threads each doing 4 IOs, for a total queue depth of 128.

Results before the patches:

Device		IOPS	sys	contention	diff
====================================================
null_blk	879K	89%	93.6%
nvme0n1		901K	86%	94.5%

which looks pretty miserable, most of the time is spent contending on
the queue lock.

This RFC patchset attempts to address that by:

1) Serializing dispatch of requests. If we fail dispatching, rely on
   the next completion to dispatch the next one. This could potentially
   reduce the overall depth achieved on the device side, however even
   for the heavily contended test I'm running here, no observable
   change is seen. This is patch 2.

2) Serialize request insertion, using internal per-cpu lists to
   temporarily store requests until insertion can proceed. This is
   patch 3.

3) Skip expensive merges if the queue is already contended. Reasonings
   provided in that patch, patch 4.

With that in place, the same test case now does:

Device		IOPS	sys	contention	diff
====================================================
null_blk	2311K	10.3%	21.1%		+257%
nvme0n1		2610K	11.0%	24.6%		+289%

and while that doesn't completely eliminate the lock contention, it's
oodles better than what it was before. The throughput increase shows
that nicely, with more than a 200% improvement for both cases.

Since the above is very high IOPS testing to show the scalability
limitations, I also ran this on a more normal drive on a Dell R7525 test
box. It doesn't change the performance there (around 66K IOPS), but
it does reduce the system time required to do the IO from 12.6% to
10.7%, or about 20% less time spent in the kernel.

TODO: probably make DD_CPU_BUCKETS depend on number of online CPUs
in the system, rather than just default to 32.

 block/mq-deadline.c | 178 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 161 insertions(+), 17 deletions(-)

Since v1:
	- Redid numbers as I enabled fixedbufs for the workload,
	  and did not disable expensive merges but rather left it
	  at the kernel default settings.
	- Cleanup the insertion handling, also more efficient now.
	- Add merge contention patch
	- Expanded some of the commit messages
	- Add/expand code comments
	- Rebase to current master as block bits landed

-- 
Jens Axboe