Hello.

On Tuesday, January 23, 2024 18:34:12 CET Jens Axboe wrote:
> Hi,
>
> It's no secret that mq-deadline doesn't scale very well - it was
> originally done as a proof-of-concept conversion from deadline, when the
> blk-mq multiqueue layer was written. In the single queue world, the
> queue lock protected the IO scheduler as well, and mq-deadline simply
> adopted an internal dd->lock to fill that role.
>
> While mq-deadline works under blk-mq and doesn't suffer any scaling issues
> on that side, as soon as request insertion or dispatch is done, we're
> hitting the per-queue dd->lock quite intensely. On a basic test box
> with 16 cores / 32 threads, running a number of IO intensive threads
> on either null_blk (single hw queue) or nvme0n1 (many hw queues) shows
> this quite easily.
>
> The test case looks like this:
>
> fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
>	--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
>	--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
>	--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
>	--name=scaletest --filename=/dev/$DEV
>
> which is 32 threads each doing 4 IOs, for a total queue depth of 128,
> and is being run on a desktop 7950X box.
>
> Results before the patches:
>
> Device		IOPS	sys	contention	diff
> ====================================================
> null_blk	879K	89%	93.6%
> nvme0n1	901K	86%	94.5%
>
> which looks pretty miserable; most of the time is spent contending on
> the queue lock.
>
> This RFC patchset attempts to address that by:
>
> 1) Serializing dispatch of requests. If we fail dispatching, rely on
>    the next completion to dispatch the next one. This could potentially
>    reduce the overall depth achieved on the device side, however even
>    for the heavily contended test I'm running here, no observable
>    change is seen. This is patch 2.
>
> 2) Serializing request insertion, using internal per-cpu lists to
>    temporarily store requests until insertion can proceed. This is
>    patch 3.
>
> 3) Skipping expensive merges if the queue is already contended.
>    Reasoning is provided in that patch, patch 4.
>
> With that in place, the same test case now does:
>
> Device		IOPS	sys	contention	diff
> ====================================================
> null_blk	2867K	11.1%	~6.0%		+226%
> nvme0n1	3162K	9.9%	~5.0%		+250%
>
> and while that doesn't completely eliminate the lock contention, it's
> oodles better than what it was before. The throughput increase shows
> that nicely, with more than a 200% improvement for both cases.
>
> Since the above is very high IOPS testing to show the scalability
> limitations, I also ran this on a more normal drive on a Dell R7525 test
> box. It doesn't change the performance there (around 66K IOPS), but
> it does reduce the system time required to do the IO from 12.6% to
> 10.7%, or about 20% less time spent in the kernel.
>
>  block/mq-deadline.c | 178 +++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 161 insertions(+), 17 deletions(-)
>
> Since v2:
> - Update mq-deadline insertion locking optimization patch to
>   use Bart's variant instead. This also drops the per-cpu
>   buckets and hence resolves the need to potentially make
>   the number of buckets dependent on the host.
> - Use locking bitops
> - Add similar series for BFQ, with good results as well
> - Rebase on 6.8-rc1

I've been running this for a couple of days with no issues, hence for the
series:

Tested-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>

Thank you.
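P.S. For anyone skimming the series, the dispatch serialization from patch 2
boils down to putting a try-lock-style gate in front of dd->lock, so that
only one context at a time bothers fighting for it. Below is my own minimal
sketch of that pattern, not the actual patch; the names DD_DISPATCHING,
run_state and dd_dispatch_one() are illustrative:

	/*
	 * Only one context dispatches at a time. If the lock bit is
	 * already held, bail out; the active dispatcher (or the next
	 * completion) will pick up the pending work, so we never pile
	 * up waiters on dd->lock from the dispatch path.
	 */
	static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
	{
		struct deadline_data *dd = hctx->queue->elevator->elevator_data;
		struct request *rq = NULL;

		if (test_and_set_bit_lock(DD_DISPATCHING, &dd->run_state))
			return NULL;

		spin_lock(&dd->lock);
		rq = dd_dispatch_one(dd);	/* next rq by deadline rules */
		spin_unlock(&dd->lock);

		clear_bit_unlock(DD_DISPATCHING, &dd->run_state);
		return rq;
	}

The locking bitops mentioned in the v2 notes (test_and_set_bit_lock() /
clear_bit_unlock()) give the acquire/release semantics a plain bitop lacks.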
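My rough understanding of the insertion side (patch 3, in Bart's variant) is
that producers only take a cheap, dedicated lock to park requests on a
staging list, and the dispatcher later drains that list into the real
sort/FIFO structures while it already holds dd->lock. Again a sketch under
assumed names (insert_lock, insert_list, dd_insert_one()), not the patch
itself:

	static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
				       struct list_head *list,
				       blk_insert_t flags)
	{
		struct deadline_data *dd = hctx->queue->elevator->elevator_data;

		/* Cheap staging; no contention on the hot dd->lock here. */
		spin_lock(&dd->insert_lock);
		list_splice_tail_init(list, &dd->insert_list);
		spin_unlock(&dd->insert_lock);
	}

	/* Called by the dispatcher while it holds dd->lock. */
	static void dd_drain_inserts(struct deadline_data *dd)
	{
		LIST_HEAD(pending);

		spin_lock(&dd->insert_lock);
		list_splice_init(&dd->insert_list, &pending);
		spin_unlock(&dd->insert_lock);

		while (!list_empty(&pending)) {
			struct request *rq = list_first_entry(&pending,
					struct request, queuelist);

			list_del_init(&rq->queuelist);
			dd_insert_one(dd, rq);	/* sort into RB-tree + FIFO */
		}
	}

This is also why the per-cpu buckets from the earlier revision could go
away: a single staging list behind its own lock is enough once the
dispatcher is the only consumer.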
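And the merge shortcut from patch 4 reads to me as "merging is only an
optimization, so don't queue on a contended lock just to look for a merge
candidate". A sketch, assuming the stock shape of dd_bio_merge() with
spin_lock() swapped for a trylock:

	static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
				 unsigned int nr_segs)
	{
		struct deadline_data *dd = q->elevator->elevator_data;
		struct request *free = NULL;
		bool ret;

		/* Contended? Skip the merge lookup, insert the bio as-is. */
		if (!spin_trylock(&dd->lock))
			return false;

		ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
		spin_unlock(&dd->lock);

		if (free)
			blk_mq_free_request(free);

		return ret;
	}

A missed merge just means a slightly less efficient request, whereas a
missed dispatch would stall IO, hence merging is the safe thing to skip.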
-- 
Oleksandr Natalenko (post-factum)