On 1/23/24 1:03 PM, Oleksandr Natalenko wrote:
> Hello.
> 
> On Tuesday, January 23, 2024 18:34:12 CET Jens Axboe wrote:
>> Hi,
>>
>> It's no secret that mq-deadline doesn't scale very well - it was
>> originally done as a proof-of-concept conversion from deadline, when
>> the blk-mq multiqueue layer was written. In the single-queue world,
>> the queue lock protected the IO scheduler as well, and mq-deadline
>> simply adopted an internal dd->lock to fill that role.
>>
>> While mq-deadline works under blk-mq and doesn't suffer any scaling
>> issues on that side, as soon as request insertion or dispatch is
>> done, we're hitting the per-queue dd->lock quite intensely. On a
>> basic test box with 16 cores / 32 threads, running a number of
>> IO-intensive threads on either null_blk (single hw queue) or nvme0n1
>> (many hw queues) shows this quite easily.
>>
>> The test case looks like this:
>>
>> fio --bs=512 --group_reporting=1 --gtod_reduce=1 --invalidate=1 \
>> 	--ioengine=io_uring --norandommap --runtime=60 --rw=randread \
>> 	--thread --time_based=1 --buffered=0 --fixedbufs=1 --numjobs=32 \
>> 	--iodepth=4 --iodepth_batch_submit=4 --iodepth_batch_complete=4 \
>> 	--name=scaletest --filename=/dev/$DEV
>>
>> which is 32 threads each doing 4 IOs, for a total queue depth of 128,
>> and is being run on a desktop 7950X box.
>>
>> Results before the patches:
>>
>> Device     IOPS   sys    contention   diff
>> ====================================================
>> null_blk   879K   89%    93.6%
>> nvme0n1    901K   86%    94.5%
>>
>> which looks pretty miserable; most of the time is spent contending
>> on the queue lock.
>>
>> This RFC patchset attempts to address that by:
>>
>> 1) Serializing dispatch of requests. If we fail dispatching, rely on
>>    the next completion to dispatch the next one. This could
>>    potentially reduce the overall depth achieved on the device side,
>>    however even for the heavily contended test I'm running here, no
>>    observable change is seen. This is patch 2; a rough sketch of the
>>    idea follows this list.
>>
>> 2) Serializing request insertion, using internal per-cpu lists to
>>    temporarily store requests until insertion can proceed. This is
>>    patch 3.
>>
>> 3) Skipping expensive merges if the queue is already contended.
>>    Reasoning is provided in that patch, patch 4; this is also
>>    sketched below.
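>>
>> To make 1) and 3) concrete, here's roughly what the two changes look
>> like. This is a simplified sketch rather than the patches verbatim -
>> treat DD_DISPATCHING, dd->run_state, and __dd_dispatch_request() as
>> illustrative names. Dispatch claims an atomic "busy" bit with the
>> locking bitops before touching dd->lock:
>>
>> static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
>> {
>> 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
>> 	struct request *rq;
>>
>> 	/*
>> 	 * If someone else is already dispatching, don't contend on
>> 	 * dd->lock - the next completion will attempt another dispatch.
>> 	 */
>> 	if (test_and_set_bit_lock(DD_DISPATCHING, &dd->run_state))
>> 		return NULL;
>>
>> 	spin_lock(&dd->lock);
>> 	rq = __dd_dispatch_request(dd);
>> 	spin_unlock(&dd->lock);
>>
>> 	clear_bit_unlock(DD_DISPATCHING, &dd->run_state);
>> 	return rq;
>> }
>>
>> and merging simply gives up if the lock is contended, since a skipped
>> merge lookup is cheap while lock contention is not - related IO will
>> usually have been merged in the plug already anyway:
>>
>> static bool dd_bio_merge(struct request_queue *q, struct bio *bio,
>> 			 unsigned int nr_segs)
>> {
>> 	struct deadline_data *dd = q->elevator->elevator_data;
>> 	struct request *free = NULL;
>> 	bool ret;
>>
>> 	/* Lock is busy, skip this opportunistic merge attempt */
>> 	if (!spin_trylock(&dd->lock))
>> 		return false;
>>
>> 	ret = blk_mq_sched_try_merge(q, bio, nr_segs, &free);
>> 	spin_unlock(&dd->lock);
>>
>> 	if (free)
>> 		blk_mq_free_request(free);
>> 	return ret;
>> }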
>>
>> With that in place, the same test case now does:
>>
>> Device     IOPS    sys     contention   diff
>> ====================================================
>> null_blk   2867K   11.1%   ~6.0%        +226%
>> nvme0n1    3162K    9.9%   ~5.0%        +250%
>>
>> and while that doesn't completely eliminate the lock contention,
>> it's oodles better than what it was before. The throughput increase
>> shows that nicely, with more than a 200% improvement for both cases.
>>
>> Since the above is very high IOPS testing to show the scalability
>> limitations, I also ran this on a more normal drive on a Dell R7525
>> test box. It doesn't change the performance there (around 66K IOPS),
>> but it does reduce the system time required to do the IO from 12.6%
>> to 10.7%, or about 20% less time spent in the kernel.
>>
>>  block/mq-deadline.c | 178 +++++++++++++++++++++++++++++++++++++++-----
>>  1 file changed, 161 insertions(+), 17 deletions(-)
>>
>> Since v2:
>> - Update the mq-deadline insertion locking optimization patch to use
>>   Bart's variant instead. This also drops the per-cpu buckets, and
>>   hence resolves the need to potentially make the number of buckets
>>   dependent on the host.
>> - Use locking bitops
>> - Add a similar series for BFQ, with good results as well
>> - Rebase on 6.8-rc1
> 
> I've been running this for a couple of days with no issues, hence for
> the series:
> 
> Tested-by: Oleksandr Natalenko <oleksandr@xxxxxxxxxxxxxx>

That's great to know, thanks for testing!

-- 
Jens Axboe