On 2/6/25 6:11 AM, Nitesh Shetty wrote:
> With the existing block layer stack, a single CPU driving multiple NVMe
> devices cannot extract the maximum device-advertised IOPS. In the
> example below, each SSD is capable of 5M IOPS (512b) [1].
>
> a. With 1 thread/CPU, we get 6.19M IOPS, which cannot saturate two
>    devices [2].
> b. With 2 threads on 2 CPUs from the same core, we get 6.89M IOPS [3].
> c. With 2 threads on 2 CPUs from different cores, we are able to
>    saturate both SSDs [4].
>
> So a single core will not be enough to saturate a backend with more
> than 6.89M IOPS. With PCIe Gen6, we might see devices capable of ~6M
> IOPS, and roughly double that with Gen7.
>
> There have been past attempts to improve efficiency which did not move
> forward:
> a. DMA pre-mapping [5]: to avoid the per-I/O DMA cost
> b. io_uring attached NVMe queues [6]: to reduce the code needed to do
>    the I/O and trim the kernel-config dependency
>
> So the discussion points are:
>
> - Should some of the above be revisited?
> - Do we expect the new DMA API [7] to improve efficiency?
> - It seems the iov_iter work [8] may also help?
> - Are there other thoughts on how to reduce the extra core that we
>   take now?
>
> Thanks,
> Nitesh
>
> [1]
> Note: obtained by disabling kernel config options like blk-cgroups and
> write-back throttling
>
> sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 -r3 /dev/nvme0n1
> submitter=0, tid=3584444, file=/dev/nvme0n1, nfiles=1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=4.99M, BW=2.44GiB/s, IOS/call=32/31
> IOPS=5.02M, BW=2.45GiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=5.02M
>
> [2]
> sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 -r3 /dev/nvme0n1 /dev/nvme1n1
> submitter=0, tid=3958383, file=/dev/nvme1n1, nfiles=2, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=6.19M, BW=3.02GiB/s, IOS/call=32/31
> IOPS=6.18M, BW=3.02GiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=6.19M
>
> [3]
> Note: CPUs 0,1 are mapped to the same core
> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r3 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=3708980, file=/dev/nvme1n1, nfiles=1, node=-1
> submitter=0, tid=3708979, file=/dev/nvme0n1, nfiles=1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=6.86M, BW=3.35GiB/s, IOS/call=32/31
> IOPS=6.89M, BW=3.36GiB/s, IOS/call=32/31
> Exiting on timeout
> Maximum IOPS=6.89M
>
> [4]
> Note: CPUs 0,2 are mapped to different cores
> sudo taskset -c 0,2 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r3 /dev/nvme0n1 /dev/nvme1n1
> submitter=0, tid=3588355, file=/dev/nvme0n1, nfiles=1, node=-1
> submitter=1, tid=3588356, file=/dev/nvme1n1, nfiles=1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=9.89M, BW=4.83GiB/s, IOS/call=31/31
> IOPS=10.00M, BW=4.88GiB/s, IOS/call=31/31
> Exiting on timeout
> Maximum IOPS=10.00M

While I'm always interested in making per-core IOPS better as it relates to better efficiency in the IO stack, and have done a LOT of work in this area in the past, for this particular case it's also worth highlighting that I bet you could get a lot better performance by doing something smarter with polling multiple devices than what t/io_uring is currently doing - completing 32 requests on each device before moving on to the other one is probably not the best approach. t/io_uring is simply not designed very well for that.
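As a rough illustration only - an untested sketch, not t/io_uring code: one IOPOLL ring per device, and each pass reaps whatever has already completed on a ring before moving on to the next one, instead of draining a fixed batch of 32 from one device first. Device paths, block size, queue depth, and the single reused buffer are placeholders, and it assumes liburing 2.3+ for io_uring_get_events() plus NVMe poll queues being enabled.

/*
 * Sketch: one thread, one IOPOLL ring per device, round-robin polling.
 * Placeholder sizes; error handling and cqe->res checking are elided.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_DEVS	2
#define QD	128
#define BS	512

struct dev {
	struct io_uring ring;
	int fd;
	void *buf;
	unsigned inflight;
	unsigned long done;
};

/* keep the device at full queue depth; one reused buffer, offset 0 */
static void queue_io(struct dev *d)
{
	while (d->inflight < QD) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&d->ring);

		if (!sqe)
			break;
		io_uring_prep_read(sqe, d->fd, d->buf, BS, 0);
		d->inflight++;
	}
	io_uring_submit(&d->ring);
}

/* poll once, then take whatever is done right now - don't wait for a batch */
static void reap_io(struct dev *d)
{
	struct io_uring_cqe *cqes[QD];
	unsigned nr;

	io_uring_get_events(&d->ring);
	nr = io_uring_peek_batch_cqe(&d->ring, cqes, QD);
	io_uring_cq_advance(&d->ring, nr);
	d->inflight -= nr;
	d->done += nr;
}

int main(int argc, char **argv)
{
	static struct dev devs[NR_DEVS];
	int i;

	if (argc < NR_DEVS + 1) {
		fprintf(stderr, "usage: %s <dev0> <dev1>\n", argv[0]);
		return 1;
	}
	for (i = 0; i < NR_DEVS; i++) {
		struct io_uring_params p = { .flags = IORING_SETUP_IOPOLL };
		struct dev *d = &devs[i];

		d->fd = open(argv[i + 1], O_RDONLY | O_DIRECT);
		if (d->fd < 0 || io_uring_queue_init_params(QD, &d->ring, &p))
			return 1;
		if (posix_memalign(&d->buf, 4096, BS))
			return 1;
	}

	/* round-robin the devices, never stalling on one of them */
	for (;;) {
		for (i = 0; i < NR_DEVS; i++) {
			queue_io(&devs[i]);
			reap_io(&devs[i]);
		}
	}
	return 0;
}

Not benchmark-quality, obviously, but that's the reap granularity in question - take what's ready on each device and move on.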
IOW, I do like this topic, but I think it'd be worthwhile to generate some better numbers with a more targeted approach to polling multiple devices from a single thread first, rather than take t/io_uring in its current form as gospel on that front.

-- 
Jens Axboe