On 2/6/25 6:11 AM, Nitesh Shetty wrote:
> With the existing block layer stack, a single CPU driving multiple NVMe
> devices cannot extract the maximum device-advertised IOPS. In the
> example below, each SSD is capable of 5M IOPS (512b) [1].
>
> a. With 1 thread/CPU, we get 6.19M IOPS, which cannot saturate two
>    devices [2].
> b. With 2 threads on 2 CPUs from the same core, we get 6.89M IOPS [3].
> c. With 2 threads on 2 CPUs from different cores, we are able to
>    saturate both SSDs [4].
>
> So a single core will not be enough to saturate a backend with more
> than 6.89M IOPS. With PCIe Gen6, we might see devices capable of ~6M
> IOPS, and roughly double that with Gen7.
>
> There have been past attempts to improve efficiency which did not move
> forward:
> a. DMA pre-mapping [5]: to avoid the per-I/O DMA cost
> b. io_uring attached NVMe queues [6]: to reduce the code needed to do
>    the I/O and trim the kernel-config dependency
>
> So the discussion points are:
>
> - Should some of the above be revisited?
> - Do we expect the new DMA API [7] to improve efficiency?
> - It seems the iov_iter work [8] may also help?
> - Are there other thoughts on how to reduce the extra core that we
>   take now?
>
> Thanks,
> Nitesh
>
> [1]
> Note: obtained by disabling kernel config options like blk-cgroups and
> write-back throttling
>
> sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 -r3 /dev/nvme0n1
> submitter=0, tid=3584444, file=/dev/nvme0n1, nfiles=1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=4.99M, BW=2.44GiB/s, IOS/call=32/31
> IOPS=5.02M, BW=2.45GiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=5.02M
>
> [2]
> sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 -r3 /dev/nvme0n1 /dev/nvme1n1
> submitter=0, tid=3958383, file=/dev/nvme1n1, nfiles=2, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=6.19M, BW=3.02GiB/s, IOS/call=32/31
> IOPS=6.18M, BW=3.02GiB/s, IOS/call=32/32
> Exiting on timeout
> Maximum IOPS=6.19M
>
> [3]
> Note: CPUs 0,1 are mapped to the same core
> sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r3 /dev/nvme0n1 /dev/nvme1n1
> submitter=1, tid=3708980, file=/dev/nvme1n1, nfiles=1, node=-1
> submitter=0, tid=3708979, file=/dev/nvme0n1, nfiles=1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=6.86M, BW=3.35GiB/s, IOS/call=32/31
> IOPS=6.89M, BW=3.36GiB/s, IOS/call=32/31
> Exiting on timeout
> Maximum IOPS=6.89M
>
> [4]
> Note: CPUs 0,2 are mapped to different cores
> sudo taskset -c 0,2 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2 -r3 /dev/nvme0n1 /dev/nvme1n1
> submitter=0, tid=3588355, file=/dev/nvme0n1, nfiles=1, node=-1
> submitter=1, tid=3588356, file=/dev/nvme1n1, nfiles=1, node=-1
> polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
> Engine=io_uring, sq_ring=128, cq_ring=128
> IOPS=9.89M, BW=4.83GiB/s, IOS/call=31/31
> IOPS=10.00M, BW=4.88GiB/s, IOS/call=31/31
> Exiting on timeout
> Maximum IOPS=10.00M

While I'm always interested in making per-core IOPS better as it relates to better efficiency in the IO stack, and have done a LOT of work in this area in the past, for this particular case it's also worth highlighting that I bet you could get a lot better performance by doing something smarter with polling multiple devices than what t/io_uring is currently doing - completing 32 requests on each device before moving on to the other one is probably not the best approach. t/io_uring is simply not designed very well for that.
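As a rough illustration only - an untested sketch, not t/io_uring code: one IOPOLL ring per device, and each pass reaps whatever has already completed on a ring before moving on to the next one, instead of draining a fixed batch of 32 from one device first. Device paths, block size, queue depth, and the single reused buffer are placeholders, and it assumes liburing 2.3+ for io_uring_get_events() plus NVMe poll queues being enabled.

/*
 * Sketch: one thread, one IOPOLL ring per device, round-robin polling.
 * Placeholder sizes; error handling and cqe->res checking are elided.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_DEVS	2
#define QD	128
#define BS	512

struct dev {
	struct io_uring ring;
	int fd;
	void *buf;
	unsigned inflight;
	unsigned long done;
};

/* keep the device at full queue depth; one reused buffer, offset 0 */
static void queue_io(struct dev *d)
{
	while (d->inflight < QD) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&d->ring);

		if (!sqe)
			break;
		io_uring_prep_read(sqe, d->fd, d->buf, BS, 0);
		d->inflight++;
	}
	io_uring_submit(&d->ring);
}

/* poll once, then take whatever is done right now - don't wait for a batch */
static void reap_io(struct dev *d)
{
	struct io_uring_cqe *cqes[QD];
	unsigned nr;

	io_uring_get_events(&d->ring);
	nr = io_uring_peek_batch_cqe(&d->ring, cqes, QD);
	io_uring_cq_advance(&d->ring, nr);
	d->inflight -= nr;
	d->done += nr;
}

int main(int argc, char **argv)
{
	static struct dev devs[NR_DEVS];
	int i;

	if (argc < NR_DEVS + 1) {
		fprintf(stderr, "usage: %s <dev0> <dev1>\n", argv[0]);
		return 1;
	}
	for (i = 0; i < NR_DEVS; i++) {
		struct io_uring_params p = { .flags = IORING_SETUP_IOPOLL };
		struct dev *d = &devs[i];

		d->fd = open(argv[i + 1], O_RDONLY | O_DIRECT);
		if (d->fd < 0 || io_uring_queue_init_params(QD, &d->ring, &p))
			return 1;
		if (posix_memalign(&d->buf, 4096, BS))
			return 1;
	}

	/* round-robin the devices, never stalling on one of them */
	for (;;) {
		for (i = 0; i < NR_DEVS; i++) {
			queue_io(&devs[i]);
			reap_io(&devs[i]);
		}
	}
	return 0;
}

Not benchmark-quality, obviously, but that's the reap granularity in question - take what's ready on each device and move on.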
IOW, I do like this topic, but I think it'd be worthwhile to generate some better numbers with a more targeted approach to polling multiple devices from a single thread first, rather than take t/io_uring in its current form as gospel on that front.

-- 
Jens Axboe