With the existing block layer stack, a single CPU driving multiple NVMe
devices is not able to extract the maximum device-advertised IOPS.
In the example below, each SSD is capable of ~5M IOPS (512b) [1].
a. With 1 thread on 1 CPU, we get 6.19M IOPS, which cannot saturate two
devices [2].
b. With 2 threads on 2 CPUs of the same core, we get 6.89M IOPS [3].
c. With 2 threads on 2 CPUs of different cores, we are able to saturate
both SSDs [4].
So a single core will not be enough to saturate a backend capable of
more than 6.89M IOPS. With PCIe Gen6, we might see single devices
capable of ~6M IOPS, and roughly double that with Gen7.
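
For concreteness, here is a minimal liburing sketch of the kind of
polled, fixed-buffer read loop that t/io_uring (-p1 -F1 -B1) drives.
It is a simplification, not t/io_uring itself: batched submission,
queue-depth ramping, file registration, and error handling are all
omitted, and the iteration count is a placeholder.

/* Build with: gcc -O2 polled_read.c -luring */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

int main(void)
{
	struct io_uring ring;
	struct iovec iov;
	/* O_DIRECT: bypass the page cache, as t/io_uring does */
	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);

	posix_memalign(&iov.iov_base, 4096, 512);
	iov.iov_len = 512;

	/* IOPOLL (-p1): completions are reaped by polling the NVMe
	 * completion queue instead of taking interrupts */
	io_uring_queue_init(128, &ring, IORING_SETUP_IOPOLL);
	/* Fixed buffer (-B1): pages are pinned once, not per I/O */
	io_uring_register_buffers(&ring, &iov, 1);

	for (int i = 0; i < 1000000; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
		struct io_uring_cqe *cqe;

		io_uring_prep_read_fixed(sqe, fd, iov.iov_base, 512, 0, 0);
		io_uring_submit(&ring);
		io_uring_wait_cqe(&ring, &cqe);
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return 0;
}

Pinning this loop to one CPU (taskset -c 0) corresponds to the
single-submitter configuration measured in [1] and [2], minus the
submission batching that t/io_uring adds on top.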
There have been past attempts to improve efficiency which did not move
forward:
a. DMA pre-mapping [5]: to avoid the per-I/O DMA mapping cost (see the
sketch after this list).
b. io_uring-attached NVMe queues [6]: to shorten the code path needed
to do the I/O and trim the kernel-config dependency.
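
To illustrate the cost that the DMA pre-mapping work [5] targeted, here
is a conceptual kernel-style sketch contrasting per-I/O mapping with
mapping once at buffer-registration time. This is not the API proposed
in that series; issue_one_io(), premap_buf(), and struct premapped_buf
are hypothetical names for illustration only.

#include <linux/dma-mapping.h>

/* Today: every I/O pays for a map and an unmap (IOMMU/swiotlb work)
 * around the actual device transfer. */
static int issue_one_io(struct device *dev, struct page *page, size_t len)
{
	dma_addr_t addr = dma_map_page(dev, page, 0, len, DMA_FROM_DEVICE);

	if (dma_mapping_error(dev, addr))
		return -ENOMEM;
	/* ... build the NVMe command with 'addr', submit, complete ... */
	dma_unmap_page(dev, addr, len, DMA_FROM_DEVICE);
	return 0;
}

/* Pre-mapping idea: map when the buffer is registered (e.g. as an
 * io_uring fixed buffer), then reuse b->addr on the hot path with no
 * per-I/O map/unmap. */
struct premapped_buf {
	dma_addr_t addr;
	size_t len;
};

static int premap_buf(struct device *dev, struct page *page, size_t len,
		      struct premapped_buf *b)
{
	b->addr = dma_map_page(dev, page, 0, len, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, b->addr))
		return -ENOMEM;
	b->len = len;
	return 0;
}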
So the discussion points are:
- Should some of the above be revisited?
- Do we expect the new DMA API [7] to improve efficiency?
- It seems the iov_iter work [8] may also help?
- Are there other ideas for avoiding the extra core that we need now?
Thanks,
Nitesh
[1]
Note: Obtained by disabling kernel config options such as
CONFIG_BLK_CGROUP (blk-cgroups) and CONFIG_BLK_WBT (write-back
throttling)
sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 -r3
/dev/nvme0n1
submitter=0, tid=3584444, file=/dev/nvme0n1, nfiles=1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=4.99M, BW=2.44GiB/s, IOS/call=32/31
IOPS=5.02M, BW=2.45GiB/s, IOS/call=32/32
Exiting on timeout
Maximum IOPS=5.02M
[2]
sudo taskset -c 0 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n1 -r3
/dev/nvme0n1 /dev/nvme1n1
submitter=0, tid=3958383, file=/dev/nvme1n1, nfiles=2, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=6.19M, BW=3.02GiB/s, IOS/call=32/31
IOPS=6.18M, BW=3.02GiB/s, IOS/call=32/32
Exiting on timeout
Maximum IOPS=6.19M
[3]
Note: 0,1 CPUs are mapped to same core
sudo taskset -c 0,1 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
-r3 /dev/nvme0n1 /dev/nvme1n1
submitter=1, tid=3708980, file=/dev/nvme1n1, nfiles=1, node=-1
submitter=0, tid=3708979, file=/dev/nvme0n1, nfiles=1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=6.86M, BW=3.35GiB/s, IOS/call=32/31
IOPS=6.89M, BW=3.36GiB/s, IOS/call=32/31
Exiting on timeout
Maximum IOPS=6.89M
[4]
Note: 0,2 CPUs are mapped to different cores
sudo taskset -c 0,2 ./t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 -n2
-r3 /dev/nvme0n1 /dev/nvme1n1
submitter=0, tid=3588355, file=/dev/nvme0n1, nfiles=1, node=-1
submitter=1, tid=3588356, file=/dev/nvme1n1, nfiles=1, node=-1
polled=1, fixedbufs=1, register_files=1, buffered=0, QD=128
Engine=io_uring, sq_ring=128, cq_ring=128
IOPS=9.89M, BW=4.83GiB/s, IOS/call=31/31
IOPS=10.00M, BW=4.88GiB/s, IOS/call=31/31
Exiting on timeout
Maximum IOPS=10.00M
[5] https://lore.kernel.org/all/20220805162444.3985535-1-kbusch@xxxxxx/
[6]
https://lore.kernel.org/linux-block/20230429093925.133327-1-joshi.k@xxxxxxxxxxx/
[7]
https://lore.kernel.org/linux-nvme/20250122071600.GC10702@unreal/
https://lore.kernel.org/linux-nvme/cover.1738765879.git.leonro@xxxxxxxxxx/
[8]
https://lore.kernel.org/linux-block/886959.1737148612@xxxxxxxxxxxxxxxxxxxxxx/