Hi Jens, thank you for your reply. Since you have read the rest of the thread after your first reply, I think some of the questions from your first email are no longer relevant. I have still answered them at the bottom for completeness, but I will address the more interesting points first.
> I turn off iostats and merging for the device.
Doing this helped quite a bit: the 512b reads went from 715K to 800K IOPS, and the 4096b reads from 570K to 630K IOPS.
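For reference, I toggled both via sysfs, roughly like this (a sketch assuming nvme0n1; as far as I know, nomerges=2 disables merging completely):

# echo 0 > /sys/block/nvme0n1/queue/iostats     (disable per-I/O accounting)
# echo 2 > /sys/block/nvme0n1/queue/nomerges    (disable request merging)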
> Note that you'll need to configure NVMe to properly use polling. I use 32
> poll queues, number isn't really that important for single core testing,
> as long as there's enough to have a poll queue local to CPU being tested on.

My SSD was configured to use 128/0/0 default/read/poll queues. I added "nvme.poll_queues=32" to the kernel command line via GRUB and rebooted, which changed it to 96/0/32. I now get 1.0M IOPS (512b blocks) and 790K IOPS (4096b blocks) using a single core. Thank you very much, this was probably the main bottleneck.

Launching two instances of the benchmark with 512b blocks, I get 1.4M IOPS in total. Running a single-threaded t/io_uring against two SSDs still achieves "only" 1.0M IOPS, independent of the block size. In your benchmarks from 2019 [0], when Linux 5.4 (which I am using) was current, you achieved 1.6M IOPS (4096b blocks) using a single core. I only get the full 1.6M IOPS that saturate both SSDs (4096b blocks) when running t/io_uring with two threads. This makes me think that there is still another configuration option I am missing. Most of the time is spent in the kernel:

# time taskset -c 48 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 /dev/nvme1n1
i 8, argc 10
Added file /dev/nvme0n1 (submitter 0)
Added file /dev/nvme1n1 (submitter 0)
sq_ring ptr = 0x0x7f78fb740000
sqes ptr    = 0x0x7f78fb73e000
cq_ring ptr = 0x0x7f78fb73c000
polled=1, fixedbufs=1, register_files=1, buffered=0
QD=128, sq_ring=128, cq_ring=256
submitter=2336
IOPS=1014252, IOS/call=31/31, inflight=102 (38, 64)
IOPS=1017984, IOS/call=31/31, inflight=123 (64, 59)
IOPS=1018220, IOS/call=31/31, inflight=102 (38, 64)
[...]

real    0m7.898s
user    0m0.144s
sys     0m7.661s

I have attached a perf output to this email; it was generated using the same parameters as above (getting 1.0M IOPS).

Thank you very much for your help. I am looking forward to hearing from you again so that I can fully reproduce your measurements soon.

Hans-Peter

=== Answers to (I think) no longer relevant questions ===
> The options I run t/io_uring with have been posted multiple times, it's this one
This is the same configuration that I ran as well (I just did not explicitly specify the parameters that match the defaults).
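For completeness, my reading of the flags in the command above, cross-checked against the "polled=1, fixedbufs=1, register_files=1" line that t/io_uring prints (please correct me if I misread any of them):

# t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme0n1 /dev/nvme1n1
#   -b512   512-byte block size
#   -d128   queue depth of 128
#   -s32    submit in batches of 32 SQEs
#   -c32    reap completions in batches of 32
#   -p1     polled I/O
#   -F1     registered (fixed) files
#   -B1     registered (fixed) buffers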
> Make sure your nvme device is using 'none' as the IO scheduler.
The scheduler is set to 'none'.
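For reference, I check and set it via sysfs along these lines (a sketch for nvme0n1; the scheduler shown in brackets is the active one):

# cat /sys/block/nvme0n1/queue/scheduler
# echo none > /sys/block/nvme0n1/queue/scheduler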
> Is this a gen2 optane?
It is not an Optane disk, but I also do not expect to get insanely high numbers like in your recent benchmarks; just something closer to the old numbers, but using two SSDs.

=== References ===

[0]: https://twitter.com/axboe/status/1174777844313911296
Attachment:
perf-output.gz
Description: application/gzip