On 9/8/21 5:53 AM, Sitsofe Wheeler wrote:
> (CC'ing Jens directly in case he missed the previous messages)
>
> On Mon, 6 Sept 2021 at 15:28, Hans-Peter Lehmann
> <hans-peter.lehmann@xxxxxxx> wrote:
>>
>> Hi Jens,
>>
>> I'm not sure whether you have read the emails in this thread, so I'm now trying to address you directly. Both Erwan and I are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottleneck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?
>>
>> Best regards
>> Hans-Peter
>>
>> (I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)

Thanks for CC'ing me, I don't always see the messages otherwise. 580K is very low, but without having access to the system and being able to run some basic profiling, it's hard for me to say what you're running into. I may miss some details in the points below, so please do ask follow-ups if anything is missing or unclear.

1) I'm using a 3970X with a desktop board + box for my peak testing; the specs on that can be found online.

2) Yes, I do run a custom configuration on my kernel - I do kernel development, after all :-). I'm attaching the one I'm using. It hasn't changed in a long time. I do turn off various things that I don't need, and some of them do impact performance.

3) The options I run t/io_uring with have been posted multiple times, it's this one:

   taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme3n1

   which is QD=128, 32/32 submit/complete batching, polled IO, registered files and buffers. Note that you'll need to configure NVMe to properly use polling. I use 32 poll queues; the number isn't really that important for single-core testing, as long as there are enough to have a poll queue local to the CPU being tested on. You'll see this in dmesg:

   nvme nvme3: 64/0/32 default/read/poll queues

4) Make sure your nvme device is using 'none' as the IO scheduler. I think this is a no-brainer, but I'm mentioning it just in case.

5) I turn off iostats and merging for the device. iostats is the most important; depending on the platform, getting accurate time stamps can be expensive:

   echo 0 > /sys/block/nvme3n1/queue/iostats
   echo 2 > /sys/block/nvme3n1/queue/nomerges

6) I do no special CPU frequency tuning. It's running stock settings, and the system is not overclocked or anything like that.

I think that's about it. The above gets me 3.5M+ IOPS per core using polled IO and the current kernel, and around 2.3M per core if using IRQ driven IO. Note that the current kernel is important here; we've improved things a lot over the last year.

That said, 580K is crazy low, and I bet there's something basic that's preventing it from running faster. Is this a gen2 Optane? One thing that might be useful is to run my t/io_uring command from above; it'll tell you what the IO thread pid is:

[...]
submitter=2900332
[...]

and then run

# perf record -g -p 2900332 -- sleep 3

and afterwards do:

# perf report -g --no-children > output

and gzip the output and attach it here. With performance that low, it should be pretty trivial to figure out what is going on here.

--
Jens Axboe
Attachment: amd-config.txt.gz (application/gzip)
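
A note on the polling setup mentioned in point 3 above: the mail shows the expected dmesg output, but not how the poll queues are created. They come from the nvme driver's poll_queues parameter. A minimal sketch of how that is typically configured, assuming the nvme driver is built as a module and the device is /dev/nvme3n1 as in the example (adjust both for your system):

   # allocate 32 dedicated poll queues when the nvme module loads
   echo "options nvme poll_queues=32" > /etc/modprobe.d/nvme-poll.conf

   # or, with a built-in driver, boot with nvme.poll_queues=32 on the kernel command line

   # after a module reload or reboot, confirm the default/read/poll split
   dmesg | grep -i "poll queues"

   # a value of 1 here means the block device can be polled
   cat /sys/block/nvme3n1/queue/io_poll

Without poll queues configured, the -p1 (polled IO) option generally won't work as intended, so this is worth verifying before comparing numbers.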
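
Similarly, points 4 and 5 only describe the scheduler setting in prose; a small sketch of the corresponding sysfs knobs, again assuming nvme3n1 (these settings do not persist across reboots):

   # the active scheduler is shown in brackets; switch it to none
   cat /sys/block/nvme3n1/queue/scheduler
   echo none > /sys/block/nvme3n1/queue/scheduler

   # then disable iostats and request merging as in point 5
   echo 0 > /sys/block/nvme3n1/queue/iostats
   echo 2 > /sys/block/nvme3n1/queue/nomerges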