On 9/8/21 5:53 AM, Sitsofe Wheeler wrote:
> (CC'ing Jens directly in case he missed the previous messages)
>
> On Mon, 6 Sept 2021 at 15:28, Hans-Peter Lehmann
> <hans-peter.lehmann@xxxxxxx> wrote:
>>
>> Hi Jens,
>>
>> I'm not sure whether you have read the emails in this thread, so I'm now trying to address you directly. Both Erwan and I are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottleneck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?
>>
>> Best regards
>> Hans-Peter
>>
>> (I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)

Thanks for CC'ing me, I don't always see the messages otherwise. 580K is very low, but without having access to the system and being able to run some basic profiling, it's hard for me to say what you're running into. I may miss some details in the points below, so please do ask follow-ups if anything is missing or unclear.

1) I'm using a 3970X with a desktop board + box for my peak testing; the specs on that can be found online.

2) Yes, I do run a custom configuration on my kernel - I do kernel development, after all :-). I'm attaching the one I'm using. It hasn't changed in a long time. I do turn off various things that I don't need, and some of them do impact performance.

3) The options I run t/io_uring with have been posted multiple times, it's this one:

   taskset -c 0 t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme3n1

   which is QD=128, 32/32 submit/complete batching, polled IO, registered files and buffers. Note that you'll need to configure NVMe to properly use polling. I use 32 poll queues; the number isn't really that important for single-core testing, as long as there are enough to have a poll queue local to the CPU being tested on. You'll see this in dmesg:

   nvme nvme3: 64/0/32 default/read/poll queues

4) Make sure your nvme device is using 'none' as the IO scheduler. I think this is a no-brainer, but I'm mentioning it just in case.

5) I turn off iostats and merging for the device. iostats is the most important; depending on the platform, getting accurate time stamps can be expensive:

   echo 0 > /sys/block/nvme3n1/queue/iostats
   echo 2 > /sys/block/nvme3n1/queue/nomerges

6) I do no special CPU frequency tuning. It's running stock settings, and the system is not overclocked or anything like that.

I think that's about it. The above gets me 3.5M+ IOPS per core using polled IO and the current kernel, and around 2.3M per core if using IRQ driven IO. Note that the current kernel is important here; we've improved things a lot over the last year.

That said, 580K is crazy low, and I bet there's something basic that's preventing it from running faster. Is this a gen2 Optane? One thing that might be useful is to run my t/io_uring command from above; it'll tell you what the IO thread pid is:

[...]
submitter=2900332
[...]

and then run

# perf record -g -p 2900332 -- sleep 3

and afterwards do:

# perf report -g --no-children > output

and gzip the output and attach it here. With performance that low, it should be pretty trivial to figure out what is going on here.

--
Jens Axboe
Attachment: amd-config.txt.gz (application/gzip)
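
A note on the polling setup mentioned in point 3 above: the mail shows the expected dmesg output, but not how the poll queues are created. They come from the nvme driver's poll_queues parameter. A minimal sketch of how that is typically configured, assuming the nvme driver is built as a module and the device is /dev/nvme3n1 as in the example (adjust both for your system):

   # allocate 32 dedicated poll queues when the nvme module loads
   echo "options nvme poll_queues=32" > /etc/modprobe.d/nvme-poll.conf

   # or, with a built-in driver, boot with nvme.poll_queues=32 on the kernel command line

   # after a module reload or reboot, confirm the default/read/poll split
   dmesg | grep -i "poll queues"

   # a value of 1 here means the block device can be polled
   cat /sys/block/nvme3n1/queue/io_poll

Without poll queues configured, the -p1 (polled IO) option generally won't work as intended, so this is worth verifying before comparing numbers.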
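
Similarly, points 4 and 5 only describe the scheduler setting in prose; a small sketch of the corresponding sysfs knobs, again assuming nvme3n1 (these settings do not persist across reboots):

   # the active scheduler is shown in brackets; switch it to none
   cat /sys/block/nvme3n1/queue/scheduler
   echo none > /sys/block/nvme3n1/queue/scheduler

   # then disable iostats and request merging as in point 5
   echo 0 > /sys/block/nvme3n1/queue/iostats
   echo 2 > /sys/block/nvme3n1/queue/nomerges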