Re: Question: t/io_uring performance

On 9/8/21 6:22 AM, Jens Axboe wrote:
> On 9/8/21 5:53 AM, Sitsofe Wheeler wrote:
>> (CC'ing Jens directly in case he missed the previous messages)
>>
>> On Mon, 6 Sept 2021 at 15:28, Hans-Peter Lehmann
>> <hans-peter.lehmann@xxxxxxx> wrote:
>>>
>>> Hi Jens,
>>>
>>> I'm not sure if you have read the emails in this thread, so I'm now addressing you directly. Both Erwan and I are unable to reproduce your single-threaded IOPS measurements - we don't even get close to your numbers. The bottleneck seems to be the CPU, not the SSDs. Did you use some special configuration for your benchmarks?
>>>
>>> Best regards
>>> Hans-Peter
>>>
>>> (I have also reproduced the behavior with an Intel processor now - the single-threaded throughput is also capped at around 580k IOPS, even though the SSDs can handle more than that when using multiple threads)
> 
> Thanks for CC'ing me, I don't always see the messages otherwise. 580K is
> very low, but without access to the system and being able to run some
> basic profiling, it's hard for me to say what you're running into. I may
> miss some details in the below, so please do ask follow-ups if things
> are missing/unclear.
> 
> 1) I'm using a 3970X with a desktop board + box for my peak testing,
>    specs on that can be found online.
> 
> 2) Yes, I do run a custom configuration on my kernel; I do kernel
>    development after all :-). I'm attaching the one I'm using. It
>    hasn't changed in a long time. I do turn off various things that I
>    don't need, and some of them do impact performance.
> 
> 3) The options I run t/io_uring with have been posted multiple times;
>    it's this:
> 
>    taskset -c 0  t/io_uring -b512 -d128 -c32 -s32 -p1 -F1 -B1 /dev/nvme3n1
> 
>    which is QD=128, 32/32 submit/complete batching, polled IO, and
>    registered files and buffers. Note that you'll need to configure
>    NVMe to properly use polling. I use 32 poll queues; the exact number
>    isn't that important for single-core testing, as long as there are
>    enough to have a poll queue local to the CPU being tested on. You'll
>    see this in dmesg:
> 
>    nvme nvme3: 64/0/32 default/read/poll queues
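> 
>    If you don't see any poll queues, the count comes from the nvme
>    module parameter, so something along these lines (any filename
>    under /etc/modprobe.d/ works), followed by a reboot or driver
>    reload:
> 
>    echo "options nvme poll_queues=32" > /etc/modprobe.d/nvme.conf
> 
>    or just boot with nvme.poll_queues=32 on the kernel command line.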
> 
> 4) Make sure your nvme device is using 'none' as the IO scheduler. I
>    think this is a no-brainer, but mentioning it just in case.
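> 
>    A quick way to check and, if needed, set it:
> 
>    cat /sys/block/nvme3n1/queue/scheduler
>    echo none > /sys/block/nvme3n1/queue/scheduler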
> 
> 5) I turn off iostats and merging for the device. iostats is the more
>    important of the two; depending on the platform, getting accurate
>    time stamps can be expensive:
> 
>    echo 0 > /sys/block/nvme3n1/queue/iostats
>    echo 2 > /sys/block/nvme3n1/queue/nomerges
> 
> 6) I do no special CPU frequency tuning. It's running stock settings,
>    and the system is not overclocked or anything like that.
> 
> I think that's about it. The above gets me 3.5M+ per core using polled
> IO and the current kernel, and around 2.3M per core using IRQ-driven
> IO. Note that the current kernel is important here; we've improved
> things a lot over the last year.
> 
> That said, 580K is crazy low, and I bet there's something basic that's
> preventing it from running faster. Is this a gen2 Optane? One thing
> that might be useful is to run my t/io_uring command from above; it'll
> tell you what the IO thread pid is:
> 
> [...]
> submitter=2900332
> [...]
> 
> and then run
> 
> # perf record -g -p 2900332 -- sleep 3
> 
> and afterwards do:
> 
> # perf report -g --no-children > output
> 
> then gzip the output and attach it here. With performance that low, it
> should be pretty trivial to figure out what is going on.

Followup - the above is specific to my peak-per-core testing; for
running at much lower IOPS, most of it isn't going to be required. For
example, polled IO is not going to be that useful at ~500K IOPS.

For the original poster - I think this was already asked, but please
run perf as indicated above and also do a run with two threads:

taskset -c 0,1 t/io_uring -b512 -d128 -c32 -s32 -p0 -F1 -B1 -n2 /dev/nvmeXn1 /dev/nvmeYn1

just to see what happens. t/io_uring doesn't work very well when
driving polled IO for multiple devices; it's just a simple little
IO generator, nothing advanced.

I picked CPUs 0 and 1 here, but depending on the number of queues on
your device, you might be more limited, and you should pick something
that gives a nice spread on your setup. /sys/kernel/debug/block/<dev>
has information on how the queues are spread out.
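
For example, assuming debugfs is mounted, something along these lines
shows each hardware queue's type and which CPUs map to it (the cpu_list
files under /sys/block/<dev>/mq/ give the mapping without debugfs):

grep . /sys/kernel/debug/block/nvmeXn1/hctx*/type
cat /sys/block/nvmeXn1/mq/*/cpu_list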

Again, not really something that should be needed at these kinds of
rates, unless the device is severely queue-starved and you have a lot of
cores in your system. Regardless, strictly affinitizing the workload
helps with variance between runs and is always a good idea.

-- 
Jens Axboe



