Re: IORING_OP_POLL_ADD slower than linux-aio IOCB_CMD_POLL

On 15/06/2022 14.30, Pavel Begunkov wrote:
On 6/15/22 12:04, Avi Kivity wrote:

On 15/06/2022 13.48, Pavel Begunkov wrote:
On 6/15/22 11:12, Avi Kivity wrote:

On 19/04/2022 20.14, Jens Axboe wrote:
On 4/19/22 9:21 AM, Jens Axboe wrote:
On 4/19/22 6:31 AM, Jens Axboe wrote:
On 4/19/22 6:21 AM, Avi Kivity wrote:
On 19/04/2022 15.04, Jens Axboe wrote:
On 4/19/22 5:57 AM, Avi Kivity wrote:
On 19/04/2022 14.38, Jens Axboe wrote:
On 4/19/22 5:07 AM, Avi Kivity wrote:
A simple webserver shows about 5% loss compared to linux-aio.


I expect the loss is due to an optimization that io_uring lacks -
inline completion vs workqueue completion:
I don't think that's it, io_uring never punts to a workqueue for
completions.
I measured this:



   Performance counter stats for 'system wide':

           1,273,756 io_uring:io_uring_task_add

        12.288597765 seconds time elapsed

Which exactly matches the number of requests sent. If that's the wrong counter to measure, I'm happy to try again with the correct
counter.
io_uring_task_add() isn't a workqueue, it's task_work. So that is
expected.
Might actually be implicated. Not because it's an async worker, but
because I think we might be losing some affinity in this case. Looking at traces, we're definitely bouncing between the poll completion side
and then executing the completion.

Can you try this hack? It's against -git + for-5.19/io_uring. If you let
me know what base you prefer, I can do a version against that. I see
about a 3% win with io_uring with this, and was slower before against
linux-aio as you saw as well.
Another thing to try - get rid of the IPI for TWA_SIGNAL, which I
believe may be the underlying cause of it.


Resurrecting an old thread. I have a question about timeliness of completions. Let's assume a request has completed. From the patch, it appears that io_uring will only guarantee that a completion appears on the completion ring if the thread has entered kernel mode since the completion happened. So user-space polling of the completion ring can cause unbounded delays.

Right, but polling the CQ is a bad pattern; io_uring_{wait,peek}_cqe/etc.
will do the polling-vs-syscalling dance for you.


Can you be more explicit?


I don't think peek is enough. If there is a cqe pending, it will return it, but will not cause completed-but-unqueued events to generate completions.


And wait won't enter the kernel if a cqe is pending, IIUC.

Right, usually it won't, but works if you eventually end up
waiting, e.g. by waiting for all expected cqes.


For larger audience, I'll remind that it's an opt-in feature


I don't understand - what is an opt-in feature?

The behaviour you worry about, where CQEs are not posted until
you do a syscall, only happens if you set IORING_SETUP_COOP_TASKRUN.


Ah! I wasn't aware of this new flag. This is exactly what I want - either ask for timely completions, or optimize for throughput.


Of course, it puts me in a dilemma because I want both, but that's my problem.


Thanks!




