On 19/04/2022 14.38, Jens Axboe wrote:
> On 4/19/22 5:07 AM, Avi Kivity wrote:
>> A simple webserver shows about 5% loss compared to linux-aio.
>>
>> I expect the loss is due to an optimization that io_uring lacks -
>> inline completion vs workqueue completion:
>
> I don't think that's it, io_uring never punts to a workqueue for
> completions.
I measured this:
Performance counter stats for 'system wide':
1,273,756 io_uring:io_uring_task_add
12.288597765 seconds time elapsed
Which exactly matches the number of requests sent. If that's the wrong
counter to measure, I'm happy to try again with the correct one.
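(If it helps reproduce the measurement: a system-wide count of that
tracepoint can be gathered with something along these lines, run while
wrk is driving load; the exact duration only needs to cover the run:

perf stat -a -e io_uring:io_uring_task_add -- sleep 12
)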
> The aio inline completion is more of a hack because it
> needs to do that, as always using a workqueue would lead to bad
> performance and higher overhead.
>
> So if there's a difference in performance, it's something else and we
> need to look at that. But your report is pretty lacking! What kernel are
> you running?
5.17.2-300.fc36.x86_64
> Do you have a test case of sorts?
Seastar's httpd, running on a single core, against wrk -c 1000 -t 4
http://localhost:10000/.
Instructions:
git clone --recursive -b io_uring https://github.com/avikivity/seastar
cd seastar
sudo ./install-dependencies.sh # after carefully verifying it, of course
./configure.py --mode release
ninja -C build/release apps/httpd/httpd
./build/release/apps/httpd/httpd --smp 1 [--reactor-backend
io_uring|linux-aio|epoll]
and run wrk against it.
> For a performance-oriented network setup, I'd normally not consider data
> readiness poll replacements to be that interesting; my recommendation
> would be to use async send/recv for that instead. That's how io_uring is
> supposed to be used, in a completion-based model.
That's true. Still, an existing system that evolved around poll will
take some time and effort to migrate, and having a slower IORING_OP_POLL
means it cannot benefit from io_uring's many other advantages if it
fears a regression from that difference.
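
(For readers less familiar with the completion model Jens describes, an
async receive with liburing looks roughly like the sketch below. It is
illustrative only, not Seastar code; the ring and the connected socket
are assumed to be set up elsewhere.)

#include <liburing.h>

/* Sketch: one completion-based receive. The completion itself carries
 * the data, so there is no separate readiness poll followed by read(). */
static int recv_once(struct io_uring *ring, int conn_fd, char *buf, size_t len)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, conn_fd, buf, len, 0);
    io_uring_submit(ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    int nread = cqe->res;            /* bytes received, or -errno */
    io_uring_cqe_seen(ring, cqe);
    return nread;
}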
Note that it's not just a matter of converting poll+recvmsg to
IORING_OP_RECVMSG. If you support many connections, you must migrate to
internal buffer selection, otherwise each idle connection pins a receive
buffer and the memory footprint with a large number of idle connections
becomes high. The end result is wonderful, but the road there is long.
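
(To make the buffer-selection point concrete, a rough liburing sketch of
provided buffers follows. The group id, buffer count and sizes are
arbitrary and error handling is omitted; it illustrates the mechanism,
not the actual Seastar code.)

#include <liburing.h>

#define BUF_GROUP 0
#define NR_BUFS   1024
#define BUF_SIZE  4096

static char pool[NR_BUFS * BUF_SIZE];

/* Hand NR_BUFS buffers of BUF_SIZE bytes to the kernel under BUF_GROUP. */
static void provide_buffers(struct io_uring *ring)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_provide_buffers(sqe, pool, BUF_SIZE, NR_BUFS, BUF_GROUP, 0);
    io_uring_submit(ring);
}

/* Queue a recv with no dedicated buffer; the kernel picks one from the
 * group only when data actually arrives, so idle connections don't each
 * pin a per-connection receive buffer. */
static void queue_recv(struct io_uring *ring, int conn_fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, conn_fd, NULL, BUF_SIZE, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = BUF_GROUP;
    io_uring_submit(ring);
    /* On completion, the chosen buffer id is cqe->flags >> IORING_CQE_BUFFER_SHIFT,
     * valid when IORING_CQE_F_BUFFER is set. */
}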