Up until now, IO polling has been exclusively available through preadv2 and pwritev2, both fully synchronous interfaces. This works fine for completely synchronous use cases, but that's about it. If QD=1 wasn't enough to reach the performance goals, the only alternative was to increase the thread count. Unfortunately, that isn't very efficient, both in terms of CPU utilization (each thread will use 100% of CPU time) and in terms of achievable performance.

With all of the recent advances in polling (non-irq polling, efficiency gains, multiple pollable queues, etc), it's now feasible to add polling support to aio - this patchset does just that. An iocb flag is added, IOCB_FLAG_HIPRI, similar to how we have RWF_HIPRI for preadv2/pwritev2. It's applicable to the commands that read/write data, like IOCB_CMD_PREAD/IOCB_CMD_PWRITE and the vectored variants. Submission works the same as before. The polling happens off io_getevents(), when the application is looking for completions. That also works like before, with the only difference being that events aren't waited for, they are actively found by polling on the device side.

The only real difference in terms of completions is that polling does NOT use the user-exposed libaio ring. That just isn't feasible, as the application needs to be the one that actively polls for the events. Because of this, the ring isn't supported with polling, and the internals completely ignore it.

Outside of that, it's illegal to mix polled with non-polled IO on the same io_context. There's no way to set up an io_context with the information that we will be polling on it (always add flags to new syscalls...), hence we need to track this internally. For polled IO, we can never wait for events, we have to actively find them. I didn't want to add counters to the io_context to inc/dec for each IO, so I just made this illegal.
If an application attempts to submit both polled and non-polled IO on the same io_context, it will get an -EINVAL return at io_submit() time.

Performance results have been very promising. On an internal Facebook flash storage device, we're seeing a 20% increase in performance, with an identical reduction in latencies. Note that this compares an already highly tuned setup to the same setup with polling simply turned on; I'm sure there's still extra room for performance there. Notably, at these speeds and feeds, polling ends up NOT using more CPU time than we did without polling!

On that same box, I ran microbenchmarks, and was able to increase peak performance by 25%. The box was previously pegged at around 2.4M IOPS; with polling turned on, bandwidth maxed out at 12.5GB/sec doing 3.2M IOPS. All of this with 2 million FEWER interrupts/second, and 2M+ fewer context switches. In terms of efficiency, a tester was able to get 800K+ IOPS out of a _single_ thread at QD=16 on a device. These kinds of results are unheard of in terms of efficiency.

You can find this code in my aio-poll branch, and that branch (and these patches) are on top of my mq-perf branch.

 fs/aio.c                     | 495 ++++++++++++++++++++++++++++++++---
 fs/block_dev.c               |   2 +
 fs/direct-io.c               |   4 +-
 fs/iomap.c                   |   7 +-
 include/linux/fs.h           |   1 +
 include/uapi/linux/aio_abi.h |   2 +
 6 files changed, 478 insertions(+), 33 deletions(-)

-- 
Jens Axboe