Up until now, IO polling has been exclusively available through preadv2 and pwritev2, both fully synchronous interfaces. This works fine for completely synchronous use cases, but that's about it. If QD=1 wasn't enough to reach the performance goals, the only alternative was to increase the thread count. Unfortunately, that isn't very efficient, both in terms of CPU utilization (each thread will use 100% of CPU time) and in terms of achievable performance.

With all of the recent advances in polling (non-irq polling, efficiency gains, multiple pollable queues, etc), it's now feasible to add polling support to aio - this patchset does just that. An iocb flag is added, IOCB_FLAG_HIPRI, similar to how we have RWF_HIPRI for preadv2/pwritev2. It's applicable to the commands that read/write data, like IOCB_CMD_PREAD/IOCB_CMD_PWRITE and the vectored variants. Submission works the same as before. The polling happens off io_getevents(), when the application is looking for completions. That also works like before, with the only difference being that events aren't waited for, they are actively found by polling on the device side.

The only real difference in terms of completions is that polling does NOT use the user-exposed libaio ring. That just isn't feasible, as the application needs to be the one that actively polls for the events. Because of this, the ring isn't supported with polling, and the internals completely ignore it.

Outside of that, it's illegal to mix polled with non-polled IO on the same io_context. There's no way to set up an io_context with the information that we will be polling on it (always add flags to new syscalls...), hence we need to track this internally. For polled IO, we can never wait for events, we have to actively find them. I didn't want to add counters to the io_context to inc/dec for each IO, so I just made this illegal.
If an application attempts to submit both polled and non-polled IO on the same io_context, it will get an -EINVAL return at io_submit() time.

Performance results have been very promising. On an internal Facebook flash storage device, we're seeing a 20% increase in performance, with an identical reduction in latencies. Note that this compares an already highly tuned setup to the same setup with polling simply turned on; I'm sure there's still extra room for performance there. Notably, at these speeds and feeds, polling ends up NOT using more CPU time than we did without polling!

On that same box, I ran microbenchmarks, and was able to increase peak performance by 25%. The box was previously pegged at around 2.4M IOPS; with polling turned on, bandwidth maxed out at 12.5GB/sec doing 3.2M IOPS. All of this with 2 million FEWER interrupts/second, and 2M+ fewer context switches. In terms of efficiency, a tester was able to get 800K+ IOPS out of a _single_ thread at QD=16 on a device. These kinds of results are unheard of in terms of efficiency.

You can find this code in my aio-poll branch, and that branch (and these patches) are on top of my mq-perf branch.

 fs/aio.c                     | 495 ++++++++++++++++++++++++++++++++---
 fs/block_dev.c               |   2 +
 fs/direct-io.c               |   4 +-
 fs/iomap.c                   |   7 +-
 include/linux/fs.h           |   1 +
 include/uapi/linux/aio_abi.h |   2 +
 6 files changed, 478 insertions(+), 33 deletions(-)

-- 
Jens Axboe