Hi,

While doing zero-copy tx/rx testing with io_uring, I noticed that a LOT of time is being spent copying in data structures related to waiting on events. While the structures aren't big (24 bytes and 16 bytes), it's a user copy, with a dependent user copy on top of that - first copy in the main struct io_uring_getevents_arg, and then copy in the timeout from that. Patch 3 has details on the numbers.

I then toyed around with the idea of having registered regions of wait data. These can be set up by the application and then seen directly by the kernel. On top of that, the timeout itself is embedded in this region, rather than being supplied as an out-of-band pointer.

Patch 1 is just a generic cleanup, and patch 2 improves the copying a bit. Both of these can stand alone.

Patch 3 implements the meat of this: it adds IORING_REGISTER_CQWAIT_REG, allowing an application to register a number of struct io_uring_reg_wait regions that it can index for wait scenarios. The kernel always registers a full page, so on 4k page size archs, 64 regions are available by default. Then, rather than passing in a pointer to a struct io_uring_getevents_arg, an index is passed instead, telling the kernel which registered wait region should be used for this wait. This basically removes all of the copying seen, which was anywhere from 3.5-4.5% of the time when doing high frequency operations where the number of wait invocations can be quite high.

The patches sit on top of the io_uring-ring-resize branch, as both end up adding register opcodes. Kernel branch here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-reg-wait

and liburing branch (pretty basic right now) here:

https://git.kernel.dk/cgit/liburing/log/?h=reg-wait

 include/linux/io_uring_types.h |   7 +++
 include/uapi/linux/io_uring.h  |  18 ++++++
 io_uring/io_uring.c            | 102 ++++++++++++++++++++++++++-------
 io_uring/register.c            |  48 ++++++++++++++++
 4 files changed, 153 insertions(+), 22 deletions(-)

-- 
Jens Axboe