Here's v3 of the io_uring interface. Since the data structures etc have changed since the v1 posting, here's a refresher on what io_uring is and how it works.

io_uring is a submission queue (SQ) and completion queue (CQ) pair that an application can use to communicate with the kernel for doing IO. This isn't aio/libaio, but it provides a similar set of features, as well as some new ones:

- io_uring is a lot more efficient than aio. A lot, and in many ways.

- io_uring supports buffered aio. Not just that, but efficiently as
  well. Cached data isn't punted to an async context.

- io_uring supports polled IO, taking advantage of the blk-mq polling
  work that went into 5.0-rc.

- io_uring supports kernel side submissions for polled IO. This enables
  IO without ever having to do a system call.

- io_uring supports fixed buffers for O_DIRECT. Buffers can be
  registered after an io_uring context has been set up, which
  eliminates the need to do get_user_pages() / put_page() for each and
  every IO.

To use io_uring, you must first set up an io_uring context. This is done through the first of three new system calls:

io_uring_setup(entries, params)
	Sets up a context for doing async IO. On success, returns a file
	descriptor that the application can mmap to gain access to the
	SQ ring, CQ ring, and io_uring_sqes.

Once the rings are set up, the application then mmaps them to communicate with the kernel. See a sample application I wrote that natively does this:

http://git.kernel.dk/cgit/fio/plain/t/io_uring.c

IO is done by filling out an io_uring_sqe and updating the SQ ring.
The format of the sqe is as follows:

struct io_uring_sqe {
	__u8	opcode;		/* type of operation for this sqe */
	__u8	flags;		/* IOSQE_ flags */
	__u16	ioprio;		/* ioprio for the request */
	__s32	fd;		/* file descriptor to do IO on */
	__u64	off;		/* offset into file */
	union {
		void	*addr;	/* buffer or iovecs */
		__u64	__pad;
	};
	__u32	len;		/* buffer size or number of iovecs */
	union {
		__kernel_rwf_t	rw_flags;
		__u32		fsync_flags;
	};
	__u16	buf_index;	/* index into fixed buffers, if used */
	__u16	__pad2;
	__u32	__pad3;
	__u64	user_data;	/* data to be passed back at completion time */
};

Most of this is self explanatory. The ->user_data field is passed back through a completion event, so the application can track IOs individually.

Completions are posted on the CQ ring when an sqe completes. They are a struct io_uring_cqe, and the format is as follows:

struct io_uring_cqe {
	__u64	user_data;	/* sqe->user_data submission passed back */
	__s32	res;		/* result code for this event */
	__u32	flags;
};

To either submit IO or reap completions, there's a 2nd new system call:

io_uring_enter(fd, to_submit, min_complete, flags)
	Initiates IO against the rings mapped to this fd, or waits for
	them to complete, or both. The behavior is controlled by the
	parameters passed in. If 'to_submit' is non-zero, then we'll try
	and submit new IO. If IORING_ENTER_GETEVENTS is set, the kernel
	will wait for 'min_complete' events, if they aren't already
	available.

The sample application mentioned above uses the rings directly, but for most use cases, I intend to have the necessary support in a liburing library that abstracts it enough for applications to use in a performant way, without having to deal with the intricacies of the ring. There's already some basic support there and a few test applications, but that side definitely needs some work. Find that repo here:

git://git.kernel.dk/liburing

io_uring is designed to be fast and scalable.
I've demonstrated 1.6M 4k IOPS from a single core on my aging test box, and on the latency front we're also doing extremely well. It's designed to be both async and batching, if you wish; the application gets to control how to use that side.

If you want to play with io_uring, see the sample app above, the liburing repo, or the fio io_uring engine. Patches are against 5.0-rc1 (ish), and can also be found in my 'io_uring' git branch:

git://git.kernel.dk/linux-block io_uring

Since v2:

- Separate fixed buffers from sqe entries; register/unregister them
  through the new io_uring_register(2) system call
- sqe->index is now sqe->buf_index to make it clearer
- Fixed buffers require sqe->flags to have IOSQE_FIXED_BUFFER set
- Add sqe field that is passed back at completion through the cqe,
  instead of passing back the original sqe index. This is more useful,
  as it allows data tied to the life of each IO; ->index did not.
- Cleanup async IO punting
- Don't punt O_DIRECT writes to async handling
- Make sq thread just for polling (submissions and completions)
- Always enable sq workqueue for async offload
- Use GFP_ATOMIC for req allocation
- Fix bio_vec being an unknown type on some kconfigs
- New IORING_OP_FSYNC implementation
- Add fixed fileset support through io_uring_register(2)
- Integrate workqueue support into main patchset
- Fix io_sq_thread() logic for when to grab current->mm
- Fix io_sq_thread() off-by-one
- Improve polling performance for multiple files in an io_uring context
- Have CONFIG_IO_URING select ANON_INODES
- Don't make io_kiocb->ki_flags atomic
- Be fully consistent in naming. For some reason we had the same mess
  that aio.c does, where io_kiocb, kiocb, and iocb are used
  interchangeably. 'req' is now always io_kiocb, 'kiocb' is always
  kiocb.
- Rename the KIOCB_F_* flags, as they are req flags: REQ_F_*
 Documentation/filesystems/vfs.txt      |    3 +
 arch/x86/entry/syscalls/syscall_64.tbl |    3 +
 block/bio.c                            |   59 +-
 fs/Makefile                            |    1 +
 fs/block_dev.c                         |   19 +-
 fs/file.c                              |   15 +-
 fs/file_table.c                        |    9 +-
 fs/gfs2/file.c                         |    2 +
 fs/io_uring.c                          | 2023 ++++++++++++++++++++++++
 fs/iomap.c                             |   48 +-
 fs/xfs/xfs_file.c                      |    1 +
 include/linux/bio.h                    |   14 +
 include/linux/blk_types.h              |    1 +
 include/linux/file.h                   |    2 +
 include/linux/fs.h                     |    6 +-
 include/linux/iomap.h                  |    1 +
 include/linux/syscalls.h               |    7 +
 include/uapi/linux/io_uring.h          |  147 ++
 init/Kconfig                           |    9 +
 kernel/sys_ni.c                        |    3 +
 20 files changed, 2334 insertions(+), 39 deletions(-)

-- 
Jens Axboe