Hi Linus,

Lots of cleanups in here, hardening the code and/or making it easier to
read, and fixing bugs. There's a core feature/change too: adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for the
fast path. In detail:

- Cleanup how memory accounting is done on ring setup/free (Bijan)

- sq array offset calculation fixup (Dmitry)

- Consistently handle blocking off O_DIRECT submission path (me)

- Support proper async buffered reads, instead of relying on kthread
  offload for that. This uses the page waitqueue to drive retries from
  task_work, like we handle poll based retry. (me)

- IO completion optimizations (me)

- Fix race with accounting and ring fd install (me)

- Support EPOLLEXCLUSIVE (Jiufei)

- Get rid of the io_kiocb unionizing, made possible by shrinking other
  bits (Pavel)

- Completion side cleanups (Pavel)

- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)

- Request environment grabbing cleanups (Pavel)

- File and socket read/write cleanups (Pavel)

- Improve kiocb_set_rw_flags() (Pavel)

- Tons of fixes and cleanups (Pavel)

- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)

This will throw a few merge conflicts. One is due to the IOCB_NOIO
addition that happened late in 5.8-rc, the other is due to a change in
for-5.9/block. Both are trivial to fix up; I'm attaching the merge
resolution from when I pulled it in locally.

Please pull!

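For anyone not following the async buffered read work, here is a rough userspace model of the retry flow described in the list above: try the read without blocking, hook the page if it isn't ready, and re-issue from task_work once it is. This is a sketch only and not the kernel code. toy_page, toy_req, toy_read(), toy_page_ready(), toy_run_task_work() and toy_demo() are all made-up names, and a single waiter pointer stands in for the page waitqueue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: every name here is hypothetical, none of it is kernel code. */

struct toy_req;

struct toy_page {
	bool uptodate;          /* data is in the page cache and valid */
	struct toy_req *waiter; /* stand-in for an entry on the page waitqueue */
};

struct toy_req {
	struct toy_page *page;
	bool task_work_pending; /* a retry has been queued to the submitting task */
	bool done;
	int issues;             /* how many times the read was attempted */
};

/*
 * Attempt the buffered read without blocking. If the page is not ready,
 * arm a waiter on it and return false instead of punting to a thread.
 */
static bool toy_read(struct toy_req *req)
{
	req->issues++;
	if (req->page->uptodate) {
		req->done = true;
		return true;
	}
	req->page->waiter = req;
	return false;
}

/* IO finished: wake the waiter, which only queues task_work for a retry. */
static void toy_page_ready(struct toy_page *page)
{
	struct toy_req *req = page->waiter;

	page->uptodate = true;
	page->waiter = NULL;
	if (req)
		req->task_work_pending = true;
}

/* Runs in the context of the original task: re-issue the read if queued. */
static bool toy_run_task_work(struct toy_req *req)
{
	if (!req->task_work_pending)
		return false;
	req->task_work_pending = false;
	return toy_read(req);
}

/* Walk one request through the whole flow. */
static void toy_demo(void)
{
	struct toy_page page = { .uptodate = false, .waiter = NULL };
	struct toy_req req = { .page = &page };

	assert(!toy_read(&req));          /* first attempt would block */
	assert(!toy_run_task_work(&req)); /* nothing queued yet */
	toy_page_ready(&page);            /* the IO for the page completes */
	assert(toy_run_task_work(&req));  /* retry succeeds */
	assert(req.done && req.issues == 2);
}
```

The point of the model is that no helper thread shows up anywhere: the request is issued twice, both times from the submitting task's context, which is what lets the fast path stop depending on kthread offload.
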
The following changes since commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5:

  io_uring: fix lockup in io_fail_links() (2020-07-24 12:51:33 -0600)

are available in the Git repository at:

  git://git.kernel.dk/linux-block.git tags/for-5.9/io_uring-20200802

for you to fetch changes up to fa15bafb71fd7a4d6018dae87cfaf890fd4ab47f:

  io_uring: flip if handling after io_setup_async_rw (2020-08-01 11:02:57 -0600)

----------------------------------------------------------------
for-5.9/io_uring-20200802

----------------------------------------------------------------
Bijan Mottahedeh (4):
      io_uring: add wrappers for memory accounting
      io_uring: rename ctx->account_mem field
      io_uring: report pinned memory usage
      io_uring: separate reporting of ring pages from registered pages

Dan Carpenter (1):
      io_uring: fix a use after free in io_async_task_func()

Dmitry Vyukov (1):
      io_uring: fix sq array offset calculation

Jens Axboe (31):
      block: provide plug based way of signaling forced no-wait semantics
      io_uring: always plug for any number of IOs
      io_uring: catch -EIO from buffered issue request failure
      io_uring: re-issue block requests that failed because of resources
      mm: allow read-ahead with IOCB_NOWAIT set
      mm: abstract out wake_page_match() from wake_page_function()
      mm: add support for async page locking
      mm: support async buffered reads in generic_file_buffered_read()
      fs: add FMODE_BUF_RASYNC
      block: flag block devices as supporting IOCB_WAITQ
      xfs: flag files as supporting buffered async reads
      btrfs: flag files as supporting buffered async reads
      mm: add kiocb_wait_page_queue_init() helper
      io_uring: support true async buffered reads, if file provides it
      Merge branch 'async-buffered.8' into for-5.9/io_uring
      io_uring: provide generic io_req_complete() helper
      io_uring: add 'io_comp_state' to struct io_submit_state
      io_uring: pass down completion state on the issue side
      io_uring: pass in completion state to appropriate issue side handlers
      io_uring: enable READ/WRITE to use deferred completions
      io_uring: use task_work for links if possible
      Merge branch 'io_uring-5.8' into for-5.9/io_uring
      io_uring: clean up io_kill_linked_timeout() locking
      Merge branch 'io_uring-5.8' into for-5.9/io_uring
      io_uring: abstract out task work running
      io_uring: use new io_req_task_work_add() helper throughout
      io_uring: only call kfree() for a non-zero pointer
      io_uring: get rid of __req_need_defer()
      io_uring: remove dead 'ctx' argument and move forward declaration
      Merge branch 'io_uring-5.8' into for-5.9/io_uring
      io_uring: don't touch 'ctx' after installing file descriptor

Jiufei Xue (2):
      io_uring: change the poll type to be 32-bits
      io_uring: use EPOLLEXCLUSIVE flag to aoid thundering herd type behavior

Pavel Begunkov (90):
      io_uring: remove setting REQ_F_MUST_PUNT in rw
      io_uring: remove REQ_F_MUST_PUNT
      io_uring: set @poll->file after @poll init
      io_uring: kill NULL checks for submit state
      io_uring: fix NULL-mm for linked reqs
      io-wq: compact io-wq flags numbers
      io-wq: return next work from ->do_work() directly
      io_uring: fix req->work corruption
      io_uring: fix punting req w/o grabbed env
      io_uring: fix feeding io-wq with uninit reqs
      io_uring: don't mark link's head for_async
      io_uring: fix missing io_grab_files()
      io_uring: fix refs underflow in io_iopoll_queue()
      io_uring: remove inflight batching in free_many()
      io_uring: dismantle req early and remove need_iter
      io_uring: batch-free linked requests as well
      io_uring: cosmetic changes for batch free
      io_uring: kill REQ_F_LINK_NEXT
      io_uring: clean up req->result setting by rw
      io_uring: do task_work_run() during iopoll
      io_uring: fix iopoll -EAGAIN handling
      io_uring: fix missing wake_up io_rw_reissue()
      io_uring: deduplicate freeing linked timeouts
      io_uring: replace find_next() out param with ret
      io_uring: kill REQ_F_TIMEOUT
      io_uring: kill REQ_F_TIMEOUT_NOSEQ
      io_uring: fix potential use after free on fallback request free
      io_uring: don't pass def into io_req_work_grab_env
      io_uring: do init work in grab_env()
      io_uring: factor out grab_env() from defer_prep()
      io_uring: do grab_env() just before punting
      io_uring: don't fail iopoll requeue without ->mm
      io_uring: fix NULL mm in io_poll_task_func()
      io_uring: simplify io_async_task_func()
      io_uring: optimise io_req_find_next() fast check
      io_uring: fix missing ->mm on exit
      io_uring: fix mis-refcounting linked timeouts
      io_uring: keep queue_sqe()'s fail path separately
      io_uring: fix lost cqe->flags
      io_uring: don't delay iopoll'ed req completion
      io_uring: fix stopping iopoll'ing too early
      io_uring: briefly loose locks while reaping events
      io_uring: partially inline io_iopoll_getevents()
      io_uring: remove nr_events arg from iopoll_check()
      io_uring: don't burn CPU for iopoll on exit
      io_uring: rename sr->msg into umsg
      io_uring: use more specific type in rcv/snd msg cp
      io_uring: extract io_sendmsg_copy_hdr()
      io_uring: replace rw->task_work with rq->task_work
      io_uring: simplify io_req_map_rw()
      io_uring: add a helper for async rw iovec prep
      io_uring: follow **iovec idiom in io_import_iovec
      io_uring: share completion list w/ per-op space
      io_uring: rename ctx->poll into ctx->iopoll
      io_uring: use inflight_entry list for iopoll'ing
      io_uring: use completion list for CQ overflow
      io_uring: add req->timeout.list
      io_uring: remove init for unused list
      io_uring: use non-intrusive list for defer
      io_uring: remove sequence from io_kiocb
      io_uring: place cflags into completion data
      io_uring: inline io_req_work_grab_env()
      io_uring: remove empty cleanup of OP_OPEN* reqs
      io_uring: alloc ->io in io_req_defer_prep()
      io_uring/io-wq: move RLIMIT_FSIZE to io-wq
      io_uring: simplify file ref tracking in submission state
      io_uring: indent left {send,recv}[msg]()
      io_uring: remove extra checks in send/recv
      io_uring: don't forget cflags in io_recv()
      io_uring: free selected-bufs if error'ed
      io_uring: move BUFFER_SELECT check into *recv[msg]
      io_uring: extract io_put_kbuf() helper
      io_uring: don't open-code recv kbuf managment
      io_uring: don't miscount pinned memory
      io_uring: return locked and pinned page accounting
      tasks: add put_task_struct_many()
      io_uring: batch put_task_struct()
      io_uring: don't do opcode prep twice
      io_uring: deduplicate io_grab_files() calls
      io_uring: mark ->work uninitialised after cleanup
      io_uring: fix missing io_queue_linked_timeout()
      io-wq: update hash bits
      io_uring: de-unionise io_kiocb
      io_uring: deduplicate __io_complete_rw()
      io_uring: fix racy overflow count reporting
      io_uring: fix stalled deferred requests
      io_uring: consolidate *_check_overflow accounting
      io_uring: get rid of atomic FAA for cq_timeouts
      fs: optimise kiocb_set_rw_flags()
      io_uring: flip if handling after io_setup_async_rw

Randy Dunlap (1):
      io_uring: fix function args for !CONFIG_NET

Xiaoguang Wang (1):
      io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works

 block/blk-core.c              |    6 +
 fs/block_dev.c                |    2 +-
 fs/btrfs/file.c               |    2 +-
 fs/io-wq.c                    |   14 +-
 fs/io-wq.h                    |   11 +-
 fs/io_uring.c                 | 2588 +++++++++++++++++++++++------------------
 fs/xfs/xfs_file.c             |    2 +-
 include/linux/blkdev.h        |    1 +
 include/linux/fs.h            |   26 +-
 include/linux/pagemap.h       |   75 ++
 include/linux/sched/task.h    |    6 +
 include/uapi/linux/io_uring.h |    4 +-
 mm/filemap.c                  |  110 +-
 tools/io_uring/liburing.h     |    6 +-
 14 files changed, 1658 insertions(+), 1195 deletions(-)

-- 
Jens Axboe

commit 32a5169a5562db6a09a2d85164e0079913ecc227
Merge: 5fb023fb414a fa15bafb71fd
Author: Jens Axboe <axboe@xxxxxxxxx>
Date:   Sun Aug 2 10:43:35 2020 -0600

    Merge branch 'for-5.9/io_uring' into test

    * for-5.9/io_uring: (127 commits)
      io_uring: flip if handling after io_setup_async_rw
      fs: optimise kiocb_set_rw_flags()
      io_uring: don't touch 'ctx' after installing file descriptor
      io_uring: get rid of atomic FAA for cq_timeouts
      io_uring: consolidate *_check_overflow accounting
      io_uring: fix stalled deferred requests
      io_uring: fix racy overflow count reporting
      io_uring: deduplicate __io_complete_rw()
      io_uring: de-unionise io_kiocb
      io-wq: update hash bits
      io_uring: fix missing io_queue_linked_timeout()
      io_uring: mark ->work uninitialised after cleanup
      io_uring: deduplicate io_grab_files() calls
      io_uring: don't do opcode prep twice
      io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
      io_uring: batch put_task_struct()
      tasks: add put_task_struct_many()
      io_uring: return locked and pinned page accounting
      io_uring: don't miscount pinned memory
      io_uring: don't open-code recv kbuf managment
      ...

    Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>

diff --cc block/blk-core.c
index 93104c7470e8,62a4904db921..d9d632639bd1
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@@ -956,13 -952,30 +956,18 @@@ static inline blk_status_t blk_check_zo
  	return BLK_STS_OK;
  }
  
 -static noinline_for_stack bool
 -generic_make_request_checks(struct bio *bio)
 +static noinline_for_stack bool submit_bio_checks(struct bio *bio)
  {
 -	struct request_queue *q;
 -	int nr_sectors = bio_sectors(bio);
 +	struct request_queue *q = bio->bi_disk->queue;
  	blk_status_t status = BLK_STS_IOERR;
+ 	struct blk_plug *plug;
 -	char b[BDEVNAME_SIZE];
  
  	might_sleep();
  
 -	q = bio->bi_disk->queue;
 -	if (unlikely(!q)) {
 -		printk(KERN_ERR
 -		       "generic_make_request: Trying to access "
 -			"nonexistent block-device %s (%Lu)\n",
 -			bio_devname(bio, b), (long long)bio->bi_iter.bi_sector);
 -		goto end_io;
 -	}
 -
+ 	plug = blk_mq_plug(q, bio);
+ 	if (plug && plug->nowait)
+ 		bio->bi_opf |= REQ_NOWAIT;
+ 
  	/*
  	 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
  	 * if queue is not a request based queue.
diff --cc include/linux/fs.h
index 41cd993ec0f6,e535543d31d9..b7f1f1b7d691
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@@ -315,7 -318,8 +318,9 @@@ enum rw_hint
  #define IOCB_SYNC		(1 << 5)
  #define IOCB_WRITE		(1 << 6)
  #define IOCB_NOWAIT		(1 << 7)
+ /* iocb->ki_waitq is valid */
+ #define IOCB_WAITQ		(1 << 8)
 +#define IOCB_NOIO		(1 << 9)
  
  struct kiocb {
  	struct file		*ki_filp;
diff --cc mm/filemap.c
index 385759c4ce4b,a5b1fa8f7ce4..4e39c1f4c7d9
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@@ -2028,8 -2044,6 +2044,8 @@@ find_page
  
  		page = find_get_page(mapping, index);
  		if (!page) {
- 			if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO))
++			if (iocb->ki_flags & IOCB_NOIO)
 +				goto would_block;
  			page_cache_sync_readahead(mapping,
  					ra, filp,
  					index, last_index - index);
@@@ -2164,7 -2185,7 +2191,7 @@@ page_not_up_to_date_locked
  		}
  
  readpage:
- 		if (iocb->ki_flags & IOCB_NOIO) {
 -		if (iocb->ki_flags & IOCB_NOWAIT) {
++		if (iocb->ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) {
  			unlock_page(page);
  			put_page(page);
  			goto would_block;
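
As a quick sanity check of the mm/filemap.c resolution above, the two merged conditions can be modelled as plain predicates. This is a minimal sketch, assuming only the flag values shown in the include/linux/fs.h hunk. The function names would_block_on_missing_page(), would_block_on_readpage() and iocb_flags_demo() are made up for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as they appear in the resolved include/linux/fs.h hunk. */
#define IOCB_NOWAIT	(1 << 7)
#define IOCB_WAITQ	(1 << 8)
#define IOCB_NOIO	(1 << 9)

/*
 * Hypothetical helper modelling the find_page check: when the page is not
 * in the cache, only IOCB_NOIO bails out. IOCB_NOWAIT alone still lets
 * read-ahead be started.
 */
static bool would_block_on_missing_page(int ki_flags)
{
	return (ki_flags & IOCB_NOIO) != 0;
}

/*
 * Hypothetical helper modelling the readpage check: either flag bails out
 * rather than going on to read the page.
 */
static bool would_block_on_readpage(int ki_flags)
{
	return (ki_flags & (IOCB_NOWAIT | IOCB_NOIO)) != 0;
}

/* Spell out the four interesting flag combinations. */
static void iocb_flags_demo(void)
{
	assert(!would_block_on_missing_page(0));
	assert(!would_block_on_missing_page(IOCB_NOWAIT));
	assert(would_block_on_missing_page(IOCB_NOIO));
	assert(would_block_on_missing_page(IOCB_NOWAIT | IOCB_NOIO));

	assert(!would_block_on_readpage(0));
	assert(would_block_on_readpage(IOCB_NOWAIT));
	assert(would_block_on_readpage(IOCB_NOIO));
	assert(would_block_on_readpage(IOCB_NOWAIT | IOCB_NOIO));
}
```

The asymmetry is the whole resolution: the for-5.9 side dropped the IOCB_NOWAIT test before read-ahead, the 5.8-rc side added IOCB_NOIO to both places, and the merged result keeps each side's intent.
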