tested with fio/t/io_uring nops all batching=32: 24 vs 31.5 MIOPS, or ~30% win

WARNING: there is one problem with draining, will fix in v2

There are two parts:

1-14 are about optimising the completion path:
- replaces lists with single linked lists
- kills 64 * 8B of caches in ctx
- adds some shuffling of iopoll bits
- list splice instead of per-req list_add in one place
- inlines io_req_free_batch() and other helpers

15-22: inlines __io_queue_sqe() so all the submission path up to
io_issue_sqe() is inlined + little tweaks

Pavel Begunkov (23):
  io_uring: mark having different creds unlikely
  io_uring: force_nonspin
  io_uring: make io_do_iopoll return number of reqs
  io_uring: use slist for completion batching
  io_uring: remove allocation cache array
  io-wq: add io_wq_work_node based stack
  io_uring: replace list with stack for req caches
  io_uring: split iopoll loop
  io_uring: use single linked list for iopoll
  io_uring: add a helper for batch free
  io_uring: convert iopoll_completed to store_release
  io_uring: optimise batch completion
  io_uring: inline completion batching helpers
  io_uring: don't pass tail into io_free_batch_list
  io_uring: don't pass state to io_submit_state_end
  io_uring: deduplicate io_queue_sqe() call sites
  io_uring: remove drain_active check from hot path
  io_uring: split slow path from io_queue_sqe
  io_uring: inline hot path of __io_queue_sqe()
  io_uring: reshuffle queue_sqe completion handling
  io_uring: restructure submit sqes to_submit checks
  io_uring: kill off ->inflight_entry field
  io_uring: comment why inline complete calls io_clean_op()

 fs/io-wq.h    |  60 +++++-
 fs/io_uring.c | 503 +++++++++++++++++++++++---------------------------
 2 files changed, 283 insertions(+), 280 deletions(-)

-- 
2.33.0