For DEFER_TASKRUN rings, replace CQ waitqueues with a custom implementation
based on the fact that only one task may be waiting for completions. Also,
improve deferred task running by removing one atomic in patch 11.

Benchmarking QD1 with simulated tw arrival right after we start waiting:
7.5 MIOPS -> 9.3 MIOPS (+23%), where half of the CPU cycles go to syscall
overhead.

v2: remove merged cleanups and add new ones
    add 11/11 removing one extra atomic
    a small sync adjustment in 10/10
    add extra comments

Pavel Begunkov (11):
  io_uring: move submitter_task out of cold cacheline
  io_uring: refactor io_wake_function
  io_uring: don't set TASK_RUNNING in local tw runner
  io_uring: mark io_run_local_work static
  io_uring: move io_run_local_work_locked
  io_uring: separate wq for ring polling
  io_uring: add lazy poll_wq activation
  io_uring: wake up optimisations
  io_uring: waitqueue-less cq waiting
  io_uring: add io_req_local_work_add wake fast path
  io_uring: optimise deferred tw execution

 include/linux/io_uring_types.h |  15 +--
 io_uring/io_uring.c            | 161 ++++++++++++++++++++++++++-------
 io_uring/io_uring.h            |  28 ++----
 3 files changed, 144 insertions(+), 60 deletions(-)

-- 
2.38.1