The series replaces waitqueues for CQ waiting with a custom waiting
loop and adds a couple more perf tweaks around it. Benchmarking was
done for QD1 with simulated tw arrival right after we start waiting;
it gets us from 7.5 MIOPS to 9.2 MIOPS, which is +22%, or double that
number counting only the in-kernel io_uring overhead (i.e. excluding
syscall and userspace). That matches the profiles: wake_up() _without_
wake_up_state() was taking 12-14%, and prepare_to_wait_exclusive() was
around 4-6%. Another 15% reported in v1 is no longer there, as it got
optimised in the meantime by 52ea806ad9834 ("io_uring: finish waiting
before flushing overflow entries"). So, compared to a couple of weeks
ago, the perf of this test case should have jumped by more than 30%
end-to-end (again, only about half of the cycles are spent in io_uring
kernel code). A rough sketch of the waiting-loop idea is appended at
the end of this mail.

Patches 1-8 are preparation patches and might be taken right away.
The rest need more comments and maybe a little brushing up.

Pavel Begunkov (13):
  io_uring: rearrange defer list checks
  io_uring: don't iterate cq wait fast path
  io_uring: kill io_run_task_work_ctx
  io_uring: move defer tw task checks
  io_uring: parse check_cq out of wq waiting
  io_uring: mimimise io_cqring_wait_schedule
  io_uring: simplify io_has_work
  io_uring: set TASK_RUNNING right after schedule
  io_uring: separate wq for ring polling
  io_uring: add lazy poll_wq activation
  io_uring: wake up optimisations
  io_uring: waitqueue-less cq waiting
  io_uring: add io_req_local_work_add wake fast path

 include/linux/io_uring_types.h |   4 +
 io_uring/io_uring.c            | 194 +++++++++++++++++++++++----------
 io_uring/io_uring.h            |  35 +++---
 3 files changed, 155 insertions(+), 78 deletions(-)

--
2.38.1
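
For illustration, here is a minimal sketch of the waitqueue-less
waiting scheme, not the actual patches: the single CQ waiter parks
itself with set_current_state() and re-checks the completion count in
a loop, while the completion side wakes the known waiter task directly
with wake_up_state() instead of walking a waitqueue. cq_ready() is a
made-up helper standing in for the real readiness check, and using
ctx->submitter_task as the waiter is an assumption of the sketch.

	/* waiter side: no prepare_to_wait_exclusive()/finish_wait() */
	static int cq_wait_sketch(struct io_ring_ctx *ctx, unsigned nr_events)
	{
		for (;;) {
			/* publish the sleep state before re-checking the CQ */
			set_current_state(TASK_INTERRUPTIBLE);
			if (cq_ready(ctx, nr_events))	/* illustrative helper */
				break;
			if (signal_pending(current)) {
				__set_current_state(TASK_RUNNING);
				return -EINTR;
			}
			schedule();
		}
		__set_current_state(TASK_RUNNING);
		return 0;
	}

	/* completion side: wake the task directly, no waitqueue walk */
	static void cq_wake_sketch(struct io_ring_ctx *ctx)
	{
		/* assumes a single known waiter task, as in the sketch */
		struct task_struct *task = READ_ONCE(ctx->submitter_task);

		/*
		 * CQE posting happens before this call; set_current_state()
		 * on the waiter side pairs with the barriers in wakeup, so
		 * the waiter either sees the new CQE or gets woken.
		 */
		if (task)
			wake_up_state(task, TASK_INTERRUPTIBLE);
	}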