io_uring extensively uses task_work, but when a task is waiting, every
newly queued task_work batch will try to wake it up, causing a lot of
scheduling activity. This series optimises that. For now it is applied
specifically to rw completions and send-zc notifications, and it will be
helpful for further optimisations.

Quick testing shows results similar to v1. Numbers from v1:

In my zc net test, which once in a while waits for a portion of buffers,
I got a 10x decrease in the number of context switches and a 2x
improvement in CPU util (17% vs 8%). In profiles, io_cqring_work() went
down from 40-50% of CPU to ~13%.

There is also an improvement on the softirq side for io_uring
notifications, as io_req_local_work_add() doesn't trigger wake_up() as
often. System-wide profiles show a reduction in cycles taken by
io_req_local_work_add() from 3% to 0.5%, which is mostly not reflected
in the numbers above as it was firing off of a different CPU.

v2: Remove atomic decrements by the queueing side and instead carry all
    the info in requests. It's definitely simpler and removes extra
    atomics; the downside is touching the previous request, which might
    not be cached.

    Add a couple of patches from the backlog optimising and cleaning up
    io_req_local_work_add().

Pavel Begunkov (8):
  io_uring: move pinning out of io_req_local_work_add
  io_uring: optimise local tw add ctx pinning
  io_uring: refactor __io_cq_unlock_post_flush()
  io_uring: add tw add flags
  io_uring: inline llist_add()
  io_uring: reduce scheduling due to tw
  io_uring: refactor __io_cq_unlock_post_flush()
  io_uring: optimise io_req_local_work_add

 include/linux/io_uring_types.h |   3 +-
 io_uring/io_uring.c            | 131 ++++++++++++++++++++++-----------
 io_uring/io_uring.h            |  29 +++++---
 io_uring/notif.c               |   2 +-
 io_uring/notif.h               |   2 +-
 io_uring/rw.c                  |   2 +-
 6 files changed, 110 insertions(+), 59 deletions(-)

-- 
2.40.0