While optimizing my io_uring-based web server, I found that the kernel spends 35% of its CPU time waiting for `io_wq_acct.lock`. This patch set reduces contention on this lock, though I believe much more should be done to allow more worker concurrency.

I measured these patches with my HTTP server (serving static files and running a tiny PHP script) and with a micro-benchmark that submits millions of `IORING_OP_NOP` entries (with `IOSQE_ASYNC` to force offloading the operation to a worker, so this offload overhead can be measured).

Some of the optimizations eliminate memory accesses, e.g. by passing values that are already known to (inlined) functions and by caching values in local variables. These are useful optimizations, but they are too small to show up in a benchmark (there is too much noise). Other patches have a measurable effect; those contain benchmark numbers that I could reproduce in repeated runs, despite the noise.

I'm not confident about the correctness of the last patch ("io_uring: skip redundant poll wakeups"). It seemed like low-hanging fruit, so low that it made me suspicious. If this is a useful optimization, the idea could probably be ported to other wait_queue users, or even into the wait_queue library itself. What I'm not confident about is whether the optimization is valid or whether it may miss wakeups, leading to stalls. Please advise!
Total "perf diff" for `IORING_OP_NOP`:

    42.25%  -9.24%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     4.79%  +2.83%  [kernel.kallsyms]  [k] io_worker_handle_work
     7.23%  -1.41%  [kernel.kallsyms]  [k] io_wq_submit_work
     6.80%  +1.23%  [kernel.kallsyms]  [k] io_wq_free_work
     3.19%  +1.10%  [kernel.kallsyms]  [k] io_req_task_complete
     2.45%  +0.94%  [kernel.kallsyms]  [k] try_to_wake_up
             +0.81%  [kernel.kallsyms]  [k] io_acct_activate_free_worker
     0.79%  +0.64%  [kernel.kallsyms]  [k] __schedule

Serving static files with HTTP (send+receive on local+TCP, splice file->pipe->TCP):

    42.92%  -7.84%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     1.53%  -1.51%  [kernel.kallsyms]  [k] ep_poll_callback
     1.18%  +1.49%  [kernel.kallsyms]  [k] io_wq_free_work
     0.61%  +0.60%  [kernel.kallsyms]  [k] try_to_wake_up
     0.76%  -0.43%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.22%  -0.33%  [kernel.kallsyms]  [k] io_wq_submit_work

Running PHP script (send+receive on local+TCP, splice pipe->TCP):

    33.01%  -4.13%  [kernel.kallsyms]  [k] queued_spin_lock_slowpath
     1.57%  -1.56%  [kernel.kallsyms]  [k] ep_poll_callback
     1.36%  +1.19%  [kernel.kallsyms]  [k] io_wq_free_work
     0.94%  -0.61%  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
     2.56%  -0.36%  [kernel.kallsyms]  [k] io_wq_submit_work
     2.06%  +0.36%  [kernel.kallsyms]  [k] io_worker_handle_work
     1.00%  +0.35%  [kernel.kallsyms]  [k] try_to_wake_up

(The `IORING_OP_NOP` benchmark finishes after a hardcoded number of operations; the two HTTP benchmarks finish after a certain wallclock duration, and therefore more HTTP requests were handled.)
Max Kellermann (8):
  io_uring/io-wq: eliminate redundant io_work_get_acct() calls
  io_uring/io-wq: add io_worker.acct pointer
  io_uring/io-wq: move worker lists to struct io_wq_acct
  io_uring/io-wq: cache work->flags in variable
  io_uring/io-wq: do not use bogus hash value
  io_uring/io-wq: pass io_wq to io_get_next_work()
  io_uring: cache io_kiocb->flags in variable
  io_uring: skip redundant poll wakeups

 include/linux/io_uring_types.h |  10 ++
 io_uring/io-wq.c               | 230 +++++++++++++++++++--------------
 io_uring/io-wq.h               |   7 +-
 io_uring/io_uring.c            |  63 +++++----
 io_uring/io_uring.h            |   2 +-
 5 files changed, 187 insertions(+), 125 deletions(-)

-- 
2.45.2