fio/t/io_uring -s32 -d32 -c32 -N1

          | baseline | 0-15     | 0-16       | diff
setup 1:  | 34 MIOPS | 42 MIOPS | 42.2 MIOPS | 25 %
setup 2:  | 31 MIOPS | 31 MIOPS | 32 MIOPS   | ~3 %

Setup 1 gets 25% performance improvement, which is unexpected and a
share of it should be accounted as compiler/HW magic. Setup 2 is just
3%, but the catch is that some of the patches _very_ unexpectedly sink
performance, so it's more like

31 MIOPS -> 29 -> 30 -> 29 -> 31 -> 32

I'd suggest to leave 16/16 aside, maybe for future consideration and
refinement. The end result is not very clear, I'd expect probably around
3-5% with a more stable setup for nops32, and a better win for
io_cqring_ev_posted() intensive cases like BPF.

Pavel Begunkov (16):
  io_uring: optimise kiocb layout
  io_uring: add more likely/unlikely() annotations
  io_uring: delay req queueing into compl-batch list
  io_uring: optimise request allocation
  io_uring: optimise INIT_WQ_LIST
  io_uring: don't wake sqpoll in io_cqring_ev_posted
  io_uring: merge CQ and poll waitqueues
  io_uring: optimise ctx referencing by requests
  io_uring: mark cold functions
  io_uring: optimise io_free_batch_list()
  io_uring: control ->async_data with a REQ_F flag
  io_uring: remove struct io_completion
  io_uring: inline io_req_needs_clean()
  io_uring: inline io_poll_complete
  io_uring: correct fill events helpers types
  io_uring: mark hot functions

 fs/io-wq.h    |   1 -
 fs/io_uring.c | 390 ++++++++++++++++++++++++++------------------------
 2 files changed, 205 insertions(+), 186 deletions(-)

-- 
2.33.0