There are two main lines of changes intertwined here. The first is pt.2 of the ctx field shuffling for better caching; a couple of things are still left on that front. The second is optimising the (presumably) rarely used offset-based timeouts and draining. There is a downside (see 12/12), which will be fixed later. The plan is to queue a task_work clearing drain_used (under uring_lock) from io_queue_deferred() once all drainees are gone.

nops(batch=32):
15.9 MIOPS vs 17.3 MIOPS

nullblk (irqmode=2 completion_nsec=0 submit_queues=16), no merges, no stat:
1002 KIOPS vs 1050 KIOPS

Though the second test ran much slower than what I've seen before, so it might not be representative.

Pavel Begunkov (12):
  io_uring: keep SQ pointers in a single cacheline
  io_uring: move ctx->flags from SQ cacheline
  io_uring: shuffle more fields into SQ ctx section
  io_uring: refactor io_get_sqe()
  io_uring: don't cache number of dropped SQEs
  io_uring: optimise completion timeout flushing
  io_uring: small io_submit_sqe() optimisation
  io_uring: clean up check_overflow flag
  io_uring: wait heads renaming
  io_uring: move uring_lock location
  io_uring: refactor io_req_defer()
  io_uring: optimise non-drain path

 fs/io_uring.c | 226 +++++++++++++++++++++++++-------------------------
 1 file changed, 111 insertions(+), 115 deletions(-)

-- 
2.31.1