Patches 1 and 2 are simple and can be considered separately. Patches 3-8
are inline completion optimisations and should affect buffered rw,
recv/send and anything else that can complete inline.

fio/t/io_uring do_nop=1 benchmark (batch=32), in KIOPS:

baseline (1-5 applied):  qd32: 8001,   qd1: 2015
arrays (+6/8):           qd32: 8128,   qd1: 2028
batching (+7/8):         qd32: 10300,  qd1: 1946

The downside is slightly worse qd1 with batching. I don't think we
should care much, because at qd1 most of the time is spent syscalling,
and I can easily get ~15-30% and 5-10% for qd32 and qd1 respectively by
making the ring's allocation cache persistent and feeding the memory of
inline-executed requests back into it.

Note: this should not affect async-executed requests, e.g. block rw,
because they never hit this path.

Pavel Begunkov (8):
  io_uring: ensure only sqo_task has file notes
  io_uring: consolidate putting reqs task
  io_uring: don't keep submit_state on stack
  io_uring: remove ctx from comp_state
  io_uring: don't reinit submit state every time
  io_uring: replace list with array for compl batch
  io_uring: submit-completion free batching
  io_uring: keep interrupts on on submit completion

 fs/io_uring.c | 221 +++++++++++++++++++++++++-------------------------
 1 file changed, 110 insertions(+), 111 deletions(-)

-- 
2.24.0
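
For readers skimming the series, below is a rough, simplified sketch of
the idea behind patches 6/7 (array-based completion batch plus batched
freeing). This is not the fs/io_uring.c code; the struct layout, names
(struct submit_state, flush_completions(), COMPL_BATCH) and the plain
free() are illustrative stand-ins for the kernel's internals.

#include <stdlib.h>

#define COMPL_BATCH 32

struct req {
	long user_data;
	long result;
};

struct submit_state {
	/* fixed array instead of linking each request into a list */
	struct req *compl_reqs[COMPL_BATCH];
	unsigned int compl_nr;
	/* request memory collected here and handed back in bulk */
	void *free_reqs[COMPL_BATCH];
	unsigned int free_nr;
};

/* Post completions for the whole batch, then free request memory in bulk. */
static void flush_completions(struct submit_state *state)
{
	unsigned int i;

	/* in the real code the completion lock is taken once per batch here */
	for (i = 0; i < state->compl_nr; i++) {
		struct req *req = state->compl_reqs[i];

		/* filling a CQE for req would go here */
		state->free_reqs[state->free_nr++] = req;
	}
	state->compl_nr = 0;

	/* batched free: one pass instead of freeing every request separately */
	for (i = 0; i < state->free_nr; i++)
		free(state->free_reqs[i]);
	state->free_nr = 0;
}

/* Called when a request completes inline during submission. */
static void complete_inline(struct submit_state *state, struct req *req)
{
	state->compl_reqs[state->compl_nr++] = req;
	if (state->compl_nr == COMPL_BATCH)
		flush_completions(state);
}

The point of the array is that the batch has a known, small upper bound,
so the flush walks contiguous memory and amortises the locking and the
frees over up to COMPL_BATCH requests, which is where the qd32 gain in
the numbers above comes from.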