Unfolding previous ideas on persistent req caches. Patches 4-7 inclusive
slashed ~20% of overhead for the nops benchmark. I haven't benchmarked the
whole series personally yet, but according to perf it should be ~30-40% in
total. That's for the IOPOLL + inline completion cases, obviously w/o
async/IRQ completions.

Jens,

1. 11/17 removes deallocations at the end of submit_sqes. Looks like you
   forgot or just didn't do that.

2. Lists are slow and not great cache-wise, that's why I want at least the
   combined approach from 12/17.

3. Instead of lists in "use persistent request cache" I had in mind a
   slightly different way: grow the req alloc cache to 32-128 entries (or
   take a hint from userspace), batch-alloc by 8 as before, and recycle
   _all_ reqs right back into it. If it overflows, do kfree(). That should
   give a probabilistically high hit rate, amortising out most allocations.
   Pros: it doesn't grow ~infinitely as lists can. Cons: there are always
   counter examples. But as I don't have numbers to back it up, I took your
   implementation. Maybe we'll reconsider later (see the sketch at the end
   of this mail).

I'll revise tomorrow on a fresh head + do some performance testing, and am
leaving it as an RFC until then.

Jens Axboe (3):
  io_uring: use persistent request cache
  io_uring: provide FIFO ordering for task_work
  io_uring: enable req cache for task_work items

Pavel Begunkov (14):
  io_uring: replace force_nonblock with flags
  io_uring: make op handlers always take issue flags
  io_uring: don't propagate io_comp_state
  io_uring: don't keep submit_state on stack
  io_uring: remove ctx from comp_state
  io_uring: don't reinit submit state every time
  io_uring: replace list with array for compl batch
  io_uring: submit-completion free batching
  io_uring: remove fallback_req
  io_uring: count ctx refs separately from reqs
  io_uring: persistent req cache
  io_uring: feed reqs back into alloc cache
  io_uring: take comp_state from ctx
  io_uring: defer flushing cached reqs

 fs/io-wq.h               |   9 -
 fs/io_uring.c            | 716 ++++++++++++++++++++++-----------------
 include/linux/io_uring.h |  14 +
 3 files changed, 425 insertions(+), 314 deletions(-)

--
2.24.0
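
P.S. To illustrate what I mean in point 3, here is a minimal userspace
sketch of the bounded alloc-cache recycling. It is not code from the
series: the names and sizes are made up, and malloc/free stand in for the
slab allocator.

#include <stdlib.h>

#define REQ_CACHE_MAX	64	/* would be 32-128, or a hint from userspace */
#define REQ_ALLOC_BATCH	8	/* batch-alloc by 8 as before */

struct req { int dummy; /* request fields */ };

struct req_cache {
	struct req *reqs[REQ_CACHE_MAX];
	unsigned nr;
};

/* Take a req from the cache, refilling it in batches of 8 on a miss. */
static struct req *req_alloc(struct req_cache *c)
{
	if (!c->nr) {
		unsigned i;

		for (i = 0; i < REQ_ALLOC_BATCH; i++) {
			struct req *r = malloc(sizeof(*r));

			if (!r)
				break;
			c->reqs[c->nr++] = r;
		}
		if (!c->nr)
			return NULL;
	}
	return c->reqs[--c->nr];
}

/* Recycle every completed req into the cache; free only on overflow. */
static void req_free(struct req_cache *c, struct req *r)
{
	if (c->nr < REQ_CACHE_MAX)
		c->reqs[c->nr++] = r;
	else
		free(r);
}

int main(void)
{
	struct req_cache cache = { .nr = 0 };
	struct req *r = req_alloc(&cache);

	if (r)
		req_free(&cache, r); /* goes back into the cache, not the allocator */
	return 0;
}

The point of the fixed-size array is that it stays bounded instead of
growing ~infinitely as a list can, while recycling every freed req keeps
the hit rate high in steady state.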