On 11/24/21 12:21, Hao Xu wrote:
> v4->v5
> - change the implementation of merge_wq_list
The only concern I had was about 6/6 not using the inline completion
infra when it's faster to grab ->uring_lock, i.e.
io_submit_flush_completions(), which should be faster when batching
is good. Looking again through the code, the only user is SQPOLL:

io_req_task_work_add(req, !!(req->ctx->flags & IORING_SETUP_SQPOLL));

And with SQPOLL the lock is mostly grabbed by the SQPOLL task only,
IOW for pure block rw there shouldn't be any contention. That doesn't
make much sense, so what am I missing? How many requests are completed
on average per tctx_task_work()?

It doesn't apply to for-5.17/io_uring, here is a rebase:

https://github.com/isilence/linux.git haoxu_tw_opt
link: https://github.com/isilence/linux/tree/haoxu_tw_opt

With that, the first 5 patches look good, so for them:

Reviewed-by: Pavel Begunkov <asml.silence@xxxxxxxxx>

But I still don't understand how 6/6 is better. Can it be because of
indirect branching? E.g. would something like this give the same result?

- req->io_task_work.func(req, locked);
+ INDIRECT_CALL_1(req->io_task_work.func, io_req_task_complete, req, locked);
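For reference, here is a minimal, self-contained model of the mechanism
behind that suggestion. The struct below is an illustrative stand-in for
the kernel types, and the printf path is just for demonstration; the
real macro lives in include/linux/indirect_call_wrapper.h and only
devirtualizes on CONFIG_RETPOLINE builds:

#include <stdbool.h>
#include <stdio.h>

#define likely(x) __builtin_expect(!!(x), 1)

/* Illustrative stand-in for struct io_kiocb; not the real layout. */
struct io_kiocb {
	struct {
		void (*func)(struct io_kiocb *req, bool *locked);
	} io_task_work;
};

/*
 * Model of INDIRECT_CALL_1(): if the function pointer matches the
 * expected hot target, make a direct (statically predictable) call;
 * fall back to the indirect call otherwise. On retpoline kernels the
 * direct path skips the speculation barrier entirely.
 */
#define INDIRECT_CALL_1(f, f1, ...)					\
	({								\
		likely(f == f1) ? f1(__VA_ARGS__) : f(__VA_ARGS__);	\
	})

static void io_req_task_complete(struct io_kiocb *req, bool *locked)
{
	printf("fast path via direct call, locked=%d\n", *locked);
}

int main(void)
{
	struct io_kiocb req = { .io_task_work.func = io_req_task_complete };
	bool locked = true;

	/* The suggested replacement for req->io_task_work.func(req, locked); */
	INDIRECT_CALL_1(req->io_task_work.func, io_req_task_complete,
			req, locked);
	return 0;
}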
> Hao Xu (6):
>   io-wq: add helper to merge two wq_lists
>   io_uring: add a priority tw list for irq completion work
>   io_uring: add helper for task work execution code
>   io_uring: split io_req_complete_post() and add a helper
>   io_uring: move up io_put_kbuf() and io_put_rw_kbuf()
>   io_uring: batch completion in prior_task_list
>
>  fs/io-wq.h    |  22 +++++++
>  fs/io_uring.c | 158 +++++++++++++++++++++++++++++++++-----------------
>  2 files changed, 128 insertions(+), 52 deletions(-)
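Since 1/6 (io-wq: add helper to merge two wq_lists) isn't quoted here,
a hypothetical sketch of what an O(1) merge helper for the io-wq singly
linked lists could look like; the type definitions are simplified and
the in-tree name and calling convention may differ:

/* Simplified stand-ins for the io-wq list types in fs/io-wq.h. */
struct io_wq_work_node {
	struct io_wq_work_node *next;
};

struct io_wq_work_list {
	struct io_wq_work_node *first;
	struct io_wq_work_node *last;
};

/*
 * Splice list1 onto the tail of list0 in O(1) using the cached tail
 * pointer, leaving the combined list in list0 and list1 empty.
 */
static inline void wq_list_merge(struct io_wq_work_list *list0,
				 struct io_wq_work_list *list1)
{
	if (!list1->first)
		return;

	if (!list0->first)
		list0->first = list1->first;
	else
		list0->last->next = list1->first;

	list0->last = list1->last;
	list1->first = list1->last = NULL;
}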
--
Pavel Begunkov