For my somewhat exaggerated test case, perf continues to show high CPU
consumption by io_dismantle_req(), and so also by its caller,
io_iopoll_complete(). The patch doesn't yield a throughput increase on my
setup, probably because the effect is hidden behind polling, but it
definitely improves the relative percentages, and the difference should
only grow with an increasing number of CPUs. Another reason to have this
is that atomics may affect other tasks running in parallel (e.g. ones that
don't use io_uring).

before:
io_iopoll_complete: 5.29%
io_dismantle_req:   2.16%

after:
io_iopoll_complete: 3.39%
io_dismantle_req:   0.465%

Pavel Begunkov (2):
  tasks: add put_task_struct_many()
  io_uring: batch put_task_struct()

 fs/io_uring.c              | 28 ++++++++++++++++++++++++++--
 include/linux/sched/task.h |  6 ++++++
 2 files changed, 32 insertions(+), 2 deletions(-)

-- 
2.24.0
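
For illustration only, the batching idea can be sketched in userspace C11.
This is not the kernel code from the patches: struct obj, struct put_batch,
batch_put(), batch_flush(), demo_remaining() and the atomic_ops counter are
all hypothetical names, and obj_put_many() only mimics what a helper like
put_task_struct_many() might do, assuming it drops several references with
one atomic subtraction. Freeing the object on the last put is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted object such as a task. */
struct obj {
	atomic_int usage;
};

/* Counts atomic read-modify-write operations, for illustration only. */
static int atomic_ops;

/* Drop nr references at once; returns 1 if these were the last ones. */
static int obj_put_many(struct obj *o, int nr)
{
	atomic_ops++;
	return atomic_fetch_sub(&o->usage, nr) == nr;
}

/* Pending puts accumulated for a single object. */
struct put_batch {
	struct obj *o;
	int nr;
};

/* Issue the accumulated puts as one atomic operation. */
static int batch_flush(struct put_batch *b)
{
	int last = 0;

	if (b->o) {
		assert(b->nr > 0);
		last = obj_put_many(b->o, b->nr);
	}
	b->o = NULL;
	b->nr = 0;
	return last;
}

/* Queue one put; flush first if the object differs from the batched one. */
static void batch_put(struct put_batch *b, struct obj *o)
{
	if (b->o != o) {
		batch_flush(b);
		b->o = o;
	}
	b->nr++;
}

/* Drop `drops` references on one object via the batch; return what is left. */
static int demo_remaining(int initial, int drops)
{
	struct obj t;
	struct put_batch b = { NULL, 0 };

	atomic_init(&t.usage, initial);
	atomic_ops = 0;
	for (int i = 0; i < drops; i++)
		batch_put(&b, &t);
	batch_flush(&b);
	return atomic_load(&t.usage);
}
```

In this sketch, dropping eight references to the same object costs one
atomic subtraction instead of eight, which is the kind of saving the
percentages above are meant to reflect when many completed requests belong
to the same task.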