On 18/07/2020 17:37, Jens Axboe wrote:
> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>> For my a bit exaggerated test case perf continues to show high CPU
>> consumption by io_dismantle(), and so calling it io_iopoll_complete().
>> Even though the patch doesn't yield throughput increase for my setup,
>> probably because the effect is hidden behind polling, but it definitely
>> improves relative percentage. And the difference should only grow with
>> increasing number of CPUs. Another reason to have this is that atomics
>> may affect other parallel tasks (e.g. which doesn't use io_uring)
>>
>> before:
>> io_iopoll_complete: 5.29%
>> io_dismantle_req: 2.16%
>>
>> after:
>> io_iopoll_complete: 3.39%
>> io_dismantle_req: 0.465%
>
> Still not seeing a win here, but it's clean and it _should_ work. For

Well, if this thing is useful, it'd be hard to quantify, because active
polling would hide it. I think it'd need to apply a lot of isolated
pressure on cache synchronisation (e.g. spam with barriers), or try to
create and measure an atomic-heavy task pinned to another core. Not
worth the effort IMHO.

Just out of curiosity, let me ask how you test it:
- is it a VM?
- how many cores and threads do you use?
- how many io_uring instances do you have? Per thread?
- does it all go to a single NVMe SSD?

> some reason I end up getting the offset in task ref put growing the
> fput_many(). Which doesn't (on the surface) make a lot of sense, but
> may just mean that we have some weird side effects.

I'll take a look whether I can reproduce.

--
Pavel Begunkov