On 19/07/2020 21:49, Jens Axboe wrote:
> On 7/19/20 5:15 AM, Pavel Begunkov wrote:
>> On 18/07/2020 17:37, Jens Axboe wrote:
>>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>>> For my somewhat exaggerated test case perf continues to show high CPU
>>>> consumption by io_dismantle(), and so calling it io_iopoll_complete().
>>>> Even though the patch doesn't yield throughput increase for my setup,
>>>> probably because the effect is hidden behind polling, it definitely
>>>> improves relative percentage. And the difference should only grow with
>>>> increasing number of CPUs. Another reason to have this is that atomics
>>>> may affect other parallel tasks (e.g. ones which don't use io_uring)
>>>>
>>>> before:
>>>> io_iopoll_complete: 5.29%
>>>> io_dismantle_req: 2.16%
>>>>
>>>> after:
>>>> io_iopoll_complete: 3.39%
>>>> io_dismantle_req: 0.465%
>>>
>>> Still not seeing a win here, but it's clean and it _should_ work. For
>>
>> Well, if this thing is useful, it'd be hard to quantify, because active
>> polling would hide it. I think, it'd need to apply a lot of isolated
>
> It should be very visible in my setup, as we're CPU limited, not device
> limited. Hence it makes it very easy to show CPU gains, as they directly
> translate into improved performance.

IIRC, atomics for x64 in a single thread don't hurt too much. Disregarding
this patch, it would be good to have a many-threaded benchmark to keep an
eye on scalability.

>> pressure on cache synchronisation (e.g. spam with barriers), or try to
>> create and measure an atomic heavy task pinned to another core. Not
>> worth the effort IMHO.
>>
>> Just out of curiosity, let me ask how do you test it?
>> - is it a VM?
>> - how many cores and threads do you use?
>> - how many io_uring instances do you have? Per thread?
>> - Does it all go to a single NVMe SSD?
>
> It's not a VM, it's a normal box. I'm using just one CPU, one thread,
> and just one NVMe device. That's my goto test for seeing if we reclaimed
> some CPU cycles.

Got it, thanks

--
Pavel Begunkov