On 20/07/2020 19:11, Jens Axboe wrote:
> On 7/20/20 10:06 AM, Pavel Begunkov wrote:
>> On 20/07/2020 18:49, Jens Axboe wrote:
>>> On 7/20/20 9:22 AM, Pavel Begunkov wrote:
>>>> On 18/07/2020 17:37, Jens Axboe wrote:
>>>>> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>>>>>> For my a bit exaggerated test case perf continues to show high CPU
>>>>>> cosumption by io_dismantle(), and so calling it io_iopoll_complete().
>>>>>> Even though the patch doesn't yield throughput increase for my setup,
>>>>>> probably because the effect is hidden behind polling, but it definitely
>>>>>> improves relative percentage. And the difference should only grow with
>>>>>> increasing number of CPUs. Another reason to have this is that atomics
>>>>>> may affect other parallel tasks (e.g. which doesn't use io_uring)
>>>>>>
>>>>>> before:
>>>>>> io_iopoll_complete: 5.29%
>>>>>> io_dismantle_req: 2.16%
>>>>>>
>>>>>> after:
>>>>>> io_iopoll_complete: 3.39%
>>>>>> io_dismantle_req: 0.465%
>>>>>
>>>>> Still not seeing a win here, but it's clean and it _should_ work. For
>>>>> some reason I end up getting the offset in task ref put growing the
>>>>> fput_many(). Which doesn't (on the surface) make a lot of sense, but
>>>>> may just mean that we have some weird side effects.
>>>>
>>>> It grows because the patch is garbage, the second condition is always false.
>>>> See the diff. Could you please drop both patches?
>>>
>>> Hah, indeed. With this on top, it looks like it should in terms of
>>> performance and profiles.
>>
>> It just shows, that it doesn't really matters for a single-threaded app,
>> as expected. Worth to throw some contention though. I'll think about
>> finding some time to get/borrow a multi-threaded one.
>
> But it kind of did here, ended up being mostly a wash in terms of perf
> here as my testing reported. With the incremental applied, it's up a bit
> over before the task put batching.

Hmm, I need to get used to the sensitivity of your box, that's a good one!
Do you mean that the buggy version without atomics was on par with not
having it at all, but the fixed/updated one is a bit faster? Sounds like
micro binary differences, like slightly altered jumps.

It'd also be interesting to know what degree of coalescing in
io_iopoll_complete() you manage to get with that.

>>> I can just fold this into the existing one, if you'd like.
>>
>> Would be nice. I'm going to double-check the counter and re-measure anyway.
>> BTW, how did you find it? A tool or a proc file would be awesome.
>
> For this kind of testing, I just use t/io_uring out of fio. It's probably
> the lowest overhead kind of tool:
>
> # sudo taskset -c 0 t/io_uring -b512 -p1 /dev/nvme2n1

I use io_uring-bench.c from time to time, but didn't know it continued
living under fio/t/. Thanks! I also put it under cshield for more
consistency, but it looks like io-wq ignores that.

-- 
Pavel Begunkov
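
P.S. To make the "degree of coalescing" question concrete, here is a rough user-space sketch of the idea behind task put batching. It is not the patch itself: struct task, struct req and put_task_refs_batched() are hypothetical stand-ins (not task_struct or io_kiocb), and it uses C11 atomics instead of kernel primitives. It assumes completions arrive as an array and only coalesces runs of consecutive requests from the same task, returning how many atomic operations were issued.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the kernel's task_struct / io_kiocb. */
struct task {
	atomic_int usage;	/* reference count held by in-flight requests */
};

struct req {
	struct task *task;	/* task that submitted this request */
};

/*
 * Drop one task reference per completed request, but coalesce each run
 * of consecutive requests from the same task into a single atomic
 * subtract. Returns the number of atomic operations issued, so
 * n / return value gives the average degree of coalescing.
 */
static size_t put_task_refs_batched(struct req *reqs, size_t n)
{
	struct task *cur = NULL;	/* task of the current run */
	int pending = 0;		/* refs accumulated for cur */
	size_t atomics = 0;

	for (size_t i = 0; i < n; i++) {
		assert(reqs[i].task != NULL);
		if (reqs[i].task != cur) {
			/* Task changed: flush what was batched so far. */
			if (cur) {
				atomic_fetch_sub(&cur->usage, pending);
				atomics++;
			}
			cur = reqs[i].task;
			pending = 0;
		}
		pending++;
	}
	/* Flush the final run, if any. */
	if (cur) {
		atomic_fetch_sub(&cur->usage, pending);
		atomics++;
	}
	return atomics;
}
```

With a single-threaded submitter every completion in a batch belongs to one task, so the whole batch collapses into one atomic. Interleaved completions from several tasks would coalesce less, which is where it would be interesting to see real numbers.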