Re: [PATCH 0/2] task_put batching

Pavel Begunkov <asml.silence@xxxxxxxxx> · Sun, 19 Jul 2020 14:15:21 +0300

On 18/07/2020 17:37, Jens Axboe wrote:
> On 7/18/20 2:32 AM, Pavel Begunkov wrote:
>> For my somewhat exaggerated test case, perf continues to show high CPU
>> consumption by io_dismantle(), and so by its caller io_iopoll_complete().
>> The patch doesn't yield a throughput increase for my setup, probably
>> because the effect is hidden behind polling, but it definitely improves
>> the relative percentages. And the difference should only grow with an
>> increasing number of CPUs. Another reason to have this is that atomics
>> may affect other parallel tasks (e.g. ones that don't use io_uring).
>>
>> before:
>> io_iopoll_complete: 5.29%
>> io_dismantle_req:   2.16%
>>
>> after:
>> io_iopoll_complete: 3.39%
>> io_dismantle_req:   0.465%
> 
> Still not seeing a win here, but it's clean and it _should_ work. For

Well, if this thing is useful, it'd be hard to quantify, because active
polling would hide it. I think it'd take applying a lot of isolated
pressure on cache synchronisation (e.g. spamming barriers), or creating
and measuring an atomic-heavy task pinned to another core. Not worth the
effort IMHO.

Just out of curiosity, how do you test it?
- Is it a VM?
- How many cores and threads do you use?
- How many io_uring instances do you have? One per thread?
- Does it all go to a single NVMe SSD?
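
FWIW, the "atomic-heavy task pinned to another core" idea could be as
small as the userspace sketch below. It is untested against this patch
and only illustrates the shape of such a probe; pin_to_cpu() and
hammer_atomics() are names made up here, and a main() that picks the CPU
and prints the result is left out.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Counter the probe hammers; hypothetical, not related to any kernel ref. */
static _Atomic uint64_t counter;

/* Pin the calling thread to a single CPU. Returns 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/* Do `iters` atomic increments and return the elapsed time in nanoseconds. */
static uint64_t hammer_atomics(uint64_t iters)
{
	struct timespec a, b;
	int64_t ns;

	clock_gettime(CLOCK_MONOTONIC, &a);
	for (uint64_t i = 0; i < iters; i++)
		atomic_fetch_add(&counter, 1);
	clock_gettime(CLOCK_MONOTONIC, &b);

	ns = (int64_t)(b.tv_sec - a.tv_sec) * 1000000000LL +
	     (b.tv_nsec - a.tv_nsec);
	return (uint64_t)ns;
}
```

The intended use would be to call pin_to_cpu() with a core next to the
one doing the polling, run hammer_atomics() with a fixed iteration count
while the io_uring load is active, and compare the elapsed time with and
without the patch.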

> some reason I end up getting the offset in task ref put growing the
> fput_many(). Which doesn't (on the surface) make a lot of sense, but
> may just mean that we have some weird side effects.

I'll take a look and see whether I can reproduce it.

-- 
Pavel Begunkov