On 3/27/24 7:33 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Hi,
>>
>> This converts the deferred, normal, and fallback task_work to use a
>> normal io_wq_work_list, rather than an llist.
>>
>> The main motivation behind this is to get rid of the need to reverse
>> the list once it's deleted and run. I tested this basic conversion of
>> just switching it from an llist to an io_wq_work_list with a spinlock,
>> and I don't see any benefits from the lockless list. And for cases where
>> we get a bursty addition of task_work, this approach is faster as it
>> avoids the need to iterate the list upfront while reversing it.
>
> I'm curious how you benchmarked it including accounting of irq/softirq
> where tw add usually happens?

Performance runs and profiles. I tested send zc with small packets, as
that is task_work intensive and exhibits the bursty behavior I mentioned
in the patch / cover letter. And normal storage IO, IRQ driven.

For send zc, we're spending about 2% of the time doing list reversal,
and I've seen as high as 5% in other testing. And as that test is CPU
bound, performance is up about 2% as well.

With the patches, task_work adding accounts for about 0.25% of the
cycles, before it's about 0.66%. We're spending a bit more time in
__io_run_local_work(), but I think that's deceptive, as we have to
disable/enable interrupts now. If an interrupt triggers on the unlock,
that time tends to be attributed there in terms of cycles.

> One known problem with the current list approach I mentioned several
> times before is that it peeks at the previous queued tw to count them.
> It's not nice, but that can be easily done with cmpxchg double. I
> wonder how much of an issue is that.

That's more of a wart than a real issue, though, and this approach
obviously doesn't do that. And then we can drop the RCU section around
adding local task_work. Not a huge deal, but still nice.

>> And this is less code and simpler, so I'd prefer to go that route.
>
> I'm not sure it's less code, if you return optimisations that I
> believe were killed, see comments to patch 2, it might turn out to
> be even bulkier and not that simpler.

It's still considerably less:

 3 files changed, 59 insertions(+), 84 deletions(-)

though that isn't conclusive by itself, as e.g. io_llist_xchg() goes
away, which has some comments. But I do think the resulting code is
simpler and more straightforward.

-- 
Jens Axboe