On 3/27/24 10:36 AM, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases
>>> where we get a bursty addition of task_work, this approach is faster
>>> as it avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
>
> Performance runs and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I
> mentioned in the patch / cover letter. And normal storage IO, IRQ
> driven.
>
> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is
> CPU bound, performance is up about 2% as well.
>
> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it was about 0.66%.
>
> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed
> there in terms of cycles.

Forgot to mention the storage side - profiles look eerily similar in
terms of time spent in task work adding / running; the only real
difference is that the 1.9% spent in llist_reverse_order() is gone.

--
Jens Axboe
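
For reference, a minimal sketch of the two patterns being compared - a
lockless llist whose batch has to be reversed before it can run in
submission order, versus a spinlock-protected list that drains in FIFO
order but disables interrupts around the add - could look like the
below. This is illustrative kernel-style code, not the actual io_uring
patch: the my_tw struct, the handle() callback, and the use of a plain
list_head in place of io_wq_work_list are assumptions made for brevity.

/*
 * Illustrative sketch only - my_tw and handle() are made up, and a
 * plain list_head stands in for io_wq_work_list.
 */
#include <linux/llist.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct my_tw {
	struct llist_node llnode;	/* lockless llist variant */
	struct list_head list;		/* spinlock-protected variant */
};

void handle(struct my_tw *tw);		/* runs one piece of task_work */

/* llist: adds push onto the head, no lock and no irq toggling */
static void add_llist(struct my_tw *tw, struct llist_head *head)
{
	llist_add(&tw->llnode, head);
}

/* ...so the consumer pulls the batch in LIFO order and must reverse
 * it before running entries in submission order */
static void run_llist(struct llist_head *head)
{
	struct llist_node *node = llist_del_all(head);
	struct my_tw *tw, *tmp;

	node = llist_reverse_order(node);	/* the extra pass */
	llist_for_each_entry_safe(tw, tmp, node, llnode)
		handle(tw);
}

/* locked list: adds append to the tail with interrupts disabled... */
static void add_locked(struct my_tw *tw, struct list_head *head,
		       spinlock_t *lock)
{
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_add_tail(&tw->list, head);
	spin_unlock_irqrestore(lock, flags);
}

/* ...so the batch splices out already in FIFO order, no reversal */
static void run_locked(struct list_head *head, spinlock_t *lock)
{
	LIST_HEAD(batch);
	struct my_tw *tw, *tmp;
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_splice_init(head, &batch);
	spin_unlock_irqrestore(lock, flags);

	list_for_each_entry_safe(tw, tmp, &batch, list)
		handle(tw);
}

The trade-off in the thread above falls out of this shape: the reversal
pass in run_llist() is what shows up as llist_reverse_order() in the
profiles, while the irqsave/irqrestore pair in the locked variant is
where the extra cycles attributed to __io_run_local_work() come from.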