On 3/27/24 10:36 AM, Jens Axboe wrote:
> On 3/27/24 7:33 AM, Pavel Begunkov wrote:
>> On 3/26/24 18:42, Jens Axboe wrote:
>>> Hi,
>>>
>>> This converts the deferred, normal, and fallback task_work to use a
>>> normal io_wq_work_list, rather than an llist.
>>>
>>> The main motivation behind this is to get rid of the need to reverse
>>> the list once it's deleted and run. I tested this basic conversion of
>>> just switching it from an llist to an io_wq_work_list with a spinlock,
>>> and I don't see any benefits from the lockless list. And for cases
>>> where we get a bursty addition of task_work, this approach is faster
>>> as it avoids the need to iterate the list upfront while reversing it.
>>
>> I'm curious how you benchmarked it including accounting of irq/softirq
>> where tw add usually happens?
>
> Performance runs and profiles. I tested send zc with small packets, as
> that is task_work intensive and exhibits the bursty behavior I
> mentioned in the patch / cover letter. And normal storage IO, IRQ
> driven.
>
> For send zc, we're spending about 2% of the time doing list reversal,
> and I've seen it as high as 5% in other testing. And as that test is
> CPU bound, performance is up about 2% as well.
>
> With the patches, task work adding accounts for about 0.25% of the
> cycles, before it was about 0.66%.
>
> We're spending a bit more time in __io_run_local_work(), but I think
> that's deceptive as we have to disable/enable interrupts now. If an
> interrupt triggers on the unlock, that time tends to be attributed
> there in terms of cycles.

Forgot to mention the storage side - profiles look eerily similar in
terms of time spent in task work adding / running; the only real
difference is that the 1.9% spent in llist_reverse_order() is gone.

--
Jens Axboe
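
For reference, a minimal sketch of the two patterns being compared - a
lockless llist whose batch has to be reversed before it can run in
submission order, versus a spinlock-protected list that drains in FIFO
order but disables interrupts around the add - could look like the
below. This is illustrative kernel-style code, not the actual io_uring
patch: the my_tw struct, the handle() callback, and the use of a plain
list_head in place of io_wq_work_list are assumptions made for brevity.

/*
 * Illustrative sketch only - my_tw and handle() are made up, and a
 * plain list_head stands in for io_wq_work_list.
 */
#include <linux/llist.h>
#include <linux/list.h>
#include <linux/spinlock.h>

struct my_tw {
	struct llist_node llnode;	/* lockless llist variant */
	struct list_head list;		/* spinlock-protected variant */
};

void handle(struct my_tw *tw);		/* runs one piece of task_work */

/* llist: adds push onto the head, no lock and no irq toggling */
static void add_llist(struct my_tw *tw, struct llist_head *head)
{
	llist_add(&tw->llnode, head);
}

/* ...so the consumer pulls the batch in LIFO order and must reverse
 * it before running entries in submission order */
static void run_llist(struct llist_head *head)
{
	struct llist_node *node = llist_del_all(head);
	struct my_tw *tw, *tmp;

	node = llist_reverse_order(node);	/* the extra pass */
	llist_for_each_entry_safe(tw, tmp, node, llnode)
		handle(tw);
}

/* locked list: adds append to the tail with interrupts disabled... */
static void add_locked(struct my_tw *tw, struct list_head *head,
		       spinlock_t *lock)
{
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_add_tail(&tw->list, head);
	spin_unlock_irqrestore(lock, flags);
}

/* ...so the batch splices out already in FIFO order, no reversal */
static void run_locked(struct list_head *head, spinlock_t *lock)
{
	LIST_HEAD(batch);
	struct my_tw *tw, *tmp;
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	list_splice_init(head, &batch);
	spin_unlock_irqrestore(lock, flags);

	list_for_each_entry_safe(tw, tmp, &batch, list)
		handle(tw);
}

The trade-off in the thread above falls out of this shape: the reversal
pass in run_llist() is what shows up as llist_reverse_order() in the
profiles, while the irqsave/irqrestore pair in the locked variant is
where the extra cycles attributed to __io_run_local_work() come from.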