On 3/27/24 7:33 AM, Pavel Begunkov wrote:
> On 3/26/24 18:42, Jens Axboe wrote:
>> Hi,
>>
>> This converts the deferred, normal, and fallback task_work to use a
>> normal io_wq_work_list, rather than an llist.
>>
>> The main motivation behind this is to get rid of the need to reverse
>> the list once it's deleted and run. I tested this basic conversion of
>> just switching it from an llist to an io_wq_work_list with a spinlock,
>> and I don't see any benefits from the lockless list. And for cases where
>> we get a bursty addition of task_work, this approach is faster as it
>> avoids the need to iterate the list upfront while reversing it.
>
> I'm curious how you benchmarked it including accounting of irq/softirq
> where tw add usually happens?

Performance runs and profiles. I tested send zc with small packets, as
that is task_work intensive and exhibits the bursty behavior I mentioned
in the patch / cover letter. And normal storage IO, IRQ driven.

For send zc, we're spending about 2% of the time doing list reversal,
and I've seen as high as 5% in other testing. And as that test is CPU
bound, performance is up about 2% as well.

With the patches, task_work adding accounts for about 0.25% of the
cycles, before it's about 0.66%. We're spending a bit more time in
__io_run_local_work(), but I think that's deceptive, as we have to
disable/enable interrupts now. If an interrupt triggers on the unlock,
that time tends to be attributed there in terms of cycles.

> One known problem with the current list approach I mentioned several
> times before is that it peeks at the previous queued tw to count them.
> It's not nice, but that can be easily done with cmpxchg double. I
> wonder how much of an issue is that.

That's more of a wart than a real issue, though, and this approach
obviously doesn't do that. And then we can drop the RCU section around
adding local task_work. Not a huge deal, but still nice.

>> And this is less code and simpler, so I'd prefer to go that route.
>
> I'm not sure it's less code, if you return optimisations that I
> believe were killed, see comments to patch 2, it might turn out to
> be even bulkier and not that simpler.

It's still considerably less:

 3 files changed, 59 insertions(+), 84 deletions(-)

though that isn't conclusive by itself, as e.g. io_llist_xchg() goes
away, which has some comments. But I do think the resulting code is
simpler and more straightforward.

-- 
Jens Axboe