Re: [PATCH next v1 2/2] io_uring: limit local tw done

On 11/22/24 17:08, Jens Axboe wrote:
> On 11/22/24 10:01 AM, Pavel Begunkov wrote:
>> On 11/21/24 17:05, Jens Axboe wrote:
>>> On 11/21/24 9:57 AM, Jens Axboe wrote:
>>>> I did run a basic IRQ storage test as-is, and will compare that with the
>>>> llist stuff we have now. Just in terms of overhead. It's not quite a
>>>> networking test, but you do get the IRQ side and some burstiness in
>>>> terms of completions that way too, at high rates. So should be roughly
>>>> comparable.

>>> Perf looks comparable, it's about 60M IOPS. Some fluctuation with IRQ

>> 60M with iopoll? That one normally shouldn't use task_work

> Maybe that wasn't clear, but it's IRQ driven IO. Otherwise indeed
> there'd be no task_work in use.

>>> driven, so won't render an opinion on whether one is faster than the
>>> other. What is visible though is that adding and running local task_work
>>> drops from 2.39% to 2.02% using spinlock + io_wq_work_list over llist,

>> Did you sum it up with io_req_local_work_add()? Just sounds a bit
>> weird since it's usually run off [soft]irq. I have doubts that part
>> became faster. Running could be, especially with high QD and
>> consistency of SSD. Btw, what QD was it? 32?

Why I asked about QD is because storage tests reliably give you a
list of QD task_work items; the longer the list, the more expensive
the reversal and the more cache lines it washes out.

For QD=32 it's a 32-entry list reversal, so I'd get it if you're
seeing some perf improvement. With QD=1 it would be the opposite.
With David's thing it's similar: he gets a long list because of
wait-based batching. Users who don't do that might get worse
performance (which might be fine).
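
For reference, the reversal in question is what llist_reverse_order()
in lib/llist.c does: llist_add() pushes at the head, so a consumer
that wants FIFO order has to take the whole batch and flip it,
touching every node's cache line along the way. This is the mainline
helper:

struct llist_node *llist_reverse_order(struct llist_node *head)
{
        struct llist_node *new_head = NULL;

        while (head) {
                struct llist_node *tmp = head;

                head = head->next;
                tmp->next = new_head;
                new_head = tmp;
        }
        return new_head;
}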

> It may just show up more frequently in profiling, since that's where
> the list reversal is done. Profiling isn't 100% exact.

>>> and we entirely drop 2.2% of list reversing in the process.

>> We actually discussed it before, though in a different patchset;
>> perf is not much help here, the overhead and cache loading move
>> around a lot between functions.
>>
>> I don't think we have solid proof here, especially for networking
>> workloads, which tend to hammer it harder from more CPUs. Can we run
>> some net benchmarks? Even better would be a good prod experiment.

> Already in motion. I ran some here and they didn't show any differences
> at all, but the task_work load was also fairly light. David is running
> the networking side and we'll see what it says.

That's great, if it survives high-traffic prod there should be
less need to worry about it in terms of regressions.

The eerie part is that we keep switching it back and forth,
rediscovering the same problems. Even the reordering issue was
mentioned and warned about before the wait-free list got merged, but
was successfully ignored until we got latency issues. And now we've
come full circle. It would be nice to find some peace (or something
inarguably better).


> I don't particularly love list + lock for this, but at the end of the
> day, the only real downside is the irq disabling nature of it.
> Everything else is both simpler, and avoids the really annoying LIFO
> nature of llist. I'd expect, all things being equal, that list + lock is
> going to be ever so slightly slower. Both will bounce the list
> cacheline, no difference in cost on that side. But when you add list
> reversal to the mix, that's going to push it to being an overall win.
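
To make the comparison concrete, here's a minimal sketch of the two
producer-side shapes being weighed (illustrative only, not the actual
patch: the tw_ctx struct and function names are made up, and the real
series uses io_uring's io_wq_work_list rather than a plain list_head,
though the shape is the same):

#include <linux/list.h>
#include <linux/llist.h>
#include <linux/spinlock.h>

/* Illustrative stand-in for the relevant io_ring_ctx fields. */
struct tw_ctx {
        struct llist_head work_llist;   /* lock-free variant */
        spinlock_t        work_lock;    /* spinlock variant */
        struct list_head  work_list;
};

/*
 * llist: lock-free push, safe from irq context without disabling
 * interrupts, but LIFO: the consumer pays for a full reversal to
 * process the batch in FIFO order.
 */
static void tw_add_llist(struct tw_ctx *ctx, struct llist_node *node)
{
        llist_add(node, &ctx->work_llist);
}

/*
 * list + lock: FIFO comes for free, no reversal, but the critical
 * section has to disable interrupts since adds also happen from
 * [soft]irq context.
 */
static void tw_add_locked(struct tw_ctx *ctx, struct list_head *node)
{
        unsigned long flags;

        spin_lock_irqsave(&ctx->work_lock, flags);
        list_add_tail(node, &ctx->work_list);
        spin_unlock_irqrestore(&ctx->work_lock, flags);
}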


--
Pavel Begunkov



