Re: [PATCH next v1 2/2] io_uring: limit local tw done

Jens Axboe <axboe@xxxxxxxxx> · Thu, 21 Nov 2024 09:57:19 -0700

On 11/21/24 9:43 AM, Pavel Begunkov wrote:
> On 11/21/24 16:20, Jens Axboe wrote:
>> On 11/21/24 9:18 AM, Pavel Begunkov wrote:
>>> On 11/21/24 15:22, Jens Axboe wrote:
>>>> On 11/21/24 8:15 AM, Jens Axboe wrote:
>>>>> I'd rather entertain NOT using llists for this in the first place, as it
>>>>> gets rid of the reversing which is the main cost here. That won't change
>>>>> the need for a retry list necessarily, as I think we'd be better off
>>>>> with a lockless retry list still. But at least it'd get rid of the
>>>>> reversing. Let me see if I can dig out that patch... Totally orthogonal
>>>>> to this topic, obviously.
>>>>
>>>> It's here:
>>>>
>>>> https://lore.kernel.org/io-uring/20240326184615.458820-3-axboe@xxxxxxxxx/
>>>>
>>>> I did improve it further but never posted it again, fwiw.
>>>
>>> io_req_local_work_add() needs a smp_mb() after unlock, see comments,
>>> release/unlock doesn't do it.
>>
>> Yep, current version I have adds a smp_mb__after_unlock_lock() for that.
> 
> I don't think it'd be correct. unlock_lock AFAIK is specifically
> for unlock + lock, you have lock + unlock. And data you want to
> synchronise is modified after the lock part. That'd need upgrading
> the release semantics implied by the unlock to a full barrier.
> 
> I doubt there is a good way to optimise it. I doubt it'd give you
> anything even if you replace store_release in spin_unlock with xchg()
> and ignore the return, but you can probably ask Paul.

True, will just make it an smp_mb(), should be fine.

>> Will do some quick testing, but then also try the double cmpxchg on top
>> of that if supported.

This is a bit trickier, as we're not just updating the list first/last
entries... Not sure I see a good way to do this. Maybe I'm missing
something.

I did run a basic IRQ storage test as-is, and will compare that with the
llist stuff we have now. Just in terms of overhead. It's not quite a
networking test, but you do get the IRQ side and some burstiness in
terms of completions that way too, at high rates. So should be roughly
comparable.

-- 
Jens Axboe