On 11/21/24 7:25 AM, Pavel Begunkov wrote:
> On 11/21/24 01:12, Jens Axboe wrote:
>> On 11/20/24 4:56 PM, Pavel Begunkov wrote:
>>> On 11/20/24 22:14, David Wei wrote:
> ...
>>> One thing that is not so nice is that now we have this handling and
>>> checks in the hot path, and __io_run_local_work_loop() most likely
>>> gets uninlined.
>>
>> I don't think that really matters, it's pretty light. The main overhead
>> in this function is not the call, it's reordering requests and touching
>> cachelines of the requests.
>>
>> I think it's pretty light as-is and actually looks pretty good. It's
>
> It could be light, but the question is importance / frequency of
> the new path. If it only happens rarely but affects a high 9,
> then it'd make more sense to optimise it out of the common path.

I'm more worried about the outlier cases. We don't generally expect this
to trigger very often, obviously; if long chains of task_work were the
norm, then we'd have other reports/issues related to that. But the common
overhead here is really just checking whether another (same cacheline)
pointer is non-NULL, and ditto on the run side. I really don't think
that's anything to worry about.

>> also similar to how sqpoll bites over longer task_work lines, and
>> arguably a mistake that we allow huge depths of this when we can avoid
>> it with deferred task_work.
>>
>>> I wonder, can we just requeue it via task_work again? We can even
>>> add a variant efficiently adding a list instead of a single entry,
>>> i.e. local_task_work_add(head, tail, ...);
>>
>> I think that can only work if we change work_llist to be a regular list
>> with regular locking. Otherwise it's a bit of a mess with the list being
>
> Dylan once measured the overhead of locks vs atomics in this
> path for some artificial case; we can pull the numbers up.

I did that more recently, if you'll remember; I actually posted a patch
changing it to that, I think a few months ago. But even that approach
adds extra overhead: if you want to add the work back to the same list
as now, you need to re-grab the lock (and re-disable interrupts) to do
so. My gut says that would be _worse_ than the current approach. And if
you keep a separate list instead, then you're back to identical overhead:
you need to check both lists to know whether anything is pending, and
check both when running them.

>> reordered, and then you're spending extra cycles on potentially
>> reordering all the entries again.
>
> That sucks, I agree, but then it's the same question of how often
> it happens.

At least for now, there's a real issue reported and we should fix it. I
think the current patches are fine in that regard. That doesn't mean we
can't potentially make it better; we should certainly investigate that.
But I don't see the current patches as being suboptimal, really; they
are definitely good enough as-is for solving the issue.
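
To be clear about what I mean by "another pointer in the same cacheline",
here's a rough, single-threaded userspace sketch of the shape of it. The
names (local_ctx, retry_list, run_capped, and so on) are made up for
illustration, and the atomics/locking of the real work_llist are left
out, so don't read it as the actual io_uring code:

/* Illustrative sketch only: names/layout invented, not io_uring code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct work_node {
	struct work_node *next;
};

/*
 * Both heads sit next to each other, so the extra check lands in a
 * cacheline we're already touching when we look at pending work.
 */
struct local_ctx {
	struct work_node *work_list;	/* normal deferred task_work */
	struct work_node *retry_list;	/* leftovers from a capped run */
};

static bool has_pending(const struct local_ctx *ctx)
{
	/* the "anything to do?" side: one extra pointer test */
	return ctx->work_list || ctx->retry_list;
}

/* stand-in for running a single request's task_work */
static void run_one(struct work_node *node)
{
	(void)node;
}

/* run at most *budget items from 'list', return whatever is left over */
static struct work_node *run_capped(struct work_node *list, int *budget)
{
	while (list && (*budget)-- > 0) {
		struct work_node *next = list->next;

		run_one(list);
		list = next;
	}
	return list;
}

static void run_local_work(struct local_ctx *ctx, int budget)
{
	struct work_node *rest;

	/* the run side: the retry path is only taken if non-NULL */
	if (ctx->retry_list) {
		rest = run_capped(ctx->retry_list, &budget);
		ctx->retry_list = rest;
		if (rest)
			return;
	}

	rest = ctx->work_list;
	ctx->work_list = NULL;
	ctx->retry_list = run_capped(rest, &budget);
}

int main(void)
{
	struct work_node nodes[4] = {
		{ &nodes[1] }, { &nodes[2] }, { &nodes[3] }, { NULL },
	};
	struct local_ctx ctx = { .work_list = &nodes[0] };

	run_local_work(&ctx, 2);	/* runs 2, parks 2 on retry_list */
	printf("pending after capped run: %d\n", has_pending(&ctx));
	run_local_work(&ctx, 2);	/* drains the rest */
	printf("pending after second run: %d\n", has_pending(&ctx));
	return 0;
}

The point is just that both sides only pay one extra pointer test in the
common case, and the requeue handling only runs when that pointer is
non-NULL.

-- 
Jens Axboe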