On 9/24/19 12:28 PM, Pavel Begunkov wrote:
> On 24/09/2019 20:46, Jens Axboe wrote:
>> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 16:13, Jens Axboe wrote:
>>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>>
>>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>>> worked for you.
>>>>>>
>>>>> Not yet, going to do that tonight
>>>>
>>>> Thanks! For reference, the final version is below. There was still a
>>>> signal mishap in there, now it should all be correct afaict.
>>>>
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 9b84232e5cc4..d2a86164d520 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>>>  	return submit;
>>>>  }
>>>>  
>>>> +struct io_wait_queue {
>>>> +	struct wait_queue_entry wq;
>>>> +	struct io_ring_ctx *ctx;
>>>> +	unsigned to_wait;
>>>> +	unsigned nr_timeouts;
>>>> +};
>>>> +
>>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>>> +{
>>>> +	struct io_ring_ctx *ctx = iowq->ctx;
>>>> +
>>>> +	/*
>>>> +	 * Wake up if we have enough events, or if a timeout occurred since we
>>>> +	 * started waiting. For timeouts, we always want to return to userspace,
>>>> +	 * regardless of event count.
>>>> +	 */
>>>> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>>> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>>> +}
>>>> +
>>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>>> +			    int wake_flags, void *key)
>>>> +{
>>>> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>>> +							wq);
>>>> +
>>>> +	if (!io_should_wake(iowq))
>>>> +		return -1;
>>>
>>> It would try to schedule only the first task in the wait list. Is that the
>>> semantic you want?
>>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>>> after @32, and won't wake up the second one.
>>
>> Right, those are the semantics I want. We keep the list ordered by using
>> the exclusive wait addition. Which means that for the case you list,
>> waiters=32 came first, and we should not wake others before that task
>> gets the completions it wants. Otherwise we could potentially starve
>> higher count waiters, if we always keep going and new waiters come in.
>>
> Yes. I think it would be better to document this in the userspace API. I
> could imagine some crazy case deadlocking userspace. E.g.
> thread 1: wait_events(8), reap_events
> thread 2: wait_events(32), wait(thread 1), reap_events

No matter how you handle cases like this, deadlocks will always be
possible... So I don't think that's a huge concern. It's more important
not to have intentional livelocks, which we would have if we always
allowed the lowest wait count to get woken and steal the budget every
time.

> works well
> Reviewed-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> Tested-by: Pavel Begunkov <asml.silence@xxxxxxxxx>

Thanks, will add!

> BTW, I searched for wait_event*(), and it seems there are plenty of
> similar use cases. So, a generic version would be useful, but this is
> for later.

Agree, it would undoubtedly be useful.

-- 
Jens Axboe
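
[Editor's note: for context, below is a minimal sketch of the waiter side
that pairs with io_wake_function() from the patch quoted above. It assumes
the entry is added with prepare_to_wait_exclusive(), which keeps the wait
list FIFO-ordered, and relies on the fact that a negative return from the
wake function stops __wake_up_common() from scanning further down the list,
which is the "only the first eligible waiter" behavior discussed in the
thread. The sigmask handling from the real patch is omitted, and the exact
initialization and error paths here are illustrative, not the final code.]

static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
			  const sigset_t __user *sig, size_t sigsz)
{
	struct io_wait_queue iowq = {
		/* wake up via io_wake_function(), not the default callback */
		.wq = {
			.private	= current,
			.func		= io_wake_function,
			.entry		= LIST_HEAD_INIT(iowq.wq.entry),
		},
		.ctx		= ctx,
		.to_wait	= min_events,
	};
	int ret = 0;

	/* fast path: enough completions are already posted */
	if (io_cqring_events(ctx->rings) >= min_events)
		return 0;

	/* sigmask handling (sig/sigsz) omitted from this sketch */

	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
	do {
		/*
		 * Exclusive addition keeps the wait list FIFO-ordered, so a
		 * later waiter is not woken before an earlier one has seen
		 * the completion count it asked for.
		 */
		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
						TASK_INTERRUPTIBLE);
		if (io_should_wake(&iowq))
			break;
		schedule();
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}
	} while (1);
	finish_wait(&ctx->wait, &iowq.wq);

	return ret;
}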