On 9/24/19 12:28 PM, Pavel Begunkov wrote:
> On 24/09/2019 20:46, Jens Axboe wrote:
>> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 16:13, Jens Axboe wrote:
>>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>>
>>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>>> worked for you.
>>>>>>
>>>>> Not yet, going to do that tonight
>>>>
>>>> Thanks! For reference, the final version is below. There was still a
>>>> signal mishap in there, now it should all be correct afaict.
>>>>
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 9b84232e5cc4..d2a86164d520 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>>>  	return submit;
>>>>  }
>>>>  
>>>> +struct io_wait_queue {
>>>> +	struct wait_queue_entry wq;
>>>> +	struct io_ring_ctx *ctx;
>>>> +	unsigned to_wait;
>>>> +	unsigned nr_timeouts;
>>>> +};
>>>> +
>>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>>> +{
>>>> +	struct io_ring_ctx *ctx = iowq->ctx;
>>>> +
>>>> +	/*
>>>> +	 * Wake up if we have enough events, or if a timeout occurred since we
>>>> +	 * started waiting. For timeouts, we always want to return to userspace,
>>>> +	 * regardless of event count.
>>>> +	 */
>>>> +	return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>>> +			atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>>> +}
>>>> +
>>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>>> +			    int wake_flags, void *key)
>>>> +{
>>>> +	struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>>> +							wq);
>>>> +
>>>> +	if (!io_should_wake(iowq))
>>>> +		return -1;
>>>
>>> It would try to schedule only the first task in the wait list. Is that the
>>> semantic you want?
>>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>>> after @32, and won't wake up the second one.
>>
>> Right, those are the semantics I want. We keep the list ordered by using
>> the exclusive wait addition. Which means that for the case you list,
>> waiters=32 came first, and we should not wake others before that task
>> gets the completions it wants. Otherwise we could potentially starve
>> higher count waiters, if we always keep going and new waiters come in.
>>
> Yes. I think it would be better to document this in the userspace API. I
> could imagine some crazy case deadlocking userspace. E.g.
> thread 1: wait_events(8), reap_events
> thread 2: wait_events(32), wait(thread 1), reap_events

No matter how you handle cases like this, deadlocks will always be
possible... So I don't think that's a huge concern. It's more important
not to have intentional livelocks, which we would have if we always
allowed the lowest wait count to get woken and steal the budget every
time.

> works well
> Reviewed-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> Tested-by: Pavel Begunkov <asml.silence@xxxxxxxxx>

Thanks, will add!

> BTW, I searched for wait_event*(), and it seems there are plenty of
> similar use cases. So, a generic version would be useful, but this is
> for later.

Agree, it would undoubtedly be useful.

-- 
Jens Axboe
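
[Editor's note: for context, below is a minimal sketch of the waiter side
that pairs with io_wake_function() from the patch quoted above. It assumes
the entry is added with prepare_to_wait_exclusive(), which keeps the wait
list FIFO-ordered, and relies on the fact that a negative return from the
wake function stops __wake_up_common() from scanning further down the list,
which is the "only the first eligible waiter" behavior discussed in the
thread. The sigmask handling from the real patch is omitted, and the exact
initialization and error paths here are illustrative, not the final code.]

static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
			  const sigset_t __user *sig, size_t sigsz)
{
	struct io_wait_queue iowq = {
		/* wake up via io_wake_function(), not the default callback */
		.wq = {
			.private	= current,
			.func		= io_wake_function,
			.entry		= LIST_HEAD_INIT(iowq.wq.entry),
		},
		.ctx		= ctx,
		.to_wait	= min_events,
	};
	int ret = 0;

	/* fast path: enough completions are already posted */
	if (io_cqring_events(ctx->rings) >= min_events)
		return 0;

	/* sigmask handling (sig/sigsz) omitted from this sketch */

	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
	do {
		/*
		 * Exclusive addition keeps the wait list FIFO-ordered, so a
		 * later waiter is not woken before an earlier one has seen
		 * the completion count it asked for.
		 */
		prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
						TASK_INTERRUPTIBLE);
		if (io_should_wake(&iowq))
			break;
		schedule();
		if (signal_pending(current)) {
			ret = -EINTR;
			break;
		}
	} while (1);
	finish_wait(&ctx->wait, &iowq.wq);

	return ret;
}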