On 8/20/24 4:47 PM, Pavel Begunkov wrote:
> On 8/20/24 23:46, Pavel Begunkov wrote:
>> On 8/20/24 00:28, Jens Axboe wrote:
>>> Waiting for events with io_uring has two knobs that can be set:
>>>
>>> 1) The number of events to wake for
>>> 2) The timeout associated with the event
>>>
>>> Waiting will abort when either of those conditions are met, as expected.
>>>
>>> This adds support for a third event, which is associated with the number
>>> of events to wait for. Applications generally like to handle batches of
>>> completions, and right now they'd set a number of events to wait for and
>>> the timeout for that. If no events have been received but the timeout
>>> triggers, control is returned to the application and it can wait again.
>>> However, if the application doesn't have anything to do until events are
>>> reaped, then it's possible to make this waiting more efficient.
>>>
>>> For example, the application may have a latency time of 50 usecs and
>>> wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
>>> as the timeout, then it'll be doing 20K context switches per second even
>>> if nothing is happening.
>>>
>>> This introduces the notion of min batch wait time. If the min batch wait
>>> time expires, then we'll return to userspace if we have any events at all.
>>> If none are available, the general wait time is applied. Any request
>>> arriving after the min batch wait time will cause waiting to stop and
>>> return control to the application.
>>>
>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>> ---
>>>  io_uring/io_uring.c | 75 +++++++++++++++++++++++++++++++++++++++------
>>>  io_uring/io_uring.h | 2 ++
>>>  2 files changed, 67 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
>>> index ddfbe04c61ed..d09a7c2e1096 100644
>>> --- a/io_uring/io_uring.c
>>> +++ b/io_uring/io_uring.c
>>> @@ -2363,13 +2363,62 @@ static enum hrtimer_restart io_cqring_timer_wakeup(struct hrtimer *timer)
>>>  	return HRTIMER_NORESTART;
>>>  }
>>> +/*
>>> + * Doing min_timeout portion. If we saw any timeouts, events, or have work,
>>> + * wake up. If not, and we have a normal timeout, switch to that and keep
>>> + * sleeping.
>>> + */
>>> +static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
>>> +{
>>> +	struct io_wait_queue *iowq = container_of(timer, struct io_wait_queue, t);
>>> +	struct io_ring_ctx *ctx = iowq->ctx;
>>> +
>>> +	/* no general timeout, or shorter, we are done */
>>> +	if (iowq->timeout == KTIME_MAX ||
>>> +	    ktime_after(iowq->min_timeout, iowq->timeout))
>>> +		goto out_wake;
>>> +	/* work we may need to run, wake function will see if we need to wake */
>>> +	if (io_has_work(ctx))
>>> +		goto out_wake;
>>> +	/* got events since we started waiting, min timeout is done */
>>> +	if (iowq->cq_min_tail != READ_ONCE(ctx->rings->cq.tail))
>>> +		goto out_wake;
>>> +	/* if we have any events and min timeout expired, we're done */
>>> +	if (io_cqring_events(ctx))
>>> +		goto out_wake;
>>> +
>>> +	/*
>>> +	 * If using deferred task_work running and application is waiting on
>>> +	 * more than one request, ensure we reset it now where we are switching
>>> +	 * to normal sleeps. Any request completion post min_wait should wake
>>> +	 * the task and return.
>>> +	 */
>>> +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
>>> +		atomic_set(&ctx->cq_wait_nr, 1);
>>
>> racy
>>
>> atomic_set(&ctx->cq_wait_nr, 1);
>> smp_mb();
>> if (llist_empty(&ctx->work_llist))
>>     // wake;
>
> rather if _not_ empty

Yep that one was a given :-) Updated it, we'll punt to out_wake at that
point.

-- 
Jens Axboe
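
For anyone following along, the ordering Pavel is asking for can be sketched
in userspace with C11 atomics. This is only an illustration, not the kernel
code: struct demo_ctx, demo_switch_to_normal_wait() and demo_queue_work() are
hypothetical names, the fence stands in for smp_mb(), a single pointer stands
in for ctx->work_llist, and the producer side is an assumed simplification of
whatever adds deferred task_work. The point is that each side stores first,
issues a full fence, then loads what the other side stores, so at least one
of them should observe the other's update and the wakeup is not lost.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for ctx->cq_wait_nr and ctx->work_llist. */
struct demo_ctx {
	atomic_int cq_wait_nr;      /* queued items needed before a wakeup */
	_Atomic(void *) work_head;  /* NULL means the work list is empty */
};

/*
 * Timer side: drop the wake threshold to 1, then re-check for work that
 * was queued while the old threshold was still visible. Returns true if
 * the caller should wake the waiter right away (the "not empty" case).
 */
static bool demo_switch_to_normal_wait(struct demo_ctx *ctx)
{
	atomic_store_explicit(&ctx->cq_wait_nr, 1, memory_order_relaxed);
	/* Full fence, playing the role of smp_mb() in the review comment. */
	atomic_thread_fence(memory_order_seq_cst);
	return atomic_load_explicit(&ctx->work_head,
				    memory_order_relaxed) != NULL;
}

/*
 * Producer side (assumed): publish the work item, then read the
 * threshold. Returns true if the producer should wake the waiter.
 */
static bool demo_queue_work(struct demo_ctx *ctx, void *item, int nr_queued)
{
	assert(item != NULL);	/* NULL is reserved for "empty" */
	atomic_store_explicit(&ctx->work_head, item, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);
	return nr_queued >= atomic_load_explicit(&ctx->cq_wait_nr,
						 memory_order_relaxed);
}
```

Without the fence on the timer side, the threshold store and the list check
could be reordered, so a producer might still see the old batch count and
skip the wakeup while the timer sees an empty list and goes back to sleep.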