On 2024-08-21 07:16, Jens Axboe wrote:
> Waiting for events with io_uring has two knobs that can be set:
>
> 1) The number of events to wake for
> 2) The timeout associated with the event
>
> Waiting will abort when either of those conditions are met, as expected.
>
> This adds support for a third event, which is associated with the number
> of events to wait for. Applications generally like to handle batches of
> completions, and right now they'd set a number of events to wait for and
> the timeout for that. If no events have been received but the timeout
> triggers, control is returned to the application and it can wait again.
> However, if the application doesn't have anything to do until events are
> reaped, then it's possible to make this waiting more efficient.
>
> For example, the application may have a latency time of 50 usecs and
> wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
> as the timeout, then it'll be doing 20K context switches per second even
> if nothing is happening.
>
> This introduces the notion of min batch wait time. If the min batch wait
> time expires, then we'll return to userspace if we have any events at all.
> If none are available, the general wait time is applied. Any request
> arriving after the min batch wait time will cause waiting to stop and
> return control to the application.
>
> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
> ---
>  io_uring/io_uring.c | 88 ++++++++++++++++++++++++++++++++++++++-------
>  io_uring/io_uring.h |  2 ++
>  2 files changed, 77 insertions(+), 13 deletions(-)
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 4ba5292137c3..87e7cf6551d7 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -2322,7 +2322,8 @@ static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>  	 * Cannot safely flush overflowed CQEs from here, ensure we wake up
>  	 * the task, and the next invocation will do it.
>  	 */
> -	if (io_should_wake(iowq) || io_has_work(iowq->ctx) || iowq->hit_timeout)
> +	if (io_should_wake(iowq) || io_has_work(iowq->ctx) ||
> +	    READ_ONCE(iowq->hit_timeout))
>  		return autoremove_wake_function(curr, mode, wake_flags, key);
>  	return -1;
>  }
> @@ -2359,13 +2360,66 @@ static enum hrtimer_restart io_cqring_timer_wakeup(struct hrtimer *timer)
>  	return HRTIMER_NORESTART;
>  }
>
> +/*
> + * Doing min_timeout portion. If we saw any timeouts, events, or have work,
> + * wake up. If not, and we have a normal timeout, switch to that and keep
> + * sleeping.
> + */
> +static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
> +{
> +	struct io_wait_queue *iowq = container_of(timer, struct io_wait_queue, t);
> +	struct io_ring_ctx *ctx = iowq->ctx;
> +
> +	/* no general timeout, or shorter, we are done */
> +	if (iowq->timeout == KTIME_MAX ||
> +	    ktime_after(iowq->min_timeout, iowq->timeout))
> +		goto out_wake;
> +	/* work we may need to run, wake function will see if we need to wake */
> +	if (io_has_work(ctx))
> +		goto out_wake;
> +	/* got events since we started waiting, min timeout is done */
> +	if (iowq->cq_min_tail != READ_ONCE(ctx->rings->cq.tail))
> +		goto out_wake;
> +	/* if we have any events and min timeout expired, we're done */
> +	if (io_cqring_events(ctx))
> +		goto out_wake;
> +
> +	/*
> +	 * If using deferred task_work running and application is waiting on
> +	 * more than one request, ensure we reset it now where we are switching
> +	 * to normal sleeps. Any request completion post min_wait should wake
> +	 * the task and return.
> +	 */
> +	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> +		atomic_set(&ctx->cq_wait_nr, 1);
> +		smp_mb();
> +		if (!llist_empty(&ctx->work_llist))
> +			goto out_wake;
> +	}
> +
> +	iowq->t.function = io_cqring_timer_wakeup;
> +	hrtimer_set_expires(timer, iowq->timeout);

What happens if timeout < min_timeout? Would the expiry callback
io_cqring_timer_wakeup() be called right away?
> +	return HRTIMER_RESTART;
> +out_wake:
> +	return io_cqring_timer_wakeup(timer);
> +}
> +
>  static int io_cqring_schedule_timeout(struct io_wait_queue *iowq,
> -				      clockid_t clock_id)
> +				      clockid_t clock_id, ktime_t start_time)
>  {
> -	iowq->hit_timeout = 0;
> +	ktime_t timeout;
> +
> +	WRITE_ONCE(iowq->hit_timeout, 0);
>  	hrtimer_init_on_stack(&iowq->t, clock_id, HRTIMER_MODE_ABS);
> -	iowq->t.function = io_cqring_timer_wakeup;
> -	hrtimer_set_expires_range_ns(&iowq->t, iowq->timeout, 0);
> +	if (iowq->min_timeout) {
> +		timeout = ktime_add_ns(iowq->min_timeout, start_time);
> +		iowq->t.function = io_cqring_min_timer_wakeup;
> +	} else {
> +		timeout = iowq->timeout;
> +		iowq->t.function = io_cqring_timer_wakeup;
> +	}
> +
> +	hrtimer_set_expires_range_ns(&iowq->t, timeout, 0);
>  	hrtimer_start_expires(&iowq->t, HRTIMER_MODE_ABS);
>
>  	if (!READ_ONCE(iowq->hit_timeout))
> @@ -2379,7 +2433,8 @@ static int io_cqring_schedule_timeout(struct io_wait_queue *iowq,
>  }
>
>  static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> -				     struct io_wait_queue *iowq)
> +				     struct io_wait_queue *iowq,
> +				     ktime_t start_time)
>  {
>  	int ret = 0;
>
> @@ -2390,8 +2445,8 @@ static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>  	 */
>  	if (current_pending_io())
>  		current->in_iowait = 1;
> -	if (iowq->timeout != KTIME_MAX)
> -		ret = io_cqring_schedule_timeout(iowq, ctx->clockid);
> +	if (iowq->timeout != KTIME_MAX || iowq->min_timeout != KTIME_MAX)
> +		ret = io_cqring_schedule_timeout(iowq, ctx->clockid, start_time);

With this change it is possible to schedule a timer even when one of
timeout or min_timeout is KTIME_MAX.

If min_timeout != KTIME_MAX and timeout == KTIME_MAX, then
io_cqring_min_timer_wakeup() will rearm itself as a timer with KTIME_MAX.
If min_timeout == KTIME_MAX and timeout != KTIME_MAX, then a KTIME_MAX
timer will be set.

This should be fine: the timer will never expire, and schedule() is called
regardless.
The previous code was a small optimisation to avoid setting up a timer that
will never expire.

>  	else
>  		schedule();
>  	current->in_iowait = 0;
> @@ -2400,7 +2455,8 @@ static int __io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>
>  /* If this returns > 0, the caller should retry */
>  static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
> -					  struct io_wait_queue *iowq)
> +					  struct io_wait_queue *iowq,
> +					  ktime_t start_time)
>  {
>  	if (unlikely(READ_ONCE(ctx->check_cq)))
>  		return 1;
> @@ -2413,7 +2469,7 @@ static inline int io_cqring_wait_schedule(struct io_ring_ctx *ctx,
>  	if (unlikely(io_should_wake(iowq)))
>  		return 0;
>
> -	return __io_cqring_wait_schedule(ctx, iowq);
> +	return __io_cqring_wait_schedule(ctx, iowq, start_time);
>  }
>
>  struct ext_arg {
> @@ -2431,6 +2487,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>  {
>  	struct io_wait_queue iowq;
>  	struct io_rings *rings = ctx->rings;
> +	ktime_t start_time;
>  	int ret;
>
>  	if (!io_allowed_run_tw(ctx))
> @@ -2449,8 +2506,11 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>  	INIT_LIST_HEAD(&iowq.wq.entry);
>  	iowq.ctx = ctx;
>  	iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> +	iowq.cq_min_tail = READ_ONCE(ctx->rings->cq.tail);
>  	iowq.cq_tail = READ_ONCE(ctx->rings->cq.head) + min_events;
> +	iowq.min_timeout = 0;
>  	iowq.timeout = KTIME_MAX;
> +	start_time = io_get_time(ctx);
>
>  	if (ext_arg->ts) {
>  		struct timespec64 ts;
> @@ -2460,7 +2520,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>
>  		iowq.timeout = timespec64_to_ktime(ts);
>  		if (!(flags & IORING_ENTER_ABS_TIMER))
> -			iowq.timeout = ktime_add(iowq.timeout, io_get_time(ctx));
> +			iowq.timeout = ktime_add(iowq.timeout, start_time);
>  	}
>
>  	if (ext_arg->sig) {
> @@ -2484,14 +2544,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>  		unsigned long check_cq;
>
>  		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> -
> -			atomic_set(&ctx->cq_wait_nr, nr_wait);
> +			/* if min timeout has been hit, don't reset wait count */
> +			if (!READ_ONCE(iowq.hit_timeout))
> +				atomic_set(&ctx->cq_wait_nr, nr_wait);

Only the two timer expiry callbacks, io_cqring_min_timer_wakeup() and
io_cqring_timer_wakeup(), set hit_timeout to 1. When that happens,
io_cqring_schedule_timeout() returns -ETIME and the do {...} while (1)
loop in io_cqring_wait() breaks, so I'm not sure it is possible to reach
here with hit_timeout == 1.

Also, in the first iteration of the loop, hit_timeout is only initialised
to 0 inside io_cqring_wait_schedule() -> __io_cqring_wait_schedule() ->
io_cqring_schedule_timeout(), so it is possible for hit_timeout to be read
(via READ_ONCE()) before it is initialised. If this code is kept, we should
initialise iowq.hit_timeout = 0 above.

>  			set_current_state(TASK_INTERRUPTIBLE);
>  		} else {
>  			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
>  						  TASK_INTERRUPTIBLE);
>  		}
>
> -		ret = io_cqring_wait_schedule(ctx, &iowq);
> +		ret = io_cqring_wait_schedule(ctx, &iowq, start_time);
>  		__set_current_state(TASK_RUNNING);
>  		atomic_set(&ctx->cq_wait_nr, IO_CQ_WAKE_INIT);
>
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index f95c1b080f4b..65078e641390 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -39,8 +39,10 @@ struct io_wait_queue {
>  	struct wait_queue_entry wq;
>  	struct io_ring_ctx *ctx;
>  	unsigned cq_tail;
> +	unsigned cq_min_tail;
>  	unsigned nr_timeouts;
>  	int hit_timeout;
> +	ktime_t min_timeout;
>  	ktime_t timeout;
>  	struct hrtimer t;
>