On 2024-08-20 14:31, Jens Axboe wrote:
> On 8/20/24 3:10 PM, David Wei wrote:
>> On 2024-08-19 16:28, Jens Axboe wrote:
>>> Waiting for events with io_uring has two knobs that can be set:
>>>
>>> 1) The number of events to wake for
>>> 2) The timeout associated with the event
>>>
>>> Waiting will abort when either of those conditions are met, as expected.
>>>
>>> This adds support for a third event, which is associated with the number
>>> of events to wait for. Applications generally like to handle batches of
>>> completions, and right now they'd set a number of events to wait for and
>>> the timeout for that. If no events have been received but the timeout
>>> triggers, control is returned to the application and it can wait again.
>>> However, if the application doesn't have anything to do until events are
>>> reaped, then it's possible to make this waiting more efficient.
>>>
>>> For example, the application may have a latency time of 50 usecs and
>>> wanting to handle a batch of 8 requests at the time. If it uses 50 usecs
>>> as the timeout, then it'll be doing 20K context switches per second even
>>> if nothing is happening.
>>>
>>> This introduces the notion of min batch wait time. If the min batch wait
>>> time expires, then we'll return to userspace if we have any events at all.
>>> If none are available, the general wait time is applied. Any request
>>> arriving after the min batch wait time will cause waiting to stop and
>>> return control to the application.
>>
>> I think the batch request count should be applied to the min_timeout,
>> such that:
>>
>> start_time           min_timeout          timeout
>> |--------------------|--------------------|
>>
>> Return to user between [start_time, min_timeout) if there are wait_nr
>> number of completions, checked by io_req_local_work_add(), or is it
>> io_wake_function()?
>
> Right, if we get the batch fulfilled, we should ALWAYS return.
>
> If we have any events and min_timeout expires, return.
>
> If not, sleep the full timeout.
>
>> Return to user between [min_timeout, timeout) if there are at least one
>> completion.
>
> Yes
>
>> Return to user at timeout always.
>
> Yes
>
> This should be how it works, and how I described it in the commit
> message.
>

You're right, thanks. With DEFER_TASKRUN, the wakeup either happens in
the timer expired callback io_cqring_min_timer_wakeup(), or in
io_req_local_work_add(). In both cases control returns to after
schedule() in io_cqring_schedule_timeout() and the timer is cancelled.

Is it possible for the two to race at all?
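
For reference, the return rules agreed on above can be written down as a
small standalone decision function. This is only a sketch for reasoning
about the semantics, not the kernel implementation: should_return(),
struct wait_state and all of its fields are hypothetical names, and times
are assumed to be plain microsecond offsets from the start of the wait.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of a waiter; none of these names exist in the kernel. */
struct wait_state {
	uint32_t nr_events;      /* completions currently available */
	uint32_t wait_nr;        /* batch size the application asked for */
	uint64_t elapsed_us;     /* time since the wait started */
	uint64_t min_timeout_us; /* min batch wait time */
	uint64_t timeout_us;     /* overall timeout */
};

/* Returns true if control should go back to userspace. */
static bool should_return(const struct wait_state *ws)
{
	/* Batch fulfilled: always return, regardless of elapsed time. */
	if (ws->nr_events >= ws->wait_nr)
		return true;

	/* Overall timeout reached: always return. */
	if (ws->elapsed_us >= ws->timeout_us)
		return true;

	/* In [min_timeout, timeout): return if there is at least one event. */
	if (ws->elapsed_us >= ws->min_timeout_us && ws->nr_events > 0)
		return true;

	/* Otherwise keep sleeping. */
	return false;
}
```

With made-up values of wait_nr = 8, min_timeout = 50 usecs and timeout =
1000 usecs, 3 completions at 20 usecs keep the task asleep, while the
same 3 completions at 60 usecs return to userspace.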