On 2/22/20 11:54 AM, Jens Axboe wrote:
> On 2/22/20 11:49 AM, Jens Axboe wrote:
>> On 2/22/20 8:09 AM, Jens Axboe wrote:
>>> On 2/22/20 7:41 AM, Jens Axboe wrote:
>>>> On 2/21/20 4:00 PM, Jann Horn wrote:
>>>>> On Fri, Feb 21, 2020 at 11:56 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>>>>>> On Fri, Feb 21, 2020 at 10:46 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>>>>> the poll to execute immediately after the file is marked as ready.
>>>>>>> Since the poll completion is called inside the waitqueue wake up
>>>>>>> handler, we have to punt that linked request to async context. This
>>>>>>> slows down the processing, and actually means it's faster to not use
>>>>>>> a link for this use case.
>>>>>>>
>>>>>>> We also run into problems if the completion_lock is contended, as
>>>>>>> we're doing a different lock ordering than the issue side is. Hence
>>>>>>> we have to do trylock for completion, and if that fails, go async.
>>>>>>> Poll removal needs to go async as well, for the same reason.
>>>>>>>
>>>>>>> eventfd notification needs special casing as well, to avoid stack
>>>>>>> blowing recursion or deadlocks.
>>>>>>>
>>>>>>> These are all deficiencies that were inherited from the aio poll
>>>>>>> implementation, but I think we can do better. When a poll completes,
>>>>>>> simply queue it up in the task poll list. When the task completes the
>>>>>>> list, we can run dependent links inline as well. This means we never
>>>>>>> have to go async, and we can remove a bunch of code associated with
>>>>>>> that, and optimizations to try and make that run faster. The diffstat
>>>>>>> speaks for itself.
>>>>>> [...]
>>>>>>> @@ -3637,8 +3587,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>>>>  {
>>>>>>>         struct io_kiocb *req = wait->private;
>>>>>>>         struct io_poll_iocb *poll = &req->poll;
>>>>>>> -       struct io_ring_ctx *ctx = req->ctx;
>>>>>>>         __poll_t mask = key_to_poll(key);
>>>>>>> +       struct task_struct *tsk;
>>>>>>>
>>>>>>>         /* for instances that support it check for an event match first: */
>>>>>>>         if (mask && !(mask & poll->events))
>>>>>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>>>>
>>>>>>>         list_del_init(&poll->wait.entry);
>>>>>>>
>>>>>> [...]
>>>>>>> +       tsk = req->task;
>>>>>>> +       req->result = mask;
>>>>>>> +       init_task_work(&req->task_work, io_poll_task_func);
>>>>>>> +       task_work_add(tsk, &req->task_work, true);
>>>>>>> +       wake_up_process(tsk);
>>>>>>>         return 1;
>>>>>>>  }
>>>>>>
>>>>>> Let's say userspace has some code like this:
>>>>>>
>>>>>> [prepare two uring requests: one POLL and a RECVMSG linked behind it]
>>>>>> // submit requests
>>>>>> io_uring_enter(uring_fd, 2, 0, 0, NULL, 0);
>>>>>> // wait for something to happen, either a completion event from uring
>>>>>> // or input from stdin
>>>>>> struct pollfd fds[] = {
>>>>>>     { .fd = 0, .events = POLLIN },
>>>>>>     { .fd = uring_fd, .events = POLLIN }
>>>>>> };
>>>>>> while (1) {
>>>>>>     poll(fds, 2, -1);
>>>>>>     if (fds[0].revents) {
>>>>>>         [read stuff from stdin]
>>>>>>     }
>>>>>>     if (fds[1].revents) {
>>>>>>         [fetch completions from shared memory]
>>>>>>     }
>>>>>> }
>>>>>>
>>>>>> If userspace has reached the poll() by the time the uring POLL op
>>>>>> completes, I think you'll wake up the do_poll() loop while it is in
>>>>>> poll_schedule_timeout(); then it will do another iteration, see that
>>>>>> no signals are pending and none of the polled files have become ready,
>>>>>> and go to sleep again. So things are stuck until the io_uring fd
>>>>>> signals that it is ready.
>>>>>>
>>>>>> The options I see are:
>>>>>>
>>>>>> - Tell the kernel to go through signal delivery code, which I think
>>>>>>   will cause the pending syscall to actually abort and return to
>>>>>>   userspace (which I think is kinda gross). You could maybe add a
>>>>>>   special case where that doesn't happen if the task is already in
>>>>>>   io_uring_enter() and waiting for CQ events.
>>>>>> - Forbid eventfd notifications, ensure that the ring's ->poll handler
>>>>>>   reports POLLIN when work items are pending for userspace, and then
>>>>>>   rely on the fact that those work items will be picked up when
>>>>>>   returning from the poll syscall. Unfortunately, this gets a bit
>>>>>>   messy when you're dealing with multiple threads that access the same
>>>>>>   ring, since then you'd have to ensure that *any* thread can pick up
>>>>>>   this work, and that that doesn't mismatch how the uring instance is
>>>>>>   shared between threads; but you could probably engineer your way
>>>>>>   around this. For userspace, this whole thing just means "POLLIN may
>>>>>>   be spurious".
>>>>>> - Like the previous item, except you tell userspace that if it gets
>>>>>>   POLLIN (or some special poll status like POLLRDBAND) and sees
>>>>>>   nothing in the completion queue, it should call io_uring_enter() to
>>>>>>   process the work. This addresses the submitter-is-not-completion-
>>>>>>   reaper scenario without having to add some weird version of
>>>>>>   task_work that will be processed by the first thread, but you'd get
>>>>>>   some extra syscalls.
>>>>>
>>>>> ... or I suppose you could punt to worker context if anyone uses the
>>>>> ring's ->poll handler or has an eventfd registered, if you don't
>>>>> expect high-performance users to do those things.
>>>>
>>>> Good points, thanks Jann. We have some precedent in the area of
>>>> requiring the application to enter the kernel; that's how CQ ring
>>>> overflow is handled as well. For liburing users, that'd be trivial to
>>>> hide; for the raw interface, that's not necessarily the case. I'd hate
>>>> to make the feature opt-in rather than just generally available.
>>>>
>>>> I'll try and play with some ideas in this area and see how it falls out.
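For reference, the poll + linked read pattern being discussed looks
roughly like the below from the application side. This is a minimal
sketch using the standard liburing helpers, not code from this thread,
with error handling mostly elided:

/*
 * Minimal sketch of the poll + linked read pattern: a POLL_ADD sqe
 * with IOSQE_IO_LINK set, chaining a READV behind it.
 */
#include <poll.h>
#include <stdio.h>
#include <sys/uio.h>
#include <liburing.h>

static int poll_then_read(int fd, void *buf, size_t len)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	struct io_uring ring;
	int i, ret;

	ret = io_uring_queue_init(4, &ring, 0);
	if (ret < 0)
		return ret;

	/* first sqe: wait for the fd to become readable */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, fd, POLLIN);
	sqe->flags |= IOSQE_IO_LINK;	/* chain the read behind the poll */

	/* second sqe: the read, started only after the poll completes */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_readv(sqe, fd, &iov, 1, 0);

	ret = io_uring_submit(&ring);
	if (ret < 0)
		goto out;

	/* reap both completions: the poll mask, then the read result */
	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0)
			break;
		printf("cqe res=%d\n", cqe->res);
		io_uring_cqe_seen(&ring, cqe);
	}
out:
	io_uring_queue_exit(&ring);
	return ret;
}

With the old scheme, the read half of that link was always punted to
async context from the wakeup handler; with the task_work approach it
runs inline once the task processes its work list.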
>>>
>>> I wonder if the below is enough - it'll trigger a poll and eventfd
>>> wakeup, if we add work. If current->task_works != NULL, we could also
>>> set POLLPRI to make it explicit why this happened; that seems like a
>>> better fit than POLLRDBAND.
>>
>> I guess we still need a way to ensure that the task work is definitely
>> run. The app could still just do peeks and decide nothing is there,
>> then poll again (or eventfd wait again).
>>
>> I wonder how well it'd work to simply have a timer, with some
>> arbitrarily selected timeout, that we arm (if it isn't already) when
>> task work is queued. If the timer expires and there's task work
>> pending, we grab it and run it async. That'd work for any case without
>> needing any action on the application side; the downside would be
>> that it'd be slow if the application assumes it can just do ring
>> peeks - it'd only help as a last resort "ensure this work is
>> definitely run" kind of thing. So we'd have this weird corner case of
>> "things work, but why is it running slower than it should". Seems that
>> the likelihood of never entering the kernel is low, but it's not
>> impossible at all.
>
> And I'll keep musing... I wonder if something ala:
>
>         if (task_currently_in_kernel)
>                 task_work_add();
>         else
>                 punt async
>
> would be good enough, done in the proper order to avoid races, of
> course. If the usual case is the task sitting in io_cqring_wait(),
> then that'd work. It means async for the polled/eventfd case always,
> though, unless the task happens to be in the kernel waiting on that
> side.
>
> Also requires some means to check if the task is in the kernel
> currently; I haven't checked if we already have something that'd tell
> us that.

Just on a flight, took a closer look at the task_work code. Do we
really need anything else? If 'notify' is set, then we call
set_notify_resume(), which in turn sets TIF_NOTIFY_RESUME and kicks the
process. That should go through signal handling (like TIF_SIGPENDING),
regardless of whether or not the task is currently in the kernel.

Unless I'm missing something, it looks to me like we have everything we
need. I even wrote a small sample tester that submits IO and
deliberately never enters the kernel, and we peek events just fine. If
that works, then the notification side should work just as well.
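To spell out the mechanism in that last paragraph, here is a rough
paraphrase of the v5.5-era pieces involved, from the wake handler to
the work actually running. Illustrative only, not verbatim kernel
source:

/*
 * io_poll_wake(), from the patch above: queue the work, kick the task.
 */
init_task_work(&req->task_work, io_poll_task_func);
task_work_add(tsk, &req->task_work, true);	/* notify == true */

/* kernel/task_work.c: with notify set, after queueing the callback */
set_notify_resume(task);	/* sets TIF_NOTIFY_RESUME, kick_process() */

/*
 * Any transition back to userspace (syscall exit, interrupt return)
 * then runs through exit_to_usermode_loop():
 */
if (cached_flags & _TIF_NOTIFY_RESUME)
	tracehook_notify_resume(regs);	/* -> task_work_run() */

/*
 * task_work_run() invokes io_poll_task_func() in task context. This is
 * why it works whether or not the task is in the kernel at wake time:
 * kick_process() forces a user->kernel->user round trip even for a
 * task that is spinning in userspace.
 */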
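The tester itself isn't included here, but the peek side it relies on
could look something like the hypothetical snippet below: after
submission, completions are reaped purely from the shared-memory CQ
ring, without ever entering the kernel to wait.

#include <errno.h>
#include <stdio.h>
#include <liburing.h>

/*
 * Hypothetical peek-only reaper: consume 'expected' completions by
 * busy-polling the CQ ring, never sleeping in the kernel.
 */
static int peek_all(struct io_uring *ring, int expected)
{
	struct io_uring_cqe *cqe;
	int seen = 0, ret;

	while (seen < expected) {
		ret = io_uring_peek_cqe(ring, &cqe);
		if (ret == -EAGAIN)
			continue;	/* CQ ring empty, keep spinning */
		if (ret < 0)
			return ret;
		printf("res=%d user_data=%llu\n", cqe->res,
		       (unsigned long long) cqe->user_data);
		io_uring_cqe_seen(ring, cqe);
		seen++;
	}
	return seen;
}

-- 
Jens Axboe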