On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 2/20/20 3:23 PM, Jann Horn wrote:
> > On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>> the poll to execute immediately after the file is marked as ready.
> >>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>> we have to punt that linked request to async context. This slows down
> >>>> the processing, and actually means it's faster to not use a link for this
> >>>> use case.
[...]
> >>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>  {
> >>>> -	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>> +	struct io_kiocb *nxt = NULL;
> >>>>
> >>> [...]
> >>>> +	io_poll_task_handler(req, &nxt);
> >>>> +	if (nxt)
> >>>> +		__io_queue_sqe(nxt, NULL);
> >>>
> >>> This can now get here from anywhere that calls schedule(), right?
> >>> Which means that this might almost double the required kernel stack
> >>> size, if one codepath exists that calls schedule() while near the
> >>> bottom of the stack and another codepath exists that goes from here
> >>> through the VFS and again uses a big amount of stack space? This is a
> >>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>> check whether we've consumed over 25% of stack space, or something
> >>> like that, and if so, directly punt the request.
[...]
> >>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>> doesn't *want* to block, the code it calls into might still block on a
> >>> mutex or something like that, at which point the mutex code would call
> >>> into schedule(), which would then again hit sched_out_update() and get
> >>> here, right? As far as I can tell, this could cause unbounded
> >>> recursion.
> >>
> >> The sched_work items are pruned before being run, so that can't happen.
> >
> > And is it impossible for new ones to be added in the meantime if a
> > second poll operation completes in the background just when we're
> > entering __io_queue_sqe()?
>
> True, that can happen.
>
> I wonder, if we just prevent the recursion, whether we can ignore most
> of it. E.g. never process the sched_work list if we're not at the top
> level, so to speak.
>
> This should also prevent the deadlock that you mentioned with FUSE
> in the next email that just rolled in.

But there the first ->read_iter could be from outside io_uring. So you
don't just have to worry about nesting inside an already-running uring
work; you also have to worry about nesting inside more or less anything
else that might be holding mutexes. So I think you'd pretty much have
to whitelist known-safe schedule() callers, or something like that.

Taking a step back: Do you know why this whole approach brings the kind
of performance benefit you mentioned in the cover letter? 4x is a
lot... Is it that expensive to take a trip through the scheduler?

I wonder whether the performance numbers for the echo test would change
if you commented out io_worker_spin_for_work()...
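
To make the 25%-of-stack idea above concrete, here is a minimal sketch
of what such a check could look like. The helper name and the exact
threshold are illustrative, not from any posted patch; task_stack_page()
and THREAD_SIZE are the usual kernel primitives, and
current_stack_pointer is the arch-provided stack pointer (available on
x86 and arm64, for example):

#include <linux/sched.h>
#include <linux/sched/task_stack.h>

/*
 * Illustrative sketch only: report whether more than ~25% of the
 * kernel stack is already in use. Assumes the stack grows down within
 * the THREAD_SIZE region returned by task_stack_page().
 */
static bool io_stack_usage_high(void)
{
	unsigned long stack = (unsigned long)task_stack_page(current);
	unsigned long used = stack + THREAD_SIZE - current_stack_pointer;

	return used > THREAD_SIZE / 4;
}

A completion handler like io_poll_task_func() could check this before
calling __io_queue_sqe() directly, and fall back to the normal async
punt when too much stack has already been consumed.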
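
Similarly, a rough sketch of the "never process sched_work unless
we're at the top level" idea. Everything here is hypothetical: the
in_sched_work flag does not exist in task_struct, and
sched_work_pending()/sched_work_run() merely mimic the task_work-style
API the series adds; only the sched_out_update() name comes from the
discussion above:

/*
 * Illustrative sketch only: run queued sched_work callbacks just once,
 * at the outermost schedule() call. If a callback itself blocks (e.g.
 * on a mutex inside __io_queue_sqe()) and re-enters schedule(), the
 * nested call returns immediately and leaves the list for the outer
 * level, bounding the recursion depth at one.
 */
static void sched_out_update(struct task_struct *tsk)
{
	if (tsk->in_sched_work)
		return;		/* nested schedule(): skip */

	tsk->in_sched_work = 1;
	while (sched_work_pending(tsk))
		sched_work_run(tsk);
	tsk->in_sched_work = 0;
}

As Jann points out above, though, this only bounds the recursion; it
does not help when the outermost schedule() caller already holds a
mutex that a callback needs, so some whitelisting of known-safe
schedule() callers would still be required.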