On Thu, Feb 20, 2020 at 11:02 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >
> > For poll requests, it's not uncommon to link a read (or write) after
> > the poll to execute immediately after the file is marked as ready.
> > Since the poll completion is called inside the waitqueue wake up handler,
> > we have to punt that linked request to async context. This slows down
> > the processing, and actually means it's faster to not use a link for this
> > use case.
> >
> > We also run into problems if the completion_lock is contended, as we're
> > doing a different lock ordering than the issue side is. Hence we have
> > to do trylock for completion, and if that fails, go async. Poll removal
> > needs to go async as well, for the same reason.
> >
> > eventfd notification needs special case as well, to avoid stack blowing
> > recursion or deadlocks.
> >
> > These are all deficiencies that were inherited from the aio poll
> > implementation, but I think we can do better. When a poll completes,
> > simply queue it up in the task poll list. When the task completes the
> > list, we can run dependent links inline as well. This means we never
> > have to go async, and we can remove a bunch of code associated with
> > that, and optimizations to try and make that run faster. The diffstat
> > speaks for itself.
> [...]
> > -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> > +static void io_poll_task_func(struct callback_head *cb)
> >  {
> > -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> > +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> > +       struct io_kiocb *nxt = NULL;
> >
> [...]
> > +       io_poll_task_handler(req, &nxt);
> > +       if (nxt)
> > +               __io_queue_sqe(nxt, NULL);
>
> This can now get here from anywhere that calls schedule(), right?
> Which means that this might almost double the required kernel stack
> size, if one codepath exists that calls schedule() while near the
> bottom of the stack and another codepath exists that goes from here
> through the VFS and again uses a big amount of stack space?

Oh, I think this also implies that any mutex reachable via any of the
nonblocking uring ops nests inside any mutex under which we happen to
schedule(), right? I wonder whether that's going to cause deadlocks...

For example, FUSE's ->read_iter() can call fuse_direct_io(), which can
call inode_lock() and then call fuse_sync_writes() under the inode
lock, which can wait_event(), which can schedule(); and if uring then
from schedule() calls ->read_iter() again, you could reach
inode_lock() on the same inode again, causing a deadlock, I think?
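
To make the pattern concrete, here is a minimal userspace sketch of the
kind of self-deadlock I have in mind. This is only an analogy, not the
real io_uring or FUSE code paths: the fake_* names are made up, and a
pthread mutex stands in for the non-recursive inode lock.

/*
 * Analogy for: fuse_direct_io() takes inode_lock(), then waits in
 * fuse_sync_writes(); if schedule() inside that wait runs poll task work
 * that issues the linked read inline, ->read_iter() tries to take the
 * same inode lock again in the same task.
 */
#include <pthread.h>
#include <stdio.h>

/* Stand-in for the (non-recursive) inode lock taken by fuse_direct_io(). */
static pthread_mutex_t fake_inode_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the poll task work that __io_queue_sqe() would run inline:
 * issuing the linked read re-enters ->read_iter() on the same inode. */
static void fake_poll_task_work(void)
{
        pthread_mutex_lock(&fake_inode_lock);   /* "inode_lock()" again: this
                                                   task already holds it, so we
                                                   block here forever */
        pthread_mutex_unlock(&fake_inode_lock);
}

/* Stand-in for schedule(): with the proposed scheme, pending poll work can
 * run here, still on the stack of whoever called schedule(). */
static void fake_schedule(void)
{
        fake_poll_task_work();
}

int main(void)
{
        /* "fuse_direct_io()": take the inode lock, then wait for writes. */
        pthread_mutex_lock(&fake_inode_lock);
        fake_schedule();                /* "wait_event() -> schedule()" */
        pthread_mutex_unlock(&fake_inode_lock);
        puts("never reached");
        return 0;
}

main() here never gets past fake_schedule(), which is the same shape of
problem as a task re-taking inode_lock() on the same inode from inside
schedule(); a non-recursive lock held across a scheduling point can't be
taken again by work run at that scheduling point.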