On Thu, Feb 20, 2020 at 11:02 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >
> > For poll requests, it's not uncommon to link a read (or write) after
> > the poll to execute immediately after the file is marked as ready.
> > Since the poll completion is called inside the waitqueue wake up handler,
> > we have to punt that linked request to async context. This slows down
> > the processing, and actually means it's faster to not use a link for this
> > use case.
> >
> > We also run into problems if the completion_lock is contended, as we're
> > doing a different lock ordering than the issue side is. Hence we have
> > to do trylock for completion, and if that fails, go async. Poll removal
> > needs to go async as well, for the same reason.
> >
> > eventfd notification needs special case as well, to avoid stack blowing
> > recursion or deadlocks.
> >
> > These are all deficiencies that were inherited from the aio poll
> > implementation, but I think we can do better. When a poll completes,
> > simply queue it up in the task poll list. When the task completes the
> > list, we can run dependent links inline as well. This means we never
> > have to go async, and we can remove a bunch of code associated with
> > that, and optimizations to try and make that run faster. The diffstat
> > speaks for itself.
> [...]
> > -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> > +static void io_poll_task_func(struct callback_head *cb)
> >  {
> > -       struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> > +       struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> > +       struct io_kiocb *nxt = NULL;
> >
> [...]
> > +       io_poll_task_handler(req, &nxt);
> > +       if (nxt)
> > +               __io_queue_sqe(nxt, NULL);
>
> This can now get here from anywhere that calls schedule(), right?
> Which means that this might almost double the required kernel stack
> size, if one codepath exists that calls schedule() while near the
> bottom of the stack and another codepath exists that goes from here
> through the VFS and again uses a big amount of stack space?

Oh, I think this also implies that any mutex reachable via any of the
nonblocking uring ops nests inside any mutex under which we happen to
schedule(), right? I wonder whether that's going to cause deadlocks...

For example, FUSE's ->read_iter() can call fuse_direct_io(), which can
call inode_lock() and then call fuse_sync_writes() under the inode
lock, which can wait_event(), which can schedule(); and if uring then
from schedule() calls ->read_iter() again, you could reach
inode_lock() on the same inode again, causing a deadlock, I think?
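
To make the pattern concrete, here is a minimal userspace sketch of the
kind of self-deadlock I have in mind. This is only an analogy, not the
real io_uring or FUSE code paths: the fake_* names are made up, and a
pthread mutex stands in for the non-recursive inode lock.

/*
 * Analogy for: fuse_direct_io() takes inode_lock(), then waits in
 * fuse_sync_writes(); if schedule() inside that wait runs poll task work
 * that issues the linked read inline, ->read_iter() tries to take the
 * same inode lock again in the same task.
 */
#include <pthread.h>
#include <stdio.h>

/* Stand-in for the (non-recursive) inode lock taken by fuse_direct_io(). */
static pthread_mutex_t fake_inode_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for the poll task work that __io_queue_sqe() would run inline:
 * issuing the linked read re-enters ->read_iter() on the same inode. */
static void fake_poll_task_work(void)
{
        pthread_mutex_lock(&fake_inode_lock);   /* "inode_lock()" again: this
                                                   task already holds it, so we
                                                   block here forever */
        pthread_mutex_unlock(&fake_inode_lock);
}

/* Stand-in for schedule(): with the proposed scheme, pending poll work can
 * run here, still on the stack of whoever called schedule(). */
static void fake_schedule(void)
{
        fake_poll_task_work();
}

int main(void)
{
        /* "fuse_direct_io()": take the inode lock, then wait for writes. */
        pthread_mutex_lock(&fake_inode_lock);
        fake_schedule();                /* "wait_event() -> schedule()" */
        pthread_mutex_unlock(&fake_inode_lock);
        puts("never reached");
        return 0;
}

main() here never gets past fake_schedule(), which is the same shape of
problem as a task re-taking inode_lock() on the same inode from inside
schedule(); a non-recursive lock held across a scheduling point can't be
taken again by work run at that scheduling point.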