On Fri, Feb 21, 2020 at 12:00 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 2/20/20 3:23 PM, Jann Horn wrote:
> > On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >> On 2/20/20 3:02 PM, Jann Horn wrote:
> >>> On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>>> For poll requests, it's not uncommon to link a read (or write) after
> >>>> the poll to execute immediately after the file is marked as ready.
> >>>> Since the poll completion is called inside the waitqueue wake up handler,
> >>>> we have to punt that linked request to async context. This slows down
> >>>> the processing, and actually means it's faster to not use a link for this
> >>>> use case.
[...]
> >>>> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >>>> +static void io_poll_task_func(struct callback_head *cb)
> >>>>  {
> >>>> -	struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >>>> +	struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >>>> +	struct io_kiocb *nxt = NULL;
> >>>>
> >>> [...]
> >>>> +	io_poll_task_handler(req, &nxt);
> >>>> +	if (nxt)
> >>>> +		__io_queue_sqe(nxt, NULL);
> >>>
> >>> This can now get here from anywhere that calls schedule(), right?
> >>> Which means that this might almost double the required kernel stack
> >>> size, if one codepath exists that calls schedule() while near the
> >>> bottom of the stack and another codepath exists that goes from here
> >>> through the VFS and again uses a big amount of stack space? This is a
> >>> somewhat ugly suggestion, but I wonder whether it'd make sense to
> >>> check whether we've consumed over 25% of stack space, or something
> >>> like that, and if so, directly punt the request.
[...]
> >>> Also, can we recursively hit this point? Even if __io_queue_sqe()
> >>> doesn't *want* to block, the code it calls into might still block on a
> >>> mutex or something like that, at which point the mutex code would call
> >>> into schedule(), which would then again hit sched_out_update() and get
> >>> here, right? As far as I can tell, this could cause unbounded
> >>> recursion.
> >>
> >> The sched_work items are pruned before being run, so that can't happen.
> >
> > And is it impossible for new ones to be added in the meantime if a
> > second poll operation completes in the background just when we're
> > entering __io_queue_sqe()?
>
> True, that can happen.
>
> I wonder, if we just prevent the recursion, whether we can ignore most
> of it. E.g. never process the sched_work list if we're not at the top
> level, so to speak.
>
> This should also prevent the deadlock that you mentioned with FUSE
> in the next email that just rolled in.

But there the first ->read_iter could be from outside io_uring. So you
don't just have to worry about nesting inside an already-running uring
work; you also have to worry about nesting inside more or less anything
else that might be holding mutexes. So I think you'd pretty much have
to whitelist known-safe schedule() callers, or something like that.

Taking a step back: Do you know why this whole approach brings the kind
of performance benefit you mentioned in the cover letter? 4x is a
lot... Is it that expensive to take a trip through the scheduler?

I wonder whether the performance numbers for the echo test would change
if you commented out io_worker_spin_for_work()...
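
To make the 25%-of-stack idea above concrete, here is a minimal sketch
of what such a check could look like. The helper name and the exact
threshold are illustrative, not from any posted patch; task_stack_page()
and THREAD_SIZE are the usual kernel primitives, and
current_stack_pointer is the arch-provided stack pointer (available on
x86 and arm64, for example):

#include <linux/sched.h>
#include <linux/sched/task_stack.h>

/*
 * Illustrative sketch only: report whether more than ~25% of the
 * kernel stack is already in use. Assumes the stack grows down within
 * the THREAD_SIZE region returned by task_stack_page().
 */
static bool io_stack_usage_high(void)
{
	unsigned long stack = (unsigned long)task_stack_page(current);
	unsigned long used = stack + THREAD_SIZE - current_stack_pointer;

	return used > THREAD_SIZE / 4;
}

A completion handler like io_poll_task_func() could check this before
calling __io_queue_sqe() directly, and fall back to the normal async
punt when too much stack has already been consumed.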
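
Similarly, a rough sketch of the "never process sched_work unless
we're at the top level" idea. Everything here is hypothetical: the
in_sched_work flag does not exist in task_struct, and
sched_work_pending()/sched_work_run() merely mimic the task_work-style
API the series adds; only the sched_out_update() name comes from the
discussion above:

/*
 * Illustrative sketch only: run queued sched_work callbacks just once,
 * at the outermost schedule() call. If a callback itself blocks (e.g.
 * on a mutex inside __io_queue_sqe()) and re-enters schedule(), the
 * nested call returns immediately and leaves the list for the outer
 * level, bounding the recursion depth at one.
 */
static void sched_out_update(struct task_struct *tsk)
{
	if (tsk->in_sched_work)
		return;		/* nested schedule(): skip */

	tsk->in_sched_work = 1;
	while (sched_work_pending(tsk))
		sched_work_run(tsk);
	tsk->in_sched_work = 0;
}

As Jann points out above, though, this only bounds the recursion; it
does not help when the outermost schedule() caller already holds a
mutex that a callback needs, so some whitelisting of known-safe
schedule() callers would still be required.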