On Thu, Feb 20, 2020 at 11:14 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> On 2/20/20 3:02 PM, Jann Horn wrote:
> > On Thu, Feb 20, 2020 at 9:32 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
> >>
> >> For poll requests, it's not uncommon to link a read (or write) after
> >> the poll to execute immediately after the file is marked as ready.
> >> Since the poll completion is called inside the waitqueue wake up handler,
> >> we have to punt that linked request to async context. This slows down
> >> the processing, and actually means it's faster to not use a link for this
> >> use case.
> >>
> >> We also run into problems if the completion_lock is contended, as we're
> >> doing a different lock ordering than the issue side is. Hence we have
> >> to do trylock for completion, and if that fails, go async. Poll removal
> >> needs to go async as well, for the same reason.
> >>
> >> eventfd notification needs special case as well, to avoid stack blowing
> >> recursion or deadlocks.
> >>
> >> These are all deficiencies that were inherited from the aio poll
> >> implementation, but I think we can do better. When a poll completes,
> >> simply queue it up in the task poll list. When the task completes the
> >> list, we can run dependent links inline as well. This means we never
> >> have to go async, and we can remove a bunch of code associated with
> >> that, and optimizations to try and make that run faster. The diffstat
> >> speaks for itself.
> > [...]
> >> -static void io_poll_trigger_evfd(struct io_wq_work **workptr)
> >> +static void io_poll_task_func(struct callback_head *cb)
> >>  {
> >> -        struct io_kiocb *req = container_of(*workptr, struct io_kiocb, work);
> >> +        struct io_kiocb *req = container_of(cb, struct io_kiocb, sched_work);
> >> +        struct io_kiocb *nxt = NULL;
> >>
> > [...]
> >> +        io_poll_task_handler(req, &nxt);
> >> +        if (nxt)
> >> +                __io_queue_sqe(nxt, NULL);
> >
> > This can now get here from anywhere that calls schedule(), right?
> > Which means that this might almost double the required kernel stack
> > size, if one codepath exists that calls schedule() while near the
> > bottom of the stack and another codepath exists that goes from here
> > through the VFS and again uses a big amount of stack space? This is a
> > somewhat ugly suggestion, but I wonder whether it'd make sense to
> > check whether we've consumed over 25% of stack space, or something
> > like that, and if so, directly punt the request.
>
> Right, it'll increase the stack usage. Not against adding some safe
> guard that punts if we're too deep in, though I'd have to look how to
> even do that... Looks like stack_not_used(), though it's not clear to me
> how efficient that is?

No, I don't think you want to do that... at least on X86-64, I think
something vaguely like this should do the job:

unsigned long cur_stack = (unsigned long)__builtin_frame_address(0);
unsigned long begin = (unsigned long)task_stack_page(task);
unsigned long end = (unsigned long)task_stack_page(task) + THREAD_SIZE;

if (cur_stack < begin || cur_stack >= end ||
    cur_stack < begin + THREAD_SIZE*3/4)
        [bailout]

But since stacks grow in different directions depending on the
architecture and so on, it might have to be an arch-specific thing...
I'm not sure.
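
Wrapped up as a helper, it could look vaguely like this
(io_stack_ok_for_inline() is just a name I made up, it's untested,
and it hardcodes the x86-64 assumption that the stack grows down;
task_stack_page() lives in <linux/sched/task_stack.h>):

/* made-up helper: true if less than ~25% of the stack is in use */
static bool io_stack_ok_for_inline(void)
{
        unsigned long cur = (unsigned long)__builtin_frame_address(0);
        unsigned long begin = (unsigned long)task_stack_page(current);

        return cur >= begin + THREAD_SIZE * 3 / 4 &&
               cur < begin + THREAD_SIZE;
}

and then the callback could do something like:

        if (nxt) {
                if (io_stack_ok_for_inline())
                        __io_queue_sqe(nxt, NULL);
                else
                        [punt nxt to async]
        }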

> > Also, can we recursively hit this point? Even if __io_queue_sqe()
> > doesn't *want* to block, the code it calls into might still block on a
> > mutex or something like that, at which point the mutex code would call
> > into schedule(), which would then again hit sched_out_update() and get
> > here, right? As far as I can tell, this could cause unbounded
> > recursion.
>
> The sched_work items are pruned before being run, so that can't happen.

And is it impossible for new ones to be added in the meantime if a
second poll operation completes in the background just when we're
entering __io_queue_sqe()?
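
Concretely, the kind of chain I have in mind (the mutex is just an
example of something that can end up sleeping):

schedule()
  sched_out_update()
    io_poll_task_func()              <- sched_work from poll #1
      __io_queue_sqe(nxt)
        ... file's read/write path ...
          mutex_lock()               <- contended, sleeps
            schedule()
              sched_out_update()
                io_poll_task_func()  <- sched_work from poll #2,
                  __io_queue_sqe()      added after the pruning
                    ...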