On 2/22/20 7:41 AM, Jens Axboe wrote:
> On 2/21/20 4:00 PM, Jann Horn wrote:
>> On Fri, Feb 21, 2020 at 11:56 PM Jann Horn <jannh@xxxxxxxxxx> wrote:
>>> On Fri, Feb 21, 2020 at 10:46 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>> For poll requests, it's not uncommon to link a read (or write) after
>>>> the poll to execute immediately after the file is marked as ready.
>>>> Since the poll completion is called inside the waitqueue wake up handler,
>>>> we have to punt that linked request to async context. This slows down
>>>> the processing, and actually means it's faster to not use a link for this
>>>> use case.
>>>>
>>>> We also run into problems if the completion_lock is contended, as we're
>>>> doing a different lock ordering than the issue side is. Hence we have
>>>> to do trylock for completion, and if that fails, go async. Poll removal
>>>> needs to go async as well, for the same reason.
>>>>
>>>> eventfd notification needs special casing as well, to avoid stack blowing
>>>> recursion or deadlocks.
>>>>
>>>> These are all deficiencies that were inherited from the aio poll
>>>> implementation, but I think we can do better. When a poll completes,
>>>> simply queue it up in the task poll list. When the task completes the
>>>> list, we can run dependent links inline as well. This means we never
>>>> have to go async, and we can remove a bunch of code associated with
>>>> that, and optimizations to try and make that run faster. The diffstat
>>>> speaks for itself.
>>> [...]
>>>> @@ -3637,8 +3587,8 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>  {
>>>>         struct io_kiocb *req = wait->private;
>>>>         struct io_poll_iocb *poll = &req->poll;
>>>> -       struct io_ring_ctx *ctx = req->ctx;
>>>>         __poll_t mask = key_to_poll(key);
>>>> +       struct task_struct *tsk;
>>>>
>>>>         /* for instances that support it check for an event match first: */
>>>>         if (mask && !(mask & poll->events))
>>>> @@ -3646,46 +3596,11 @@ static int io_poll_wake(struct wait_queue_entry *wait, unsigned mode, int sync,
>>>>
>>>>         list_del_init(&poll->wait.entry);
>>>>
>>> [...]
>>>> +       tsk = req->task;
>>>> +       req->result = mask;
>>>> +       init_task_work(&req->task_work, io_poll_task_func);
>>>> +       task_work_add(tsk, &req->task_work, true);
>>>> +       wake_up_process(tsk);
>>>>         return 1;
>>>>  }
>>>
>>> Let's say userspace has some code like this:
>>>
>>> [prepare two uring requests: one POLL and a RECVMSG linked behind it]
>>> // submit requests
>>> io_uring_enter(uring_fd, 2, 0, 0, NULL, 0);
>>> // wait for something to happen, either a completion event from uring
>>> // or input from stdin
>>> struct pollfd fds[] = {
>>>         { .fd = 0, .events = POLLIN },
>>>         { .fd = uring_fd, .events = POLLIN }
>>> };
>>> while (1) {
>>>         poll(fds, 2, -1);
>>>         if (fds[0].revents) {
>>>                 [read stuff from stdin]
>>>         }
>>>         if (fds[1].revents) {
>>>                 [fetch completions from shared memory]
>>>         }
>>> }
>>>
>>> If userspace has reached the poll() by the time the uring POLL op
>>> completes, I think you'll wake up the do_poll() loop while it is in
>>> poll_schedule_timeout(); then it will do another iteration, see that
>>> no signals are pending and none of the polled files have become ready,
>>> and go to sleep again. So things are stuck until the io_uring fd
>>> signals that it is ready.
>>>
>>> The options I see are:
>>>
>>> - Tell the kernel to go through signal delivery code, which I think
>>>   will cause the pending syscall to actually abort and return to
>>>   userspace (which I think is kinda gross).
You could maybe add a
>>>   special case where that doesn't happen if the task is already in
>>>   io_uring_enter() and waiting for CQ events.
>>> - Forbid eventfd notifications, ensure that the ring's ->poll handler
>>>   reports POLLIN when work items are pending for userspace, and then
>>>   rely on the fact that those work items will be picked up when
>>>   returning from the poll syscall. Unfortunately, this gets a bit messy
>>>   when you're dealing with multiple threads that access the same ring,
>>>   since then you'd have to ensure that *any* thread can pick up this
>>>   work, and that that doesn't mismatch how the uring instance is shared
>>>   between threads; but you could probably engineer your way around this.
>>>   For userspace, this whole thing just means "POLLIN may be spurious".
>>> - Like the previous item, except you tell userspace that if it gets
>>>   POLLIN (or some special poll status like POLLRDBAND) and sees nothing
>>>   in the completion queue, it should call io_uring_enter() to process
>>>   the work. This addresses the submitter-is-not-completion-reaper
>>>   scenario without having to add some weird version of task_work that
>>>   will be processed by the first thread, but you'd get some extra
>>>   syscalls.
>>
>> ... or I suppose you could punt to worker context if anyone uses the
>> ring's ->poll handler or has an eventfd registered, if you don't
>> expect high-performance users to do those things.
>
> Good points, thanks Jann. We have some precedent in the area of
> requiring the application to enter the kernel; that's how CQ ring
> overflow is handled as well. For liburing users that'd be trivial to
> hide, but for the raw interface that's not necessarily the case. I'd
> hate to make the feature opt-in rather than just generally available.
>
> I'll try and play with some ideas in this area and see how it falls out.

I wonder if the below is enough - it'll trigger a poll and eventfd wakeup
if we add work. If current->task_works != NULL, we could also set POLLPRI
to make it explicit why this happened; that seems like a better fit than
POLLRDBAND.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6f7ae3eab21f..dba2f0e1ae6a 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -3501,6 +3501,18 @@ static void io_async_queue_proc(struct file *file, struct wait_queue_head *head,
        __io_queue_proc(&pt->req->apoll->poll, pt, head);
 }
 
+static void io_task_work_notify(struct io_ring_ctx *ctx,
+                               struct task_struct *tsk)
+{
+       if (wq_has_sleeper(&ctx->cq_wait)) {
+               wake_up_interruptible(&ctx->cq_wait);
+               kill_fasync(&ctx->cq_fasync, SIGIO, POLL_IN);
+       }
+       if (ctx->cq_ev_fd)
+               eventfd_signal(ctx->cq_ev_fd, 0);
+       wake_up_process(tsk);
+}
+
 static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
                           __poll_t mask, task_work_func_t func)
 {
@@ -3523,7 +3535,7 @@ static int __io_async_wake(struct io_kiocb *req, struct io_poll_iocb *poll,
         * don't need to check here and handle it specifically.
         */
        task_work_add(tsk, &req->task_work, true);
-       wake_up_process(tsk);
+       io_task_work_notify(req->ctx, tsk);
        return 1;
 }
 
@@ -6488,7 +6500,7 @@ static __poll_t io_uring_poll(struct file *file, poll_table *wait)
        if (READ_ONCE(ctx->rings->sq.tail) - ctx->cached_sq_head !=
            ctx->rings->sq_ring_entries)
                mask |= EPOLLOUT | EPOLLWRNORM;
-       if (io_cqring_events(ctx, false))
+       if (io_cqring_events(ctx, false) || current->task_works != NULL)
                mask |= EPOLLIN | EPOLLRDNORM;
 
        return mask;

-- 
Jens Axboe
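
As an aside (not from the thread above), here is a minimal sketch of what
"POLLIN may be spurious" would mean for a raw-interface consumer, assuming
the io_uring_poll() change above is applied: if the ring fd reports POLLIN
but the CQ ring looks empty, the application enters the kernel once so the
pending task_work runs and the real completions get posted. The
sys_io_uring_enter() wrapper and the cq_head/cq_tail pointers into the
mmap'ed CQ ring are illustrative assumptions, not part of the thread;
liburing users would have this hidden behind its helpers.

/* Illustrative only: raw io_uring_enter(2) wrapper, since the sketch
 * bypasses liburing. */
#include <poll.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

static int sys_io_uring_enter(int fd, unsigned to_submit,
                              unsigned min_complete, unsigned flags)
{
        return syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
                       flags, NULL, 0);
}

/* cq_head/cq_tail are assumed to point into the mmap'ed CQ ring. */
static void wait_loop(int uring_fd, unsigned *cq_head, unsigned *cq_tail)
{
        struct pollfd fds[] = {
                { .fd = 0,        .events = POLLIN },
                { .fd = uring_fd, .events = POLLIN },
        };

        for (;;) {
                if (poll(fds, 2, -1) < 0)
                        break;

                if (fds[0].revents & POLLIN) {
                        /* read stuff from stdin */
                }

                if (fds[1].revents & POLLIN) {
                        unsigned head = *cq_head;
                        unsigned tail = __atomic_load_n(cq_tail,
                                                        __ATOMIC_ACQUIRE);

                        /*
                         * POLLIN may only mean "task_work is pending", not
                         * "CQEs are available". If the CQ ring looks empty,
                         * enter the kernel once so the deferred work runs
                         * and the real completions are posted.
                         */
                        if (head == tail)
                                sys_io_uring_enter(uring_fd, 0, 0,
                                                   IORING_ENTER_GETEVENTS);

                        /* fetch completions from shared memory */
                }
        }
}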