On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@xxxxxxxxx> wrote: > On 10/18/19 12:50 PM, Jann Horn wrote: > > On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@xxxxxxxxx> wrote: > >> On 10/18/19 12:06 PM, Jann Horn wrote: > >>> But actually, by the way: Is this whole files_struct thing creating a > >>> reference loop? The files_struct has a reference to the uring file, > >>> and the uring file has ACCEPT work that has a reference to the > >>> files_struct. If the task gets killed and the accept work blocks, the > >>> entire files_struct will stay alive, right? > >> > >> Yes, for the lifetime of the request, it does create a loop. So if the > >> application goes away, I think you're right, the files_struct will stay. > >> And so will the io_uring, for that matter, as we depend on the closing > >> of the files to do the final reap. > >> > >> Hmm, not sure how best to handle that, to be honest. We need some way to > >> break the loop, if the request never finishes. > > > > A wacky and dubious approach would be to, instead of taking a > > reference to the files_struct, abuse f_op->flush() to synchronously > > flush out pending requests with references to the files_struct... But > > it's probably a bad idea, given that in f_op->flush(), you can't > > easily tell which files_struct the close is coming from. I suppose you > > could keep a list of (fdtable, fd) pairs through which ACCEPT requests > > have come in and then let f_op->flush() probe whether the file > > pointers are gone from them... > > Got back to this after finishing the io-wq stuff, which we need for the > cancel. > > Here's an updated patch: > > http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570 > > that seems to work for me (lightly tested), we correctly find and cancel > work that is holding on to the file table. > > The full series sits on top of my for-5.5/io_uring-wq branch, and can be > viewed here: > > http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test > > Let me know what you think! Ah, I didn't realize that the second argument to f_op->flush is a pointer to the files_struct. That's neat. Security: There is no guarantee that ->flush() will run after the last io_uring_enter() finishes. You can race like this, with threads A and B in one process and C in another one: A: sends uring fd to C via unix domain socket A: starts syscall io_uring_enter(fd, ...) A: calls fdget(fd), takes reference to file B: starts syscall close(fd) B: fd table entry is removed B: f_op->flush is invoked and finds no pending transactions B: syscall close() returns A: continues io_uring_enter(), grabbing current->files A: io_uring_enter() returns A and B: exit worker: use-after-free access to files_struct I think the solution to this would be (unless you're fine with adding some broad global read-write mutex) something like this in __io_queue_sqe(), where "fd" and "f" are the variables from io_uring_enter(), plumbed through the stack somehow: if (req->flags & REQ_F_NEED_FILES) { rcu_read_lock(); spin_lock_irq(&ctx->inflight_lock); if (fcheck(fd) == f) { list_add(&req->inflight_list, &ctx->inflight_list); req->work.files = current->files; ret = 0; } else { ret = -EBADF; } spin_unlock_irq(&ctx->inflight_lock); rcu_read_unlock(); if (ret) goto put_req; } Minor note: If a process uses dup() to duplicate the uring fd, then closes the duplicated fd, that will cause work cancellations - but I guess that's fine? Style nit: I find it a bit confusing to name both the list head and the list member heads "inflight_list". Maybe name them "inflight_list" and "inflight_entry", or something like that? Correctness: Why is the wait in io_uring_flush() TASK_INTERRUPTIBLE? Shouldn't it be TASK_UNINTERRUPTIBLE? If someone sends a signal to the task while it's at that schedule(), it's just going to loop back around and retry what it was doing already, right? Security + Correctness: If there is more than one io_wqe, it seems to me that io_uring_flush() calls io_wq_cancel_work(), which calls io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first request it looks at is pending. In that case, io_wq_cancel_work() will immediately return, and io_uring_flush() will also immediately return. It looks like any other requests will continue running?