On 10/24/19 5:13 PM, Jann Horn wrote: > On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@xxxxxxxxx> wrote: >> On 10/24/19 2:31 PM, Jann Horn wrote: >>> On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@xxxxxxxxx> wrote: >>>> On 10/18/19 12:50 PM, Jann Horn wrote: >>>>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@xxxxxxxxx> wrote: >>>>>> On 10/18/19 12:06 PM, Jann Horn wrote: >>>>>>> But actually, by the way: Is this whole files_struct thing creating a >>>>>>> reference loop? The files_struct has a reference to the uring file, >>>>>>> and the uring file has ACCEPT work that has a reference to the >>>>>>> files_struct. If the task gets killed and the accept work blocks, the >>>>>>> entire files_struct will stay alive, right? >>>>>> >>>>>> Yes, for the lifetime of the request, it does create a loop. So if the >>>>>> application goes away, I think you're right, the files_struct will stay. >>>>>> And so will the io_uring, for that matter, as we depend on the closing >>>>>> of the files to do the final reap. >>>>>> >>>>>> Hmm, not sure how best to handle that, to be honest. We need some way to >>>>>> break the loop, if the request never finishes. >>>>> >>>>> A wacky and dubious approach would be to, instead of taking a >>>>> reference to the files_struct, abuse f_op->flush() to synchronously >>>>> flush out pending requests with references to the files_struct... But >>>>> it's probably a bad idea, given that in f_op->flush(), you can't >>>>> easily tell which files_struct the close is coming from. I suppose you >>>>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests >>>>> have come in and then let f_op->flush() probe whether the file >>>>> pointers are gone from them... >>>> >>>> Got back to this after finishing the io-wq stuff, which we need for the >>>> cancel. >>>> >>>> Here's an updated patch: >>>> >>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570 >>>> >>>> that seems to work for me (lightly tested), we correctly find and cancel >>>> work that is holding on to the file table. >>>> >>>> The full series sits on top of my for-5.5/io_uring-wq branch, and can be >>>> viewed here: >>>> >>>> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test >>>> >>>> Let me know what you think! >>> >>> Ah, I didn't realize that the second argument to f_op->flush is a >>> pointer to the files_struct. That's neat. >>> >>> >>> Security: There is no guarantee that ->flush() will run after the last >>> io_uring_enter() finishes. You can race like this, with threads A and >>> B in one process and C in another one: >>> >>> A: sends uring fd to C via unix domain socket >>> A: starts syscall io_uring_enter(fd, ...) >>> A: calls fdget(fd), takes reference to file >>> B: starts syscall close(fd) >>> B: fd table entry is removed >>> B: f_op->flush is invoked and finds no pending transactions >>> B: syscall close() returns >>> A: continues io_uring_enter(), grabbing current->files >>> A: io_uring_enter() returns >>> A and B: exit >>> worker: use-after-free access to files_struct >>> >>> I think the solution to this would be (unless you're fine with adding >>> some broad global read-write mutex) something like this in >>> __io_queue_sqe(), where "fd" and "f" are the variables from >>> io_uring_enter(), plumbed through the stack somehow: >>> >>> if (req->flags & REQ_F_NEED_FILES) { >>> rcu_read_lock(); >>> spin_lock_irq(&ctx->inflight_lock); >>> if (fcheck(fd) == f) { >>> list_add(&req->inflight_list, >>> &ctx->inflight_list); >>> req->work.files = current->files; >>> ret = 0; >>> } else { >>> ret = -EBADF; >>> } >>> spin_unlock_irq(&ctx->inflight_lock); >>> rcu_read_unlock(); >>> if (ret) >>> goto put_req; >>> } >> >> First of all, thanks for the thorough look at this! We already have f >> available here, it's req->file. And we just made a copy of the sqe, so >> we have sqe->fd available as well. I fixed this up. > > sqe->fd is the file descriptor we're doing I/O on, not the file > descriptor of the uring file, right? Same thing for req->file. This > check only detects whether the fd we're doing I/O on was closed, which > is irrelevant. Duh yes, I'm an idiot. Easily fixable, I'll update this for the ring fd. >>> Security + Correctness: If there is more than one io_wqe, it seems to >>> me that io_uring_flush() calls io_wq_cancel_work(), which calls >>> io_wqe_cancel_work(), which may return IO_WQ_CANCEL_OK if the first >>> request it looks at is pending. In that case, io_wq_cancel_work() will >>> immediately return, and io_uring_flush() will also immediately return. >>> It looks like any other requests will continue running? >> >> Ah good point, I missed that. We need to keep looping until we get >> NOTFOUND returned. Fixed as well. >> >> Also added cancellation if the task is going away. Here's the >> incremental patch, I'll resend with the full version. > [...] >> +static int io_uring_flush(struct file *file, void *data) >> +{ >> + struct io_ring_ctx *ctx = file->private_data; >> + >> + if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) >> + io_wq_cancel_all(ctx->io_wq); > > Looking at io_wq_cancel_all(), this will just send a signal to the > task without waiting for anything, right? Isn't that unsafe? Yes, that's a logic error, we should always do the io_uring_cancel_files(). Ala: io_uring_cancel_files(); if (fatal_signal_pending(current) || (current->flags & PF_EXITING)) io_wq_cancel_all(ctx->io_wq); Thanks! -- Jens Axboe