On 10/24/19 6:35 PM, Jens Axboe wrote:
> On 10/24/19 5:13 PM, Jann Horn wrote:
>> On Fri, Oct 25, 2019 at 12:04 AM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> On 10/24/19 2:31 PM, Jann Horn wrote:
>>>> On Thu, Oct 24, 2019 at 9:41 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>> On 10/18/19 12:50 PM, Jann Horn wrote:
>>>>>> On Fri, Oct 18, 2019 at 8:16 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>>>> On 10/18/19 12:06 PM, Jann Horn wrote:
>>>>>>>> But actually, by the way: Is this whole files_struct thing creating a
>>>>>>>> reference loop? The files_struct has a reference to the uring file,
>>>>>>>> and the uring file has ACCEPT work that has a reference to the
>>>>>>>> files_struct. If the task gets killed and the accept work blocks, the
>>>>>>>> entire files_struct will stay alive, right?
>>>>>>>
>>>>>>> Yes, for the lifetime of the request, it does create a loop. So if the
>>>>>>> application goes away, I think you're right, the files_struct will stay.
>>>>>>> And so will the io_uring, for that matter, as we depend on the closing
>>>>>>> of the files to do the final reap.
>>>>>>>
>>>>>>> Hmm, not sure how best to handle that, to be honest. We need some way to
>>>>>>> break the loop, if the request never finishes.
>>>>>>
>>>>>> A wacky and dubious approach would be to, instead of taking a
>>>>>> reference to the files_struct, abuse f_op->flush() to synchronously
>>>>>> flush out pending requests with references to the files_struct... But
>>>>>> it's probably a bad idea, given that in f_op->flush(), you can't
>>>>>> easily tell which files_struct the close is coming from. I suppose you
>>>>>> could keep a list of (fdtable, fd) pairs through which ACCEPT requests
>>>>>> have come in and then let f_op->flush() probe whether the file
>>>>>> pointers are gone from them...
>>>>>
>>>>> Got back to this after finishing the io-wq stuff, which we need for the
>>>>> cancel.
>>>>>
>>>>> Here's an updated patch:
>>>>>
>>>>> http://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=1ea847edc58d6a54ca53001ad0c656da57257570
>>>>>
>>>>> that seems to work for me (lightly tested), we correctly find and cancel
>>>>> work that is holding on to the file table.
>>>>>
>>>>> The full series sits on top of my for-5.5/io_uring-wq branch, and can be
>>>>> viewed here:
>>>>>
>>>>> http://git.kernel.dk/cgit/linux-block/log/?h=for-5.5/io_uring-test
>>>>>
>>>>> Let me know what you think!
>>>>
>>>> Ah, I didn't realize that the second argument to f_op->flush is a
>>>> pointer to the files_struct. That's neat.
>>>>
>>>> Security: There is no guarantee that ->flush() will run after the last
>>>> io_uring_enter() finishes. You can race like this, with threads A and
>>>> B in one process and C in another one:
>>>>
>>>> A: sends uring fd to C via unix domain socket
>>>> A: starts syscall io_uring_enter(fd, ...)
>>>> A: calls fdget(fd), takes reference to file
>>>> B: starts syscall close(fd)
>>>> B: fd table entry is removed
>>>> B: f_op->flush is invoked and finds no pending transactions
>>>> B: syscall close() returns
>>>> A: continues io_uring_enter(), grabbing current->files
>>>> A: io_uring_enter() returns
>>>> A and B: exit
>>>> worker: use-after-free access to files_struct
>>>>
>>>> I think the solution to this would be (unless you're fine with adding
>>>> some broad global read-write mutex) something like this in
>>>> __io_queue_sqe(), where "fd" and "f" are the variables from
>>>> io_uring_enter(), plumbed through the stack somehow:
>>>>
>>>> if (req->flags & REQ_F_NEED_FILES) {
>>>>         rcu_read_lock();
>>>>         spin_lock_irq(&ctx->inflight_lock);
>>>>         if (fcheck(fd) == f) {
>>>>                 list_add(&req->inflight_list,
>>>>                          &ctx->inflight_list);
>>>>                 req->work.files = current->files;
>>>>                 ret = 0;
>>>>         } else {
>>>>                 ret = -EBADF;
>>>>         }
>>>>         spin_unlock_irq(&ctx->inflight_lock);
>>>>         rcu_read_unlock();
>>>>         if (ret)
>>>>                 goto put_req;
>>>> }
>>>
>>> First of all, thanks for the thorough look at this! We already have f
>>> available here, it's req->file. And we just made a copy of the sqe, so
>>> we have sqe->fd available as well. I fixed this up.
>>
>> sqe->fd is the file descriptor we're doing I/O on, not the file
>> descriptor of the uring file, right? Same thing for req->file. This
>> check only detects whether the fd we're doing I/O on was closed, which
>> is irrelevant.
> 
> Duh yes, I'm an idiot. Easily fixable, I'll update this for the ring fd.

Incremental:

diff --git a/fs/io_uring.c b/fs/io_uring.c
index ec9dadfa90d2..4d94886a3d13 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -262,11 +262,13 @@ struct io_ring_ctx {
 
 struct sqe_submit {
 	const struct io_uring_sqe	*sqe;
+	struct file			*ring_file;
 	unsigned short			index;
 	bool				has_user : 1;
 	bool				in_async : 1;
 	bool				needs_fixed_file : 1;
 	u32				sequence;
+	int				ring_fd;
 };
 
 /*
@@ -2329,14 +2331,13 @@ static int io_req_set_file(struct io_ring_ctx *ctx, const struct sqe_submit *s,
 	return 0;
 }
 
-static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req,
-			 struct io_uring_sqe *sqe)
+static int io_grab_files(struct io_ring_ctx *ctx, struct io_kiocb *req)
 {
 	int ret = -EBADF;
 
 	rcu_read_lock();
 	spin_lock_irq(&ctx->inflight_lock);
-	if (fcheck(sqe->fd) == req->file) {
+	if (fcheck(req->submit.ring_fd) == req->submit.ring_file) {
 		list_add(&req->inflight_entry, &ctx->inflight_list);
 		req->work.files = current->files;
 		ret = 0;
@@ -2367,7 +2368,7 @@ static int __io_queue_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
 		s->sqe = sqe_copy;
 		memcpy(&req->submit, s, sizeof(*s));
 		if (req->flags & REQ_F_NEED_FILES) {
-			ret = io_grab_files(ctx, req, sqe_copy);
+			ret = io_grab_files(ctx, req);
 			if (ret) {
 				kfree(sqe_copy);
 				goto err;
@@ -2585,6 +2586,7 @@ static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
 
 	head = READ_ONCE(sq_array[head & ctx->sq_mask]);
 	if (head < ctx->sq_entries) {
+		s->ring_file = NULL;
 		s->index = head;
 		s->sqe = &ctx->sq_sqes[head];
 		s->sequence = ctx->cached_sq_head;
@@ -2782,7 +2784,8 @@ static int io_sq_thread(void *data)
 	return 0;
 }
 
-static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
+static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
+			  struct file *ring_file, int ring_fd)
 {
 	struct io_submit_state state, *statep = NULL;
 	struct io_kiocb *link = NULL;
@@ -2824,9 +2827,11 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit)
 		}
 
 out:
+		s.ring_file = ring_file;
 		s.has_user = true;
 		s.in_async = false;
 		s.needs_fixed_file = false;
+		s.ring_fd = ring_fd;
 		submit++;
 		trace_io_uring_submit_sqe(ctx, true, false);
 		io_submit_sqe(ctx, &s, statep, &link);
@@ -3828,10 +3833,9 @@ static int io_uring_flush(struct file *file, void *data)
 {
 	struct io_ring_ctx *ctx = file->private_data;
 
+	io_uring_cancel_files(ctx, data);
 	if (fatal_signal_pending(current) || (current->flags & PF_EXITING))
 		io_wq_cancel_all(ctx->io_wq);
-	else
-		io_uring_cancel_files(ctx, data);
 	return 0;
 }
@@ -3903,7 +3907,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
 		to_submit = min(to_submit, ctx->sq_entries);
 
 		mutex_lock(&ctx->uring_lock);
-		submitted = io_ring_submit(ctx, to_submit);
+		submitted = io_ring_submit(ctx, to_submit, f.file, fd);
 		mutex_unlock(&ctx->uring_lock);
 	}
 	if (flags & IORING_ENTER_GETEVENTS) {

-- 
Jens Axboe
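
For anyone who wants to poke at this, here is a minimal userspace sketch of
the reference loop Jann describes at the top of the thread. It assumes the
liburing helpers (io_uring_queue_init(), io_uring_get_sqe(),
io_uring_prep_accept(), io_uring_submit()) and the ACCEPT opcode support
from the branch above; it is illustrative only, not part of the patch, and
error handling is omitted.

/* ref_loop.c: illustration of the files_struct <-> uring loop. */
#include <liburing.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct sockaddr_in addr;
	int lfd;

	/* Listening socket that nobody will ever connect to. */
	lfd = socket(AF_INET, SOCK_STREAM, 0);
	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(lfd, 1);

	io_uring_queue_init(8, &ring, 0);

	/* Queue an accept that will block indefinitely in async context. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_accept(sqe, lfd, NULL, NULL, 0);
	io_uring_submit(&ring);

	/*
	 * Exit without reaping the completion. The pending accept holds a
	 * reference to this task's files_struct, and the files_struct holds
	 * the uring fd: the loop under discussion. The patch breaks it by
	 * cancelling inflight requests from f_op->flush() when the uring fd
	 * is closed, via ctx->inflight_list.
	 */
	return 0;
}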