On 2/19/19 9:12 AM, Jann Horn wrote:
> On Mon, Feb 11, 2019 at 8:01 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>> We normally have to fget/fput for each IO we do on a file. Even with
>> the batching we do, the cost of the atomic inc/dec of the file usage
>> count adds up.
>>
>> This adds IORING_REGISTER_FILES, and IORING_UNREGISTER_FILES opcodes
>> for the io_uring_register(2) system call. The arguments passed in must
>> be an array of __s32 holding file descriptors, and nr_args should hold
>> the number of file descriptors the application wishes to pin for the
>> duration of the io_uring instance (or until IORING_UNREGISTER_FILES is
>> called).
>>
>> When used, the application must set IOSQE_FIXED_FILE in the sqe->flags
>> member. Then, instead of setting sqe->fd to the real fd, it sets sqe->fd
>> to the index in the array passed in to IORING_REGISTER_FILES.
>>
>> Files are automatically unregistered when the io_uring instance is torn
>> down. An application need only unregister if it wishes to register a new
>> set of fds.
>>
>> Reviewed-by: Hannes Reinecke <hare@xxxxxxxx>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>> ---
> [...]
>> @@ -1335,6 +1379,161 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>>  	return READ_ONCE(ring->r.head) == READ_ONCE(ring->r.tail) ? ret : 0;
>>  }
>>
>> +static void __io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +#if defined(CONFIG_UNIX)
>> +	if (ctx->ring_sock) {
>> +		struct sock *sock = ctx->ring_sock->sk;
>> +		struct sk_buff *skb;
>> +
>> +		while ((skb = skb_dequeue(&sock->sk_receive_queue)) != NULL)
>> +			kfree_skb(skb);
>> +	}
>> +#else
>> +	int i;
>> +
>> +	for (i = 0; i < ctx->nr_user_files; i++)
>> +		fput(ctx->user_files[i]);
>> +#endif
>> +}
>> +
>> +static int io_sqe_files_unregister(struct io_ring_ctx *ctx)
>> +{
>> +	if (!ctx->user_files)
>> +		return -ENXIO;
>> +
>> +	__io_sqe_files_unregister(ctx);
>> +	kfree(ctx->user_files);
>> +	ctx->user_files = NULL;
>> +	return 0;
>> +}
>> +
>> +#if defined(CONFIG_UNIX)
>> +/*
>> + * Ensure the UNIX gc is aware of our file set, so we are certain that
>> + * the io_uring can be safely unregistered on process exit, even if we have
>> + * loops in the file referencing.
>> + */
>
> I still don't get how this is supposed to work. Quoting from an
> earlier version of the patch:
>
> |> I think the overall concept here is still broken: You're giving the
> |> user_files to the GC, and I think the GC can drop their refcounts, but
> |> I don't see you actually getting feedback from the GC anywhere that
> |> would let the GC break your references? E.g. in io_prep_rw() you grab
> |> file pointers from ctx->user_files after simply checking
> |> ctx->nr_user_files, and there is no path from the GC that touches
> |> those fields. As far as I can tell, the GC is just going to go through
> |> unix_destruct_scm() and drop references on your files, causing
> |> use-after-free.
> |>
> |> But the unix GC is complicated, and maybe I'm just missing something...
> |
> | Only when the skb is released, which is either done when the io_uring
> | is torn down (and then definitely safe), or if the socket is released,
> | which is again also at a safe time.
>
> I'll try to add inline comments on my understanding of the code, maybe
> you can point out where exactly we're understanding it differently...
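As background for the inline walk-through below, the intended usage from
userspace looks roughly like this. It is only a sketch: it leans on
liburing-style helpers (io_uring_register_files(), io_uring_prep_readv())
rather than the raw io_uring_register(2) call, the filename and buffer size
are arbitrary, and all error handling is omitted.

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/uio.h>
        #include <liburing.h>

        int main(void)
        {
                struct io_uring ring;
                struct io_uring_sqe *sqe;
                struct io_uring_cqe *cqe;
                char buf[4096];
                struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
                int fds[1];

                io_uring_queue_init(8, &ring, 0);

                /* Pin the file table once; this wraps io_uring_register(2)
                 * with IORING_REGISTER_FILES and an array of __s32 fds. */
                fds[0] = open("/etc/hostname", O_RDONLY);
                io_uring_register_files(&ring, fds, 1);

                /* Submit using the *index* into the registered array,
                 * not the real fd, and mark the sqe as fixed-file. */
                sqe = io_uring_get_sqe(&ring);
                io_uring_prep_readv(sqe, 0 /* index, not fd */, &iov, 1, 0);
                sqe->flags |= IOSQE_FIXED_FILE;

                io_uring_submit(&ring);
                io_uring_wait_cqe(&ring, &cqe);
                printf("read %d bytes\n", cqe->res);
                io_uring_cqe_seen(&ring, cqe);

                /* Registered files are dropped automatically at teardown;
                 * an explicit unregister is only needed to swap file sets. */
                io_uring_queue_exit(&ring);
                return 0;
        }

The point is simply that once IOSQE_FIXED_FILE is set, sqe->fd carries an
index into the registered array, which is why the kernel side has to keep
ctx->user_files[] valid for the whole lifetime of the ring.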
>
>> +static int __io_sqe_files_scm(struct io_ring_ctx *ctx, int nr, int offset)
>> +{
>> +	struct sock *sk = ctx->ring_sock->sk;
>> +	struct scm_fp_list *fpl;
>> +	struct sk_buff *skb;
>> +	int i;
>> +
>> +	fpl = kzalloc(sizeof(*fpl), GFP_KERNEL);
>> +	if (!fpl)
>> +		return -ENOMEM;
>> +
> // here we allocate a new `skb` with ->users==1
>> +	skb = alloc_skb(0, GFP_KERNEL);
>> +	if (!skb) {
>> +		kfree(fpl);
>> +		return -ENOMEM;
>> +	}
>> +
>> +	skb->sk = sk;
> // set the skb's destructor, invoked when ->users drops to 0;
> // destructor drops file refcounts
>> +	skb->destructor = unix_destruct_scm;
>> +
>> +	fpl->user = get_uid(ctx->user);
>> +	for (i = 0; i < nr; i++) {
> // grab a reference to each file for the skb
>> +		fpl->fp[i] = get_file(ctx->user_files[i + offset]);
>> +		unix_inflight(fpl->user, fpl->fp[i]);
>> +	}
>> +
>> +	fpl->max = fpl->count = nr;
>> +	UNIXCB(skb).fp = fpl;
>> +	refcount_add(skb->truesize, &sk->sk_wmem_alloc);
> // put the skb in the sk_receive_queue, still with a refcount of 1.
>> +	skb_queue_head(&sk->sk_receive_queue, skb);
>> +
> // drop a reference from each file; after this, only the
> skb owns references to files;
> // the ctx->user_files entries borrow their lifetime from the skb
>> +	for (i = 0; i < nr; i++)
>> +		fput(fpl->fp[i]);
>> +
>> +	return 0;
>> +}
>
> So let's say you have a cyclic dependency where an io_uring points to
> a unix domain socket, and the unix domain socket points back at the
> uring. The last reference from outside the loop goes away when the
> user closes the uring's fd, but the uring's busypolling kernel thread
> is still running and busypolling for new submission queue entries.
>
> The GC can then come along and run scan_inflight(), detect that
> ctx->ring_sock->sk->sk_receive_queue contains a reference to a unix
> domain socket, and steal the skb (unlinking it from the ring_sock and
> linking it into the hitlist):
>
> __skb_unlink(skb, &x->sk_receive_queue);
> __skb_queue_tail(hitlist, skb);
>
> And then the hitlist will be processed by __skb_queue_purge(),
> dropping the refcount of the skb from 1 to 0. At that point, the unix
> domain socket can be freed, and you still have a pointer to it in
> ctx->user_files.

I've fixed this for the sq offload case.

>> +
>> +/*
>> + * If UNIX sockets are enabled, fd passing can cause a reference cycle which
>> + * causes regular reference counting to break down. We rely on the UNIX
>> + * garbage collection to take care of this problem for us.
>> + */
>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>> +{
>> +	unsigned left, total;
>> +	int ret = 0;
>> +
>> +	total = 0;
>> +	left = ctx->nr_user_files;
>> +	while (left) {
>> +		unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);
>> +		int ret;
>> +
>> +		ret = __io_sqe_files_scm(ctx, this_files, total);
>> +		if (ret)
>> +			break;
>
> If we bail out in the middle of translating the ->user_files here, we
> have to make sure that we both destroy the already-created SKBs and
> drop our references on the files we haven't dealt with yet.

Good catch, fixed.
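One possible shape for that unwind, as an illustrative sketch only (not
necessarily what the reworked patch does): files that already made it into
an skb are owned by the socket's receive queue, so they get released when
that queue is purged (either by the caller falling back to
io_sqe_files_unregister() on failure, or at ring teardown), and only the
not-yet-transferred tail needs an explicit fput() here. The sketch also
declares ret just once, since the inner `int ret;` in the hunk above
shadows the outer one.

        static int io_sqe_files_scm(struct io_ring_ctx *ctx)
        {
                unsigned left, total;
                int ret = 0;

                total = 0;
                left = ctx->nr_user_files;
                while (left) {
                        unsigned this_files = min_t(unsigned, left, SCM_MAX_FD);

                        ret = __io_sqe_files_scm(ctx, this_files, total);
                        if (ret)
                                break;
                        left -= this_files;
                        total += this_files;
                }

                if (!ret)
                        return 0;

                /*
                 * Partial failure: the skbs created so far hold their own
                 * references to the first 'total' files and release them
                 * when the receive queue is purged; here we only drop our
                 * plain references on the files we never handed over.
                 */
                while (total < ctx->nr_user_files) {
                        fput(ctx->user_files[total]);
                        total++;
                }
                return ret;
        }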
>> +		left -= this_files;
>> +		total += this_files;
>> +	}
>> +
>> +	return ret;
>> +}
>> +#else
>> +static int io_sqe_files_scm(struct io_ring_ctx *ctx)
>> +{
>> +	return 0;
>> +}
>> +#endif
>> +
>> +static int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
>> +				 unsigned nr_args)
>> +{
>> +	__s32 __user *fds = (__s32 __user *) arg;
>> +	int fd, ret = 0;
>> +	unsigned i;
>> +
>> +	if (ctx->user_files)
>> +		return -EBUSY;
>> +	if (!nr_args)
>> +		return -EINVAL;
>> +	if (nr_args > IORING_MAX_FIXED_FILES)
>> +		return -EMFILE;
>> +
>> +	ctx->user_files = kcalloc(nr_args, sizeof(struct file *), GFP_KERNEL);
>> +	if (!ctx->user_files)
>> +		return -ENOMEM;
>> +
>> +	for (i = 0; i < nr_args; i++) {
>> +		ret = -EFAULT;
>> +		if (copy_from_user(&fd, &fds[i], sizeof(fd)))
>> +			break;
>> +
>> +		ctx->user_files[i] = fget(fd);
>> +
>> +		ret = -EBADF;
>> +		if (!ctx->user_files[i])
>> +			break;
>
> Let's say we hit this error condition after N successful loop
> iterations, on a kernel with CONFIG_UNIX. At that point, we've filled
> N file pointers into ctx->user_files[], and we've incremented
> ctx->nr_user_files up to N. Now we jump to the `if (ret)` branch,
> which goes into io_sqe_files_unregister(); but that's going to attempt
> to dequeue inflight files from ctx->ring_sock, so that's not going to
> work.

Fixed, thanks.

>> +		/*
>> +		 * Don't allow io_uring instances to be registered. If UNIX
>> +		 * isn't enabled, then this causes a reference cycle and this
>> +		 * instance can never get freed. If UNIX is enabled we'll
>> +		 * handle it just fine, but there's still no point in allowing
>> +		 * a ring fd as it doesn't support regular read/write anyway.
>> +		 */
>> +		if (ctx->user_files[i]->f_op == &io_uring_fops) {
>> +			fput(ctx->user_files[i]);
>> +			break;
>> +		}
>> +		ctx->nr_user_files++;
>
> I don't see anything that can set ctx->nr_user_files back down to
> zero; as far as I can tell, if you repeatedly register and unregister
> a set of files, ctx->nr_user_files will just grow, and since it's used
> as an upper bound for array accesses, that's bad.

Fixed that one earlier.

-- 
Jens Axboe
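For completeness, an illustrative sketch (not the actual follow-up patch) of
how the tail of io_sqe_files_register() could unwind a partial registration
and keep ctx->nr_user_files consistent; io_sqe_files_unregister() would
likewise reset the count to zero after freeing the table.

                /* ... after the copy_from_user()/fget() loop above ... */
                if (ret) {
                        /*
                         * Nothing has been handed to the socket yet, so the
                         * first ctx->nr_user_files entries only hold plain
                         * fget() references; drop them directly and reset
                         * the count so a later register starts clean.
                         */
                        for (i = 0; i < ctx->nr_user_files; i++)
                                fput(ctx->user_files[i]);

                        kfree(ctx->user_files);
                        ctx->user_files = NULL;
                        ctx->nr_user_files = 0;
                        return ret;
                }

                /* Hand the whole set to the UNIX GC machinery; on failure,
                 * tear everything down via the SCM-aware unregister path. */
                ret = io_sqe_files_scm(ctx);
                if (ret)
                        io_sqe_files_unregister(ctx);
                return ret;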