On 1/28/19 3:32 PM, Jann Horn wrote: > On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@xxxxxxxxx> wrote: >> The submission queue (SQ) and completion queue (CQ) rings are shared >> between the application and the kernel. This eliminates the need to >> copy data back and forth to submit and complete IO. >> >> IO submissions use the io_uring_sqe data structure, and completions >> are generated in the form of io_uring_sqe data structures. The SQ >> ring is an index into the io_uring_sqe array, which makes it possible >> to submit a batch of IOs without them being contiguous in the ring. >> The CQ ring is always contiguous, as completion events are inherently >> unordered, and hence any io_uring_cqe entry can point back to an >> arbitrary submission. >> >> Two new system calls are added for this: >> >> io_uring_setup(entries, params) >> Sets up a context for doing async IO. On success, returns a file >> descriptor that the application can mmap to gain access to the >> SQ ring, CQ ring, and io_uring_sqes. >> >> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) >> Initiates IO against the rings mapped to this fd, or waits for >> them to complete, or both. The behavior is controlled by the >> parameters passed in. If 'to_submit' is non-zero, then we'll >> try and submit new IO. If IORING_ENTER_GETEVENTS is set, the >> kernel will wait for 'min_complete' events, if they aren't >> already available. It's valid to set IORING_ENTER_GETEVENTS >> and 'min_complete' == 0 at the same time, this allows the >> kernel to return already completed events without waiting >> for them. This is useful only for polling, as for IRQ >> driven IO, the application can just check the CQ ring >> without entering the kernel. >> >> With this setup, it's possible to do async IO with a single system >> call. Future developments will enable polled IO with this interface, >> and polled submission as well. The latter will enable an application >> to do IO without doing ANY system calls at all. >> >> For IRQ driven IO, an application only needs to enter the kernel for >> completions if it wants to wait for them to occur. >> >> Each io_uring is backed by a workqueue, to support buffered async IO >> as well. We will only punt to an async context if the command would >> need to wait for IO on the device side. Any data that can be accessed >> directly in the page cache is done inline. This avoids the slowness >> issue of usual threadpools, since cached data is accessed as quickly >> as a sync interface. >> >> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c > [...] >> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, >> + bool force_nonblock) >> +{ >> + struct kiocb *kiocb = &req->rw; >> + int ret; >> + >> + kiocb->ki_filp = fget(sqe->fd); >> + if (unlikely(!kiocb->ki_filp)) >> + return -EBADF; >> + kiocb->ki_pos = sqe->off; >> + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); >> + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); >> + if (sqe->ioprio) { >> + ret = ioprio_check_cap(sqe->ioprio); >> + if (ret) >> + goto out_fput; >> + >> + kiocb->ki_ioprio = sqe->ioprio; >> + } else >> + kiocb->ki_ioprio = get_current_ioprio(); >> + >> + ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags); >> + if (unlikely(ret)) >> + goto out_fput; >> + if (force_nonblock) { >> + kiocb->ki_flags |= IOCB_NOWAIT; >> + req->flags |= REQ_F_FORCE_NONBLOCK; >> + } >> + if (kiocb->ki_flags & IOCB_HIPRI) { >> + ret = -EINVAL; >> + goto out_fput; >> + } >> + >> + kiocb->ki_complete = io_complete_rw; >> + return 0; >> +out_fput: >> + fput(kiocb->ki_filp); >> + return ret; >> +} > [...] >> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, >> + bool force_nonblock) >> +{ >> + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; >> + struct kiocb *kiocb = &req->rw; >> + struct iov_iter iter; >> + struct file *file; >> + ssize_t ret; >> + >> + ret = io_prep_rw(req, sqe, force_nonblock); >> + if (ret) >> + return ret; >> + file = kiocb->ki_filp; >> + >> + ret = -EBADF; >> + if (unlikely(!(file->f_mode & FMODE_READ))) >> + goto out_fput; >> + ret = -EINVAL; >> + if (unlikely(!file->f_op->read_iter)) >> + goto out_fput; >> + >> + ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter); >> + if (ret) >> + goto out_fput; >> + >> + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); >> + if (!ret) { >> + ssize_t ret2; >> + >> + /* Catch -EAGAIN return for forced non-blocking submission */ >> + ret2 = call_read_iter(file, kiocb, &iter); >> + if (!force_nonblock || ret2 != -EAGAIN) >> + io_rw_done(kiocb, ret2); >> + else >> + ret = -EAGAIN; >> + } >> + kfree(iovec); >> +out_fput: >> + if (unlikely(ret)) >> + fput(file); >> + return ret; >> +} > [...] >> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, >> + struct sqe_submit *s, bool force_nonblock) >> +{ >> + const struct io_uring_sqe *sqe = s->sqe; >> + ssize_t ret; >> + >> + if (unlikely(s->index >= ctx->sq_entries)) >> + return -EINVAL; >> + req->user_data = sqe->user_data; >> + >> + ret = -EINVAL; >> + switch (sqe->opcode) { >> + case IORING_OP_NOP: >> + ret = io_nop(req, sqe); >> + break; >> + case IORING_OP_READV: >> + ret = io_read(req, sqe, force_nonblock); >> + break; >> + case IORING_OP_WRITEV: >> + ret = io_write(req, sqe, force_nonblock); >> + break; >> + default: >> + ret = -EINVAL; >> + break; >> + } >> + >> + return ret; >> +} >> + >> +static void io_sq_wq_submit_work(struct work_struct *work) >> +{ >> + struct io_kiocb *req = container_of(work, struct io_kiocb, work); >> + struct sqe_submit *s = &req->submit; >> + u64 user_data = s->sqe->user_data; >> + struct io_ring_ctx *ctx = req->ctx; >> + mm_segment_t old_fs = get_fs(); >> + struct files_struct *old_files; >> + int ret; >> + >> + /* Ensure we clear previously set forced non-block flag */ >> + req->flags &= ~REQ_F_FORCE_NONBLOCK; >> + >> + old_files = current->files; >> + current->files = ctx->sqo_files; > > I think you're not supposed to twiddle with current->files without > holding task_lock(current). 'current' is the work queue item in this case, do we need to protect against anything else? I can add the locking around the assignments (both places). >> + if (!mmget_not_zero(ctx->sqo_mm)) { >> + ret = -EFAULT; >> + goto err; >> + } >> + >> + use_mm(ctx->sqo_mm); >> + set_fs(USER_DS); >> + >> + ret = __io_submit_sqe(ctx, req, s, false); >> + >> + set_fs(old_fs); >> + unuse_mm(ctx->sqo_mm); >> + mmput(ctx->sqo_mm); >> +err: >> + if (ret) { >> + io_cqring_add_event(ctx, user_data, ret, 0); >> + io_free_req(req); >> + } >> + current->files = old_files; >> +} > [...] >> +static int io_sq_offload_start(struct io_ring_ctx *ctx) >> +{ >> + int ret; >> + >> + ctx->sqo_mm = current->mm; > > What keeps this thing alive? I think we're deadling with the same thing as the files below, I'll defer to that. >> + /* >> + * This is safe since 'current' has the fd installed, and if that gets >> + * closed on exit, then fops->release() is invoked which waits for the >> + * async contexts to flush and exit before exiting. >> + */ >> + ret = -EBADF; >> + ctx->sqo_files = current->files; >> + if (!ctx->sqo_files) >> + goto err; > > That's gnarly. Adding Al Viro to the thread. > > I think you misunderstand the semantics of f_op->release. The ->flush > handler is invoked whenever a file descriptor is closed through > filp_close() (via deletion of the files_struct, sys_close(), > sys_dup2(), ...), so if you had used that one, _maybe_ this would > work. But the ->release handler only runs when the _last_ reference to > a struct file has been dropped - so you can, for example, fork() a > child, then exit() in the parent, and the ->release handler isn't > invoked. So I don't see how this can work. The anonfd is CLOEXEC. The idea is exactly that it only runs when the last reference to the file has been dropped. Not sure why you think I need ->flush() here? > But even if you had abused ->flush for this instead: close_files() > currently has a comment in it that claims that "this is the last > reference to the files structure"; this change would make that claim > untrue. Let me see if I can explain my intent better than that comment... We know the parent who set up the io_uring instance will be around for as long as io_uring instance persists. When we are tearing down the io_uring, then we wait for any async contexts (like the one above) to exit. Once they are exited, it's safe to proceed with the exit and teardown ->files[]. That's the idea... Not trying to be clever, some of this dates back to the aio weirdness where it was impossible to have cross references like this, since it would lead to teardown deadlocks with how exit_aio() is used. I can probably grab a struct files reference above, but currently I don't see why it's needed. >> + /* Do QD, or 2 * CPUS, whatever is smallest */ >> + ctx->sqo_wq = alloc_workqueue("io_ring-wq", WQ_UNBOUND | WQ_FREEZABLE, >> + min(ctx->sq_entries - 1, 2 * num_online_cpus())); >> + if (!ctx->sqo_wq) { >> + ret = -ENOMEM; >> + goto err; >> + } >> + >> + return 0; >> +err: >> + if (ctx->sqo_files) >> + ctx->sqo_files = NULL; >> + ctx->sqo_mm = NULL; >> + return ret; >> +} > [...] >> +static const struct file_operations io_uring_fops = { >> + .release = io_uring_release, >> + .mmap = io_uring_mmap, >> + .poll = io_uring_poll, >> + .fasync = io_uring_fasync, >> +}; > [...] > -- Jens Axboe