On 1/28/19 4:59 PM, Jann Horn wrote: > On Tue, Jan 29, 2019 at 12:47 AM Jens Axboe <axboe@xxxxxxxxx> wrote: >> On 1/28/19 3:32 PM, Jann Horn wrote: >>> On Mon, Jan 28, 2019 at 10:35 PM Jens Axboe <axboe@xxxxxxxxx> wrote: >>>> The submission queue (SQ) and completion queue (CQ) rings are shared >>>> between the application and the kernel. This eliminates the need to >>>> copy data back and forth to submit and complete IO. >>>> >>>> IO submissions use the io_uring_sqe data structure, and completions >>>> are generated in the form of io_uring_sqe data structures. The SQ >>>> ring is an index into the io_uring_sqe array, which makes it possible >>>> to submit a batch of IOs without them being contiguous in the ring. >>>> The CQ ring is always contiguous, as completion events are inherently >>>> unordered, and hence any io_uring_cqe entry can point back to an >>>> arbitrary submission. >>>> >>>> Two new system calls are added for this: >>>> >>>> io_uring_setup(entries, params) >>>> Sets up a context for doing async IO. On success, returns a file >>>> descriptor that the application can mmap to gain access to the >>>> SQ ring, CQ ring, and io_uring_sqes. >>>> >>>> io_uring_enter(fd, to_submit, min_complete, flags, sigset, sigsetsize) >>>> Initiates IO against the rings mapped to this fd, or waits for >>>> them to complete, or both. The behavior is controlled by the >>>> parameters passed in. If 'to_submit' is non-zero, then we'll >>>> try and submit new IO. If IORING_ENTER_GETEVENTS is set, the >>>> kernel will wait for 'min_complete' events, if they aren't >>>> already available. It's valid to set IORING_ENTER_GETEVENTS >>>> and 'min_complete' == 0 at the same time, this allows the >>>> kernel to return already completed events without waiting >>>> for them. This is useful only for polling, as for IRQ >>>> driven IO, the application can just check the CQ ring >>>> without entering the kernel. >>>> >>>> With this setup, it's possible to do async IO with a single system >>>> call. Future developments will enable polled IO with this interface, >>>> and polled submission as well. The latter will enable an application >>>> to do IO without doing ANY system calls at all. >>>> >>>> For IRQ driven IO, an application only needs to enter the kernel for >>>> completions if it wants to wait for them to occur. >>>> >>>> Each io_uring is backed by a workqueue, to support buffered async IO >>>> as well. We will only punt to an async context if the command would >>>> need to wait for IO on the device side. Any data that can be accessed >>>> directly in the page cache is done inline. This avoids the slowness >>>> issue of usual threadpools, since cached data is accessed as quickly >>>> as a sync interface. >>>> >>>> Sample application: http://git.kernel.dk/cgit/fio/plain/t/io_uring.c >>> [...] >>>> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe, >>>> + bool force_nonblock) >>>> +{ >>>> + struct kiocb *kiocb = &req->rw; >>>> + int ret; >>>> + >>>> + kiocb->ki_filp = fget(sqe->fd); >>>> + if (unlikely(!kiocb->ki_filp)) >>>> + return -EBADF; >>>> + kiocb->ki_pos = sqe->off; >>>> + kiocb->ki_flags = iocb_flags(kiocb->ki_filp); >>>> + kiocb->ki_hint = ki_hint_validate(file_write_hint(kiocb->ki_filp)); >>>> + if (sqe->ioprio) { >>>> + ret = ioprio_check_cap(sqe->ioprio); >>>> + if (ret) >>>> + goto out_fput; >>>> + >>>> + kiocb->ki_ioprio = sqe->ioprio; >>>> + } else >>>> + kiocb->ki_ioprio = get_current_ioprio(); >>>> + >>>> + ret = kiocb_set_rw_flags(kiocb, sqe->rw_flags); >>>> + if (unlikely(ret)) >>>> + goto out_fput; >>>> + if (force_nonblock) { >>>> + kiocb->ki_flags |= IOCB_NOWAIT; >>>> + req->flags |= REQ_F_FORCE_NONBLOCK; >>>> + } >>>> + if (kiocb->ki_flags & IOCB_HIPRI) { >>>> + ret = -EINVAL; >>>> + goto out_fput; >>>> + } >>>> + >>>> + kiocb->ki_complete = io_complete_rw; >>>> + return 0; >>>> +out_fput: >>>> + fput(kiocb->ki_filp); >>>> + return ret; >>>> +} >>> [...] >>>> +static ssize_t io_read(struct io_kiocb *req, const struct io_uring_sqe *sqe, >>>> + bool force_nonblock) >>>> +{ >>>> + struct iovec inline_vecs[UIO_FASTIOV], *iovec = inline_vecs; >>>> + struct kiocb *kiocb = &req->rw; >>>> + struct iov_iter iter; >>>> + struct file *file; >>>> + ssize_t ret; >>>> + >>>> + ret = io_prep_rw(req, sqe, force_nonblock); >>>> + if (ret) >>>> + return ret; >>>> + file = kiocb->ki_filp; >>>> + >>>> + ret = -EBADF; >>>> + if (unlikely(!(file->f_mode & FMODE_READ))) >>>> + goto out_fput; >>>> + ret = -EINVAL; >>>> + if (unlikely(!file->f_op->read_iter)) >>>> + goto out_fput; >>>> + >>>> + ret = io_import_iovec(req->ctx, READ, sqe, &iovec, &iter); >>>> + if (ret) >>>> + goto out_fput; >>>> + >>>> + ret = rw_verify_area(READ, file, &kiocb->ki_pos, iov_iter_count(&iter)); >>>> + if (!ret) { >>>> + ssize_t ret2; >>>> + >>>> + /* Catch -EAGAIN return for forced non-blocking submission */ >>>> + ret2 = call_read_iter(file, kiocb, &iter); >>>> + if (!force_nonblock || ret2 != -EAGAIN) >>>> + io_rw_done(kiocb, ret2); >>>> + else >>>> + ret = -EAGAIN; >>>> + } >>>> + kfree(iovec); >>>> +out_fput: >>>> + if (unlikely(ret)) >>>> + fput(file); >>>> + return ret; >>>> +} >>> [...] >>>> +static int __io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req, >>>> + struct sqe_submit *s, bool force_nonblock) >>>> +{ >>>> + const struct io_uring_sqe *sqe = s->sqe; >>>> + ssize_t ret; >>>> + >>>> + if (unlikely(s->index >= ctx->sq_entries)) >>>> + return -EINVAL; >>>> + req->user_data = sqe->user_data; >>>> + >>>> + ret = -EINVAL; >>>> + switch (sqe->opcode) { >>>> + case IORING_OP_NOP: >>>> + ret = io_nop(req, sqe); >>>> + break; >>>> + case IORING_OP_READV: >>>> + ret = io_read(req, sqe, force_nonblock); >>>> + break; >>>> + case IORING_OP_WRITEV: >>>> + ret = io_write(req, sqe, force_nonblock); >>>> + break; >>>> + default: >>>> + ret = -EINVAL; >>>> + break; >>>> + } >>>> + >>>> + return ret; >>>> +} >>>> + >>>> +static void io_sq_wq_submit_work(struct work_struct *work) >>>> +{ >>>> + struct io_kiocb *req = container_of(work, struct io_kiocb, work); >>>> + struct sqe_submit *s = &req->submit; >>>> + u64 user_data = s->sqe->user_data; >>>> + struct io_ring_ctx *ctx = req->ctx; >>>> + mm_segment_t old_fs = get_fs(); >>>> + struct files_struct *old_files; >>>> + int ret; >>>> + >>>> + /* Ensure we clear previously set forced non-block flag */ >>>> + req->flags &= ~REQ_F_FORCE_NONBLOCK; >>>> + >>>> + old_files = current->files; >>>> + current->files = ctx->sqo_files; >>> >>> I think you're not supposed to twiddle with current->files without >>> holding task_lock(current). >> >> 'current' is the work queue item in this case, do we need to protect >> against anything else? I can add the locking around the assignments >> (both places). > > Stuff like proc_fd_link() uses get_files_struct(), which grabs a > reference to your current files_struct protected only by task_lock(); > and it doesn't use anything like READ_ONCE(), so even if the object > lifetime is not a problem, get_files_struct() could potentially crash > due to a double-read (reading task->files twice and assuming that the > result will be the same). As far as I can tell, this procfs code also > works on kernel threads. OK, that does make sense. I've added the locking. >>>> + if (!mmget_not_zero(ctx->sqo_mm)) { >>>> + ret = -EFAULT; >>>> + goto err; >>>> + } >>>> + >>>> + use_mm(ctx->sqo_mm); >>>> + set_fs(USER_DS); >>>> + >>>> + ret = __io_submit_sqe(ctx, req, s, false); >>>> + >>>> + set_fs(old_fs); >>>> + unuse_mm(ctx->sqo_mm); >>>> + mmput(ctx->sqo_mm); >>>> +err: >>>> + if (ret) { >>>> + io_cqring_add_event(ctx, user_data, ret, 0); >>>> + io_free_req(req); >>>> + } >>>> + current->files = old_files; >>>> +} >>> [...] >>>> +static int io_sq_offload_start(struct io_ring_ctx *ctx) >>>> +{ >>>> + int ret; >>>> + >>>> + ctx->sqo_mm = current->mm; >>> >>> What keeps this thing alive? >> >> I think we're deadling with the same thing as the files below, I'll >> defer to that. >> >>>> + /* >>>> + * This is safe since 'current' has the fd installed, and if that gets >>>> + * closed on exit, then fops->release() is invoked which waits for the >>>> + * async contexts to flush and exit before exiting. >>>> + */ >>>> + ret = -EBADF; >>>> + ctx->sqo_files = current->files; >>>> + if (!ctx->sqo_files) >>>> + goto err; >>> >>> That's gnarly. Adding Al Viro to the thread. >>> >>> I think you misunderstand the semantics of f_op->release. The ->flush >>> handler is invoked whenever a file descriptor is closed through >>> filp_close() (via deletion of the files_struct, sys_close(), >>> sys_dup2(), ...), so if you had used that one, _maybe_ this would >>> work. But the ->release handler only runs when the _last_ reference to >>> a struct file has been dropped - so you can, for example, fork() a >>> child, then exit() in the parent, and the ->release handler isn't >>> invoked. So I don't see how this can work. >> >> The anonfd is CLOEXEC. The idea is exactly that it only runs when the >> last reference to the file has been dropped. Not sure why you think I >> need ->flush() here? > > Can't I just use fcntl(fd, F_SETFD, fd, 0) to clear the CLOEXEC flag? > Or send the fd via SCM_RIGHTS? That would obviously be a problem... >>> But even if you had abused ->flush for this instead: close_files() >>> currently has a comment in it that claims that "this is the last >>> reference to the files structure"; this change would make that claim >>> untrue. >> >> Let me see if I can explain my intent better than that comment... We >> know the parent who set up the io_uring instance will be around for as >> long as io_uring instance persists. > > That's the part that I think is wrong: As far as I can tell, the > parent can go away and you won't notice. If that's the case, then the mm/files referencing needs to be looked over for sure. It's currently relying on the fact that the parent stays alive. If it can go away without ->release() being called, then we have issues. > Also, note that "the parent" is different things for ->files and ->mm. > You can have a multithreaded process whose threads don't have the same > ->files, or multiple process that share ->files without sharing ->mm, > ... Of course, I do realize that. >> When we are tearing down the >> io_uring, then we wait for any async contexts (like the one above) to >> exit. Once they are exited, it's safe to proceed with the exit and >> teardown ->files[]. > > But you only do that teardown on ->release, right? And ->release > doesn't have much to do with the process lifetime. Yes, only on ->relase(). -- Jens Axboe