[please make sure linux-api and linux-man are CCed on new syscalls
so that we get API experts to review them]

> io_uring_enter(fd, to_submit, min_complete, flags)
> 	Initiates IO against the rings mapped to this fd, or waits for
> 	them to complete, or both. The behavior is controlled by the
> 	parameters passed in. If 'to_submit' is non-zero, then we'll
> 	try and submit new IO. If IORING_ENTER_GETEVENTS is set, the
> 	kernel will wait for 'min_complete' events, if they aren't
> 	already available. It's valid to set IORING_ENTER_GETEVENTS
> 	and 'min_complete' == 0 at the same time, this allows the
> 	kernel to return already completed events without waiting
> 	for them. This is useful only for polling, as for IRQ
> 	driven IO, the application can just check the CQ ring
> 	without entering the kernel.

Especially with poll support now in the series, don't we need a
sigmask argument similar to pselect/ppoll/io_pgetevents to deal
with signal blocking while waiting for events? (A rough sketch of
what I mean is at the end of this mail.)

> +struct sqe_submit {
> +	const struct io_uring_sqe *sqe;
> +	unsigned index;
> +};

Can you make sure all the structs use tab indentation for their
field names? Maybe even the same for all structs just to be nice
to my eyes?

> +static int io_import_iovec(struct io_ring_ctx *ctx, int rw,
> +			   const struct io_uring_sqe *sqe,
> +			   struct iovec **iovec, struct iov_iter *iter)
> +{
> +	void __user *buf = u64_to_user_ptr(sqe->addr);
> +
> +#ifdef CONFIG_COMPAT
> +	if (ctx->compat)
> +		return compat_import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
> +						iovec, iter);
> +#endif

I think we can just check in_compat_syscall() here, which means we
can kill the ->compat member, and the separate compat version of
the setup syscall. (Again, sketch at the end of this mail.)

> +/*
> + * IORING_OP_NOP just posts a completion event, nothing else.
> + */
> +static int io_nop(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> +{
> +	struct io_ring_ctx *ctx = req->ctx;
> +
> +	__io_cqring_add_event(ctx, sqe->user_data, 0, 0);

Can you explain why not taking the completion lock is safe here?
And why do we want to have such a somewhat dangerous special case
just for the no-op benchmarking aid?

> +static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
> +{
> +	struct io_sq_ring *ring = ctx->sq_ring;
> +	unsigned head;
> +
> +	head = ctx->cached_sq_head;
> +	smp_rmb();
> +	if (head == READ_ONCE(ring->r.tail))
> +		return false;

Do we really need to optimize the sq_head == tail case so much? Or
am I missing why we are using the cached sq head here? Maybe add
some more comments for a start; I've appended a sketch of the level
of commenting I'd like to see.

> +static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
> +			    unsigned min_complete, unsigned flags)
> +{
> +	int ret = 0;
> +
> +	if (to_submit) {
> +		ret = io_ring_submit(ctx, to_submit);
> +		if (ret < 0)
> +			return ret;
> +	}
> +	if (flags & IORING_ENTER_GETEVENTS) {
> +		int get_ret;
> +
> +		if (!ret && to_submit)
> +			min_complete = 0;

Why do we have this special case? Does it need some documentation?

> +
> +		get_ret = io_cqring_wait(ctx, min_complete);
> +		if (get_ret < 0 && !ret)
> +			ret = get_ret;
> +	}
> +
> +	return ret;

Maybe using different names and slightly different semantics for the
return values would clear some of this up?

	if (to_submit) {
		submitted = io_ring_submit(ctx, to_submit);
		if (submitted < 0)
			return submitted;
	}
	if (flags & IORING_ENTER_GETEVENTS) {
		...
		ret = io_cqring_wait(ctx, min_complete);
	}

	return submitted ? submitted : ret;
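To flesh that out a bit (entirely untested, and it keeps your
min_complete special case from above, which still wants an
explanation either way):

	static int __io_uring_enter(struct io_ring_ctx *ctx, unsigned to_submit,
				    unsigned min_complete, unsigned flags)
	{
		int submitted = 0, ret = 0;

		if (to_submit) {
			submitted = io_ring_submit(ctx, to_submit);
			if (submitted < 0)
				return submitted;
		}
		if (flags & IORING_ENTER_GETEVENTS) {
			/*
			 * XXX: why don't we wait when we were asked to
			 * submit but nothing made it into the ring?
			 */
			if (to_submit && !submitted)
				min_complete = 0;
			ret = io_cqring_wait(ctx, min_complete);
		}

		return submitted ? submitted : ret;
	}

That way the submission count and the wait result never get mixed
up in the same variable.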
> +static int io_sq_offload_start(struct io_ring_ctx *ctx)

> +static void io_sq_offload_stop(struct io_ring_ctx *ctx)

Can we just merge these two functions into the callers? Currently
the flow is a little odd with these helpers that don't seem to be
too clear about their responsibilities.

> +static void io_free_scq_urings(struct io_ring_ctx *ctx)
> +{
> +	if (ctx->sq_ring) {
> +		page_frag_free(ctx->sq_ring);
> +		ctx->sq_ring = NULL;
> +	}
> +	if (ctx->sq_sqes) {
> +		page_frag_free(ctx->sq_sqes);
> +		ctx->sq_sqes = NULL;
> +	}
> +	if (ctx->cq_ring) {
> +		page_frag_free(ctx->cq_ring);
> +		ctx->cq_ring = NULL;
> +	}

Why is this using the page_frag helpers? Also the callers just free
the ctx structure afterwards, so there isn't much of a point in
zeroing these fields out. I'd also be tempted to open code the
freeing in io_allocate_scq_urings instead of calling the helper,
which would avoid the NULL checks and make the error handling code
a little more obvious.

> +	if (mutex_trylock(&ctx->uring_lock)) {
> +		ret = __io_uring_enter(ctx, to_submit, min_complete, flags);

Do we even need the separate __io_uring_enter helper?

> +static void io_fill_offsets(struct io_uring_params *p)

Do we really need this as a separate helper?
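Now for the promised sketches. First the sigmask handling: the
syscall would grow a sigset pointer and size argument, following
the io_pgetevents calling convention, and the wait side would use
the set_user_sigmask/restore_user_sigmask helpers we already have
for ppoll and friends. Rough sketch only; ctx->wait and
io_cqring_events() are stand-ins for however the series actually
waits for completions:

	static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
				  const sigset_t __user *sig, size_t sigsz)
	{
		sigset_t ksigmask, sigsaved;
		int ret;

		if (sig) {
			ret = set_user_sigmask(sig, &ksigmask, &sigsaved,
						sigsz);
			if (ret)
				return ret;
		}

		/* the existing wait for min_events completions */
		ret = wait_event_interruptible(ctx->wait,
				io_cqring_events(ctx) >= min_events);

		if (sig)
			restore_user_sigmask(sig, &sigsaved);

		return ret == -ERESTARTSYS ? -EINTR : ret;
	}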
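For the in_compat_syscall() change, something like the below; the
->compat member, the compat setup syscall and the now-unused ctx
argument can all go away. (I'm assuming the native path below the
#endif is a plain import_iovec() call, which the quoted hunk cuts
off.)

	static int io_import_iovec(int rw, const struct io_uring_sqe *sqe,
				   struct iovec **iovec, struct iov_iter *iter)
	{
		void __user *buf = u64_to_user_ptr(sqe->addr);

	#ifdef CONFIG_COMPAT
		/* checks the calling task, no per-ring state needed */
		if (in_compat_syscall())
			return compat_import_iovec(rw, buf, sqe->len,
					UIO_FASTIOV, iovec, iter);
	#endif

		return import_iovec(rw, buf, sqe->len, UIO_FASTIOV,
					iovec, iter);
	}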
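And finally io_get_sqring: this is roughly the level of commenting
I was hoping for. The wording reflects my reading of the code, so
please correct it where I guessed wrong:

	static bool io_get_sqring(struct io_ring_ctx *ctx, struct sqe_submit *s)
	{
		struct io_sq_ring *ring = ctx->sq_ring;
		unsigned head;

		/*
		 * cached_sq_head is our private copy of how far we have
		 * consumed the SQ ring; only the application moves r.tail.
		 * If the cached head hasn't caught up with the tail we
		 * know there is work to do without touching shared memory.
		 */
		head = ctx->cached_sq_head;

		/* XXX: document what this barrier pairs with in userspace */
		smp_rmb();
		if (head == READ_ONCE(ring->r.tail))
			return false;

		/* ... rest as before ... */
	}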