On 11/6/19 12:12 PM, Pavel Begunkov wrote: > On 06/11/2019 19:21, Jens Axboe wrote: >> Currently we drop completion events, if the CQ ring is full. That's fine >> for requests with bounded completion times, but it may make it harder to >> use io_uring with networked IO where request completion times are >> generally unbounded. Or with POLL, for example, which is also unbounded. >> >> This patch adds IORING_SETUP_CQ_NODROP, which changes the behavior a bit >> for CQ ring overflows. First of all, it doesn't overflow the ring, it >> simply stores backlog of completions that we weren't able to put into >> the CQ ring. To prevent the backlog from growing indefinitely, if the >> backlog is non-empty, we apply back pressure on IO submissions. Any >> attempt to submit new IO with a non-empty backlog will get an -EBUSY >> return from the kernel. >> >> I think that makes for a pretty sane API in terms of how the application >> can handle it. With CQ_NODROP enabled, we'll never drop a completion >> event (well unless we're totally out of memory...), but we'll also not >> allow submissions with a completion backlog. >> >> --- >> >> diff --git a/fs/io_uring.c b/fs/io_uring.c >> index b647cdf0312c..12e9fe2479f4 100644 >> --- a/fs/io_uring.c >> +++ b/fs/io_uring.c >> @@ -207,6 +207,7 @@ struct io_ring_ctx { >> >> struct list_head defer_list; >> struct list_head timeout_list; >> + struct list_head cq_overflow_list; >> >> wait_queue_head_t inflight_wait; >> } ____cacheline_aligned_in_smp; >> @@ -414,6 +415,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) >> >> ctx->flags = p->flags; >> init_waitqueue_head(&ctx->cq_wait); >> + INIT_LIST_HEAD(&ctx->cq_overflow_list); >> init_completion(&ctx->ctx_done); >> init_completion(&ctx->sqo_thread_started); >> mutex_init(&ctx->uring_lock); >> @@ -588,6 +590,77 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx) >> return &rings->cqes[tail & ctx->cq_mask]; >> } >> >> +static void io_cqring_ev_posted(struct io_ring_ctx *ctx) >> +{ >> + if (waitqueue_active(&ctx->wait)) >> + wake_up(&ctx->wait); >> + if (waitqueue_active(&ctx->sqo_wait)) >> + wake_up(&ctx->sqo_wait); >> + if (ctx->cq_ev_fd) >> + eventfd_signal(ctx->cq_ev_fd, 1); >> +} >> + >> +struct cqe_drop { >> + struct list_head list; >> + u64 user_data; >> + s32 res; >> +}; > > How about to use io_kiocb instead of new structure? > It already has valid req->user_data and occasionaly used > req->result. But this would probably take more work to do. I did consider that, we could use both those fields. But we can't just take ownership of the io_kiocb at this point, so it's much easier (and less error prone) to just stuff in this new structure. The downside is of course that we could still have an overflow, but if an atomic allocation of this size fails, then the system is hosed anyway. This is also a somewhat more slow path. We don't really expect to have constant overflows, but we need to be able to handle them. Otherwise CQ ring sizing must be worst case, and maybe that isn't even enough. This allows use cases that we could not support before. >> @@ -3036,8 +3108,10 @@ static int io_sq_thread(void *data) >> } >> >> to_submit = min(to_submit, ctx->sq_entries); >> - inflight += io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, >> - true); >> + ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true); >> + if (ret < 0) >> + continue; >> + inflight += ret; >> > > After rebase could be simplified to > if (ret >= 0) > inflight += ret; Heh, that's exactly that I did: https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=f491ca28c78b7ff2dcd288847a212bb49f256500 Thanks for the review! -- Jens Axboe