On 11/9/19 5:33 AM, Pavel Begunkov wrote:
> On 11/9/2019 3:25 PM, Pavel Begunkov wrote:
>> On 11/7/2019 7:00 PM, Jens Axboe wrote:
>>> Currently we drop completion events, if the CQ ring is full. That's fine
>>> for requests with bounded completion times, but it may make it harder to
>>> use io_uring with networked IO where request completion times are
>>> generally unbounded. Or with POLL, for example, which is also unbounded.
>>>
>>> This patch adds IORING_SETUP_CQ_NODROP, which changes the behavior a bit
>>> for CQ ring overflows. First of all, it doesn't overflow the ring, it
>>> simply stores a backlog of completions that we weren't able to put into
>>> the CQ ring. To prevent the backlog from growing indefinitely, if the
>>> backlog is non-empty, we apply back pressure on IO submissions. Any
>>> attempt to submit new IO with a non-empty backlog will get an -EBUSY
>>> return from the kernel. This is a signal to the application that it has
>>> backlogged CQ events, and that it must reap those before being allowed
>>> to submit more IO.
>>>
>>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>> ---
>>>  fs/io_uring.c                 | 103 ++++++++++++++++++++++++++++------
>>>  include/uapi/linux/io_uring.h |   1 +
>>>  2 files changed, 87 insertions(+), 17 deletions(-)
>>>
>>> +static void io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool force)
>>> +{
>>> +	struct io_rings *rings = ctx->rings;
>>> +	struct io_uring_cqe *cqe;
>>> +	struct io_kiocb *req;
>>> +	unsigned long flags;
>>> +	LIST_HEAD(list);
>>> +
>>> +	if (list_empty_careful(&ctx->cq_overflow_list))
>>> +		return;
>>> +	if (ctx->cached_cq_tail - READ_ONCE(rings->cq.head) ==
>>> +	    rings->cq_ring_entries)
>>> +		return;
>>> +
>>> +	spin_lock_irqsave(&ctx->completion_lock, flags);
>>> +
>>> +	while (!list_empty(&ctx->cq_overflow_list)) {
>>> +		cqe = io_get_cqring(ctx);
>>> +		if (!cqe && !force)
>>> +			break;
>>> +
>>> +		req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb,
>>> +						list);
>>> +		list_move(&req->list, &list);
>>> +		if (cqe) {
>>> +			WRITE_ONCE(cqe->user_data, req->user_data);
>>> +			WRITE_ONCE(cqe->res, req->result);
>>> +			WRITE_ONCE(cqe->flags, 0);
>>> +		}
>>
>> Hmm, second thought. We should account overflow here.
>>
> Clarification: We should account overflow in case of (!cqe).
>
> i.e.
>
> if (!cqe) { // else
> 	WRITE_ONCE(ctx->rings->cq_overflow,
> 			atomic_inc_return(&ctx->cached_cq_overflow));
> }

Ah yes, we should, even if this is only the flush path. I'll send out a
patch for that, unless you beat me to it.

-- 
Jens Axboe
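
For reference, the flush loop with the suggested accounting folded in would
look roughly like the following. This is only a sketch assembled from the two
snippets quoted above (the patch's loop plus Pavel's if (!cqe) branch), not
the final committed code:

	while (!list_empty(&ctx->cq_overflow_list)) {
		cqe = io_get_cqring(ctx);
		if (!cqe && !force)
			break;

		req = list_first_entry(&ctx->cq_overflow_list, struct io_kiocb,
						list);
		list_move(&req->list, &list);
		if (cqe) {
			/* CQ slot available: post the backlogged completion */
			WRITE_ONCE(cqe->user_data, req->user_data);
			WRITE_ONCE(cqe->res, req->result);
			WRITE_ONCE(cqe->flags, 0);
		} else {
			/* forced flush with no slot: count the drop */
			WRITE_ONCE(ctx->rings->cq_overflow,
					atomic_inc_return(&ctx->cached_cq_overflow));
		}
	}

This way a forced flush that cannot find a CQ slot still bumps cq_overflow,
so userspace can observe that events were lost.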
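
On the userspace side, the back-pressure contract described in the commit
message amounts to: if submission returns -EBUSY, reap completions and retry.
A minimal sketch, assuming liburing's io_uring_submit()/io_uring_peek_cqe()/
io_uring_cqe_seen() helpers and a ring set up with the IORING_SETUP_CQ_NODROP
flag from this patch; submit_with_backpressure() is a hypothetical wrapper,
not a liburing API:

	#include <errno.h>
	#include <liburing.h>

	static int submit_with_backpressure(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;
		int ret;

		for (;;) {
			ret = io_uring_submit(ring);
			if (ret != -EBUSY)
				return ret;

			/*
			 * -EBUSY means the kernel has a CQ overflow backlog:
			 * drain pending CQEs, then retry the submit.
			 */
			while (io_uring_peek_cqe(ring, &cqe) == 0) {
				/* process cqe->user_data / cqe->res here */
				io_uring_cqe_seen(ring, cqe);
			}
		}
	}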