On 11/6/19 12:51 PM, Jann Horn wrote:
> On Wed, Nov 6, 2019 at 5:23 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>> Currently we drop completion events if the CQ ring is full. That's fine
>> for requests with bounded completion times, but it may make it harder to
>> use io_uring with networked IO where request completion times are
>> generally unbounded. Or with POLL, for example, which is also unbounded.
>>
>> This patch adds IORING_SETUP_CQ_NODROP, which changes the behavior a bit
>> for CQ ring overflows. First of all, it doesn't overflow the ring, it
>> simply stores a backlog of completions that we weren't able to put into
>> the CQ ring. To prevent the backlog from growing indefinitely, if the
>> backlog is non-empty, we apply back pressure on IO submissions. Any
>> attempt to submit new IO with a non-empty backlog will get an -EBUSY
>> return from the kernel.
>>
>> I think that makes for a pretty sane API in terms of how the application
>> can handle it. With CQ_NODROP enabled, we'll never drop a completion
>> event (well, unless we're totally out of memory...), but we'll also not
>> allow submissions with a completion backlog.
> [...]
>> +static void io_cqring_overflow(struct io_ring_ctx *ctx, u64 ki_user_data,
>> +			       long res)
>> +	__must_hold(&ctx->completion_lock)
>> +{
>> +	struct cqe_drop *drop;
>> +
>> +	if (!(ctx->flags & IORING_SETUP_CQ_NODROP)) {
>> +log_overflow:
>> +		WRITE_ONCE(ctx->rings->cq_overflow,
>> +			   atomic_inc_return(&ctx->cached_cq_overflow));
>> +		return;
>> +	}
>> +
>> +	drop = kmalloc(sizeof(*drop), GFP_ATOMIC);
>> +	if (!drop)
>> +		goto log_overflow;
>> +
>> +	drop->user_data = ki_user_data;
>> +	drop->res = res;
>> +	list_add_tail(&drop->list, &ctx->cq_overflow_list);
>> +}
>
> This could potentially consume moderately large amounts of atomic
> memory quickly and without any guarantee that the memory will be freed
> anytime soon, right? That seems moderately bad. Is there no way to
> e.g. pre-reserve memory for completion events, or something like that?

As soon as there's even one entry in that backlog, the ring won't accept
any more new IO. So I don't think it's a huge concern. If we pre-reserve,
we haven't really made much progress in making sure we don't drop
events, and we'll be tying up that memory all the time.

The alternative, as Pavel also mentioned, is to re-use the io_kiocb for
this. But that'll tie up more memory, and it's a bit tricky with the
lifetimes. Just because the request has completed doesn't mean that
someone isn't still holding a reference to it, and who knows what they
will do.

--
Jens Axboe
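
[Editor's note: for readers wondering what the -EBUSY back pressure looks
like from userspace, here is a minimal sketch using liburing. It assumes
the IORING_SETUP_CQ_NODROP flag and the -EBUSY contract land exactly as
proposed in this patch (neither is guaranteed to be merged in this form);
the submit_with_backpressure() helper is purely illustrative, while the
liburing calls themselves are standard.]

/*
 * Hypothetical sketch: when submission returns -EBUSY under
 * IORING_SETUP_CQ_NODROP, the kernel has a completion backlog. The
 * application's response is to reap CQEs to free ring space, then
 * retry the submit; the retried io_uring_enter() can then flush the
 * backlog into the CQ ring.
 */
#include <errno.h>
#include <stdio.h>
#include <liburing.h>

static int submit_with_backpressure(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;
	int ret;

	for (;;) {
		ret = io_uring_submit(ring);
		if (ret != -EBUSY)
			return ret;	/* submitted count, or other error */

		/* Drain whatever is in the CQ ring to make room. */
		while (io_uring_peek_cqe(ring, &cqe) == 0) {
			/* ... hand cqe->user_data / cqe->res to the app ... */
			io_uring_cqe_seen(ring, cqe);
		}
	}
}

int main(void)
{
	struct io_uring_params p = { .flags = IORING_SETUP_CQ_NODROP };
	struct io_uring ring;
	int ret;

	ret = io_uring_queue_init_params(8, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/* ... prep SQEs, then call submit_with_backpressure(&ring) ... */

	io_uring_queue_exit(&ring);
	return 0;
}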