On 11/6/19 2:54 PM, Pavel Begunkov wrote:
> On 07/11/2019 00:31, Jens Axboe wrote:
>> On 11/6/19 1:08 PM, Jens Axboe wrote:
>>> On 11/6/19 12:51 PM, Jann Horn wrote:
>>>> On Wed, Nov 6, 2019 at 5:23 PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>>>> Currently we drop completion events if the CQ ring is full. That's
>>>>> fine for requests with bounded completion times, but it may make it
>>>>> harder to use io_uring with networked IO, where request completion
>>>>> times are generally unbounded. Or with POLL, for example, which is
>>>>> also unbounded.
>>>>>
>>>>> This patch adds IORING_SETUP_CQ_NODROP, which changes the behavior a
>>>>> bit for CQ ring overflows. First of all, it doesn't overflow the
>>>>> ring, it simply stores a backlog of completions that we weren't able
>>>>> to put into the CQ ring. To prevent the backlog from growing
>>>>> indefinitely, if the backlog is non-empty, we apply back pressure on
>>>>> IO submissions. Any attempt to submit new IO with a non-empty
>>>>> backlog will get an -EBUSY return from the kernel.
>>>>>
>>>>> I think that makes for a pretty sane API in terms of how the
>>>>> application can handle it. With CQ_NODROP enabled, we'll never drop
>>>>> a completion event (well, unless we're totally out of memory...),
>>>>> but we'll also not allow submissions with a completion backlog.
>>>> [...]
>>>>> +static void io_cqring_overflow(struct io_ring_ctx *ctx, u64 ki_user_data,
>>>>> +                               long res)
>>>>> +        __must_hold(&ctx->completion_lock)
>>>>> +{
>>>>> +        struct cqe_drop *drop;
>>>>> +
>>>>> +        if (!(ctx->flags & IORING_SETUP_CQ_NODROP)) {
>>>>> +log_overflow:
>>>>> +                WRITE_ONCE(ctx->rings->cq_overflow,
>>>>> +                                atomic_inc_return(&ctx->cached_cq_overflow));
>>>>> +                return;
>>>>> +        }
>>>>> +
>>>>> +        drop = kmalloc(sizeof(*drop), GFP_ATOMIC);
>>>>> +        if (!drop)
>>>>> +                goto log_overflow;
>>>>> +
>>>>> +        drop->user_data = ki_user_data;
>>>>> +        drop->res = res;
>>>>> +        list_add_tail(&drop->list, &ctx->cq_overflow_list);
>>>>> +}
>>>>
>>>> This could potentially consume moderately large amounts of atomic
>>>> memory quickly and without any guarantee that the memory will be
>>>> freed anytime soon, right? That seems moderately bad. Is there no
>>>> way to e.g. pre-reserve memory for completion events, or something
>>>> like that?
>>>
>>> As soon as there's even one entry in that backlog, the ring won't
>>> accept any more new IO. So I don't think it's a huge concern. If we
>>> pre-reserve, we haven't really made much progress in making sure we
>>> don't drop events, and we'll be tying up that memory all the time.
>>>
>>> The alternative, as Pavel also mentioned, is to re-use the io_kiocb
>>> for this. But that'll tie up more memory, and it's a bit tricky with
>>> the lifetimes. Just because the request has completed doesn't mean
>>> that someone isn't still holding a reference to it, and who knows
>>> what they will do.
>>
>> OK, I took a stab at it, here's a brain dump of the "complications":
>>
>> 1) Some places now use __io_free_req() to drop both references, if we
>>    know we haven't issued a request yet. Needs double drop, not a big
>>    deal.
>> 2) Some ordering changes between io_put_req() and the fill/add event
>>    logic. Again not a huge deal, easy to spot.
>> 3) We have one failure case that does not have a request, exactly
>>    because we failed to allocate one. Don't look at that part in the
>>    below patch; I think what we should do here is just reserve a
>>    request for that case. It won't help with the submission, but
>>    it'll get it logged correctly for the overflow backlog.
>>    Any new submission can't proceed with that request in the overflow
>>    backlog anyway, so we need just the one. Not super pretty, but at
>>    least we can keep this out of the fast path, as the only one that
>>    will free this request is the overflow flush path.
>>
>
> 2 (maybe partially) and 3 will hopefully be solved by the patchset
> removing passing sqe_submit. I'll resend it in a minute.

Please do, it'll definitely make a few things easier. Then I'll base
the cleanup/prep patch on top of that, and then the backpressure
patch.

-- 
Jens Axboe
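
[Editorial sketch, not part of the series: one way an application could
react to the -EBUSY backpressure described at the top of the thread,
assuming a ring created with the IORING_SETUP_CQ_NODROP flag from this
patch and liburing-style helpers. submit_with_backpressure() is a
hypothetical wrapper; the point where the kernel flushes its backlog
depends on the final patch.]

#include <liburing.h>

/*
 * Sketch: -EBUSY from submit means the kernel is holding a CQ backlog,
 * so reap completions to make CQ ring space before retrying.
 */
static int submit_with_backpressure(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        int ret;

        for (;;) {
                ret = io_uring_submit(ring);
                if (ret != -EBUSY)
                        return ret;

                /* Drain at least one completion so the kernel can flush
                 * its overflow backlog when we re-enter it. */
                ret = io_uring_wait_cqe(ring, &cqe);
                if (ret < 0)
                        return ret;
                /* ... handle cqe->user_data / cqe->res here ... */
                io_uring_cqe_seen(ring, cqe);
        }
}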
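
[Editorial sketch of the "overflow flush path" mentioned above, again
not the code from the series: it reuses the names from the quoted hunk
(struct cqe_drop, ctx->cq_overflow_list) and the existing fs/io_uring.c
helpers io_get_cqring(), io_commit_cqring() and io_cqring_ev_posted();
the actual patch may structure this differently.]

/*
 * Sketch: with space available in the CQ ring again, move backlogged
 * completions from cq_overflow_list into real CQEs and free the
 * temporary entries.
 */
static void io_cqring_overflow_flush(struct io_ring_ctx *ctx)
{
        struct cqe_drop *drop, *tmp;
        struct io_uring_cqe *cqe;
        unsigned long flags;

        spin_lock_irqsave(&ctx->completion_lock, flags);
        list_for_each_entry_safe(drop, tmp, &ctx->cq_overflow_list, list) {
                cqe = io_get_cqring(ctx);
                if (!cqe)
                        break;  /* CQ ring full again, keep the rest queued */
                WRITE_ONCE(cqe->user_data, drop->user_data);
                WRITE_ONCE(cqe->res, drop->res);
                WRITE_ONCE(cqe->flags, 0);
                list_del(&drop->list);
                kfree(drop);
        }
        io_commit_cqring(ctx);
        spin_unlock_irqrestore(&ctx->completion_lock, flags);
        io_cqring_ev_posted(ctx);
}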