Re: [RFC] io_uring CQ ring backpressure

Jens Axboe <axboe@xxxxxxxxx> · Wed, 6 Nov 2019 12:43:24 -0700

On 11/6/19 12:12 PM, Pavel Begunkov wrote:
> On 06/11/2019 19:21, Jens Axboe wrote:
>> Currently we drop completion events, if the CQ ring is full. That's fine
>> for requests with bounded completion times, but it may make it harder to
>> use io_uring with networked IO where request completion times are
>> generally unbounded. Or with POLL, for example, which is also unbounded.
>>
>> This patch adds IORING_SETUP_CQ_NODROP, which changes the behavior a bit
>> for CQ ring overflows. First of all, it doesn't overflow the ring, it
>> simply stores backlog of completions that we weren't able to put into
>> the CQ ring. To prevent the backlog from growing indefinitely, if the
>> backlog is non-empty, we apply back pressure on IO submissions. Any
>> attempt to submit new IO with a non-empty backlog will get an -EBUSY
>> return from the kernel.
>>
>> I think that makes for a pretty sane API in terms of how the application
>> can handle it. With CQ_NODROP enabled, we'll never drop a completion
>> event (well unless we're totally out of memory...), but we'll also not
>> allow submissions with a completion backlog.
>>
>> ---
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index b647cdf0312c..12e9fe2479f4 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -207,6 +207,7 @@ struct io_ring_ctx {
>>   
>>   		struct list_head	defer_list;
>>   		struct list_head	timeout_list;
>> +		struct list_head	cq_overflow_list;
>>   
>>   		wait_queue_head_t	inflight_wait;
>>   	} ____cacheline_aligned_in_smp;
>> @@ -414,6 +415,7 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>>   
>>   	ctx->flags = p->flags;
>>   	init_waitqueue_head(&ctx->cq_wait);
>> +	INIT_LIST_HEAD(&ctx->cq_overflow_list);
>>   	init_completion(&ctx->ctx_done);
>>   	init_completion(&ctx->sqo_thread_started);
>>   	mutex_init(&ctx->uring_lock);
>> @@ -588,6 +590,77 @@ static struct io_uring_cqe *io_get_cqring(struct io_ring_ctx *ctx)
>>   	return &rings->cqes[tail & ctx->cq_mask];
>>   }
>>   
>> +static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
>> +{
>> +	if (waitqueue_active(&ctx->wait))
>> +		wake_up(&ctx->wait);
>> +	if (waitqueue_active(&ctx->sqo_wait))
>> +		wake_up(&ctx->sqo_wait);
>> +	if (ctx->cq_ev_fd)
>> +		eventfd_signal(ctx->cq_ev_fd, 1);
>> +}
>> +
>> +struct cqe_drop {
>> +	struct list_head list;
>> +	u64 user_data;
>> +	s32 res;
>> +};
> 
> How about to use io_kiocb instead of new structure?
> It already has valid req->user_data and occasionaly used
> req->result. But this would probably take more work to do.

I did consider that, we could use both those fields. But we can't just
take ownership of the io_kiocb at this point, so it's much easier (and
less error prone) to just stuff in this new structure. The downside is
of course that we could still have an overflow, but if an atomic
allocation of this size fails, then the system is hosed anyway.

This is also a somewhat more slow path. We don't really expect to have
constant overflows, but we need to be able to handle them. Otherwise CQ
ring sizing must be worst case, and maybe that isn't even enough. This
allows use cases that we could not support before.

>> @@ -3036,8 +3108,10 @@ static int io_sq_thread(void *data)
>>   		}
>>   
>>   		to_submit = min(to_submit, ctx->sq_entries);
>> -		inflight += io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm,
>> -					   true);
>> +		ret = io_submit_sqes(ctx, to_submit, NULL, -1, &cur_mm, true);
>> +		if (ret < 0)
>> +			continue;
>> +		inflight += ret;
>>   
> 
> After rebase could be simplified to
> if (ret >= 0)
> 	inflight += ret;

Heh, that's exactly that I did:

https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.5/io_uring-test&id=f491ca28c78b7ff2dcd288847a212bb49f256500

Thanks for the review!

-- 
Jens Axboe