Re: [PATCH 10/10] io_uring: add support for ring mapped supplied buffers

On 4/29/22 7:21 AM, Victor Stewart wrote:
> top posting because this is a tangential but related comment.
> 
> the way i manage memory in my network server is by initializing with a
> fixed maximum number of supported clients, and then mmap an enormous
> contiguous buffer of something like (100MB + 100MB) * nMaxClients, and
> then for each client assign a fixed 100MB range for receive and
> another for send.
> 
> then with transparent huge pages disabled, only the pages with bytes
> in them are ever resident, memset-ing bytes to 0 as they're consumed
> by the send or receive paths.
> 
> so this provides a perfectly optimal deterministic memory
> architecture, which makes client memory management effortless, while
> costing nothing, without the hassle of recycling buffers or worrying
> about what range to recv into or write into.
> 
> but i know that registered buffers as is have some restriction on
> maximum number of bytes one can register (i forget exactly).

You can have 64K buffer groups, and 64K buffers in each group. Each
buffer can be up to INT_MAX bytes in size.

> so maybe there's some way in the future to accommodate this scheme as
> well, which i believe is optimal out of all options.

As you noted, this patch doesn't change how provided buffers work; it
merely makes the mechanism by which they are provided and consumed more
efficient.

One idea that we have entertained internally is to allow incremental
consumption of a buffer. Let's assume your setup. I'm going to exclude
the send side as it isn't relevant for this discussion. This means you have
100MB of buffer space for receive per client. Each client would have a
buffer group ID associated with it, for their receive buffers. If you
know what size your receives will be, then you'd provide your 100MB in
chunks of that. Each receive would pick a chunk, recv data, then post
the completion that holds information on what buffer was picked for it.
When the client is done with the data, it puts it back into the provided
pool.
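
For reference, here's a minimal liburing-style sketch of that classic
provided-buffer flow. The group ID, chunk size, and socket fd are just
placeholders for illustration, not anything defined by this patchset:

#include <liburing.h>

#define CHUNK_SZ	4096	/* assumed fixed receive size */
#define NR_CHUNKS	256	/* placeholder; 100MB / CHUNK_SZ in the real setup */

static void provide_chunks(struct io_uring *ring, void *base, int bgid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* hand NR_CHUNKS buffers of CHUNK_SZ each to the kernel under group bgid */
	io_uring_prep_provide_buffers(sqe, base, CHUNK_SZ, NR_CHUNKS, bgid, 0);
	io_uring_submit(ring);
}

static void queue_recv(struct io_uring *ring, int fd, int bgid)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	/* no buffer passed in; the kernel picks one from group bgid at recv time */
	io_uring_prep_recv(sqe, fd, NULL, CHUNK_SZ, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = bgid;
	io_uring_submit(ring);
}

On completion, IORING_CQE_F_BUFFER is set in cqe->flags and the chosen
buffer ID is in the upper bits (cqe->flags >> IORING_CQE_BUFFER_SHIFT),
which is what the app uses to re-provide that chunk once it's done with
the data.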

If you have wildly different receive sizes, and no idea how much you'd
get at any point in time, then this scheme doesn't work so well as you
have to then either do multiple receive requests to get all the data, or
size your chunks such that any receive will fit. Obviously that can be
wasteful, as you end up with fewer available chunks, and maybe you need
to throw more than 100MB at it at that point.

If we allowed incremental consumption, you could provide your 100MB as
just one chunk. When a recv request is posted for eg 1500 bytes, you'd
simply chop 1500 off the front of that buffer and use it. You're now
left with a single chunk that's 100MB-1500B in size.

One complication here is that we don't have enough room in the CQE to
tell the app where we consumed from. Hence we'd need to ensure that the
kernel and application agree on where data is consumed from for any
given receive. Given full ordering of completions wrt data receive, this
isn't impossible, but it does seem a bit fragile to me.
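
Just to illustrate where the fragility comes from, here's a hypothetical
sketch of the app side mirroring the kernel's consumption offset purely
from completion ordering. None of this exists in the patchset; it's only
meant to show what the bookkeeping would look like:

struct inc_buf {
	unsigned char *base;	/* start of the client's receive region */
	size_t off;		/* where the app thinks the kernel consumes next */
};

/* must be called in strict CQE order; res is cqe->res for a successful recv */
static void *complete_recv(struct inc_buf *ib, size_t res)
{
	void *data = ib->base + ib->off;

	ib->off += res;		/* any reordering or lost CQE desyncs this */
	return data;
}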

We do have pending patches that allow for bigger CQEs, with the initial
use case being the passthrough support for eg NVMe. With that, you have
two u64 extra fields for any CQE, if you configure your ring to use big
CQEs. With that, we could do incremental consumption and just have the
recv completion be:

cqe = {
	.user_data	/* whatever app set user_data to for recv */
	.res		/* bytes received */
	.flags		/* IORING_CQE_F_INC_BUFFER */
	.extra1		/* start address of where data landed */
	.extra2		/* still unused */
}

and the client now knows that data was received into the address at
.extra1 and of .res bytes in length. This would not support vectored
recv, but that seems like a minor thing as you can just do big buffers.
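
As a rough sketch, consuming such a completion on a ring created with
IORING_SETUP_CQE32 could look like the below, where big_cqe[0] maps to
the .extra1 field in the layout above. Note that IORING_CQE_F_INC_BUFFER
is purely hypothetical here:

#include <liburing.h>
#include <stdint.h>

/* hypothetical flag from the layout above, not part of any released API */
#define IORING_CQE_F_INC_BUFFER	(1U << 15)

static void handle_inc_recv(struct io_uring *ring)
{
	struct io_uring_cqe *cqe;

	if (io_uring_wait_cqe(ring, &cqe))
		return;
	if (cqe->flags & IORING_CQE_F_INC_BUFFER) {
		void *data = (void *)(uintptr_t)cqe->big_cqe[0];	/* .extra1 */
		int len = cqe->res;					/* bytes received */
		/* hand data[0..len) to the protocol layer */
		(void)data; (void)len;
	}
	io_uring_cqe_seen(ring, cqe);
}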

This does suffer from the fragmentation issue again. Your case probably
does not as you have a group per client, but other use cases might have
shared groups.

That was a long-winded way of saying that "yes, this patch doesn't
fundamentally change how provided buffers work, it just makes it more
efficient to use and allows easy re-provide options that previously
made provided buffers too slow to use for some use cases".

I welcome feedback! It's not entirely clear to me what your suggestion
is; it looks more like you're describing your use cases and soliciting
ideas on how provided buffers could work better for that?

-- 
Jens Axboe



