Re: [RFC 6/6] io_uring/notif: implement notification stacking

On 4/15/24 16:15, Willem de Bruijn wrote:
Pavel Begunkov wrote:
On 4/14/24 18:10, Willem de Bruijn wrote:
Pavel Begunkov wrote:
The network stack allows only one ubuf_info per skb, and unlike
MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate
ubuf_info. That means that send requests can't reuse a previously
allocated skb and need to allocate one or more new ones. That's fine
for large sends, but otherwise it would spam the stack with lots of skbs
carrying just a little data each.

Can you give a little context why each send request has to be a
separate ubuf_info?

This patch series aims to make that model more efficient. Would it be
possible to just change the model instead? I assume you tried that and
it proved unworkable, but is it easy to explain what the fundamental
blocker is?

The uapi is such that you get a buffer completion (analogous to what you
get with recv(MSG_ERRQUEUE)) for each send request. So for an skb
to serve multiple send requests, it'd need to store a list of completions
in some way.
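
For context, the constraint lives in the skb itself: there is room for
exactly one zerocopy context per skb. A simplified sketch (abridged
from the kernel definitions around this series, not the exact layout):

/* Simplified: each skb can carry at most one ubuf_info, stashed in
 * skb_shinfo(skb)->destructor_arg. Its callback posts the completion,
 * so one skb naturally reports a single completion context. */
struct ubuf_info {
        void (*callback)(struct sk_buff *skb, struct ubuf_info *uarg,
                         bool success); /* fires once the data is sent */
        refcount_t refcnt;
        u8 flags;
};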

I probably don't know the io_uring implementation well enough yet, so
take this with a huge grain of salt.

MSG_ZEROCOPY can generate completions for multiple send calls from a
single uarg, by virtue of completions being incrementing IDs.
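
Concretely, a userspace consumer of those IDs looks roughly like the
sketch below (following Documentation/networking/msg_zerocopy.rst;
error handling trimmed):

#include <stdio.h>
#include <errno.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

/* Reap MSG_ZEROCOPY completions: each notification carries a range of
 * send IDs [ee_info, ee_data], where IDs increment by one per zerocopy
 * send() on the socket. One uarg can thus cover many sends. */
static int drain_zc_completions(int fd)
{
        char control[128];
        struct msghdr msg = {
                .msg_control = control,
                .msg_controllen = sizeof(control),
        };
        struct cmsghdr *cm;
        struct sock_extended_err *serr;

        if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
                return -errno;  /* -EAGAIN: queue is drained */

        for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
                serr = (struct sock_extended_err *)CMSG_DATA(cm);
                if (serr->ee_errno != 0 ||
                    serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
                        continue;
                /* buffers for sends ee_info..ee_data can be reused */
                printf("sends %u..%u completed\n",
                       serr->ee_info, serr->ee_data);
        }
        return 0;
}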

Is there a fundamental reason why io_uring needs a 1:1 mapping between
request slots in the API and uarg in the datapath?

That's an ABI difference. Where MSG_ZEROCOPY returns a range of bytes
for the user to look up which buffers can now be reused, io_uring posts
one completion per send request, and by request I mean the io_uring
way of doing a sendmsg(2). Hence the 1:1 mapping of uargs (which post
that completion) to send zc requests.

IOW, where MSG_ZEROCOPY's uarg tracks the byte range it covers,
io_uring needs to know all requests associated with a uarg, which
is currently just one request because of the 1:1 mapping.
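
In liburing terms (a sketch; ring and socket setup omitted), that
per-request completion pair looks like:

/* One SEND_ZC request yields two CQEs: first the send result with
 * IORING_CQE_F_MORE set, later a notification with IORING_CQE_F_NOTIF
 * set once the kernel has released the buffer. The notification is
 * what the request's private uarg posts. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
struct io_uring_cqe *cqe;

io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
sqe->user_data = 42;
io_uring_submit(&ring);

for (int i = 0; i < 2; i++) {
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->flags & IORING_CQE_F_NOTIF) {
                /* the buffer of request 42 may be reused now */
        }
        io_uring_cqe_seen(&ring, cqe);
}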

Or differently, is
there no trivial way to associate a range of completions with a single
uarg?

Quite non-trivial without changing the ABI, I'd say. And an ABI change
wouldn't be small or free of pitfalls.

One could try to track sockets, have one "active" ubuf_info
per socket which all sends would use, and then eventually flush the
active ubuf so it can post completions and create a new one.

This is basically what MSG_ZEROCOPY does for TCP. It signals POLLERR
as soon as one completion arrives. Then when the process gets around to
reading MSG_ERRQUEUE, it returns the range of completions that have
arrived in the meantime. A process can thus decide to postpone
completion handling to increase batching.
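
In code, that postponed, batched reap amounts to the following (reusing
the drain_zc_completions() helper sketched earlier):

#include <poll.h>

/* POLLERR is raised when the first notification arrives; by the time
 * the error queue is read, a single recvmsg() can return a range
 * covering many sends. */
static void wait_and_reap(int fd)
{
        struct pollfd pfd = { .fd = fd, .events = 0 };

        poll(&pfd, 1, -1);  /* POLLERR is reported even with events == 0 */
        if (pfd.revents & POLLERR)
                while (drain_zc_completions(fd) == 0)
                        ;   /* drain until -EAGAIN */
}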

Yes, there is that on the completion side, but on the submission
side you also need to decide when to let the current uarg go and
allocate a new one. Not an issue if the uarg is owned by the TCP
stack: you don't have to additionally reference it, you know
when you empty the queue, and all that. Not so great if
io_uring needs to talk to the socket to understand when the uarg
is better dropped.

but io_uring wouldn't know when it needs to "flush", whereas in the
net stack that happens naturally when it pushes skbs out of the queue.
Not to mention that socket tracking has its own complications.

As for the uapi, in early versions of io_uring's SEND_ZC, ubuf_info and
requests weren't entangled; roughly speaking, the user could choose
that this request should use this particular ubuf_info (I can elaborate
if it's interesting). It wasn't too complex, but all feedback pointed out
that it's much easier to use how it is now, and honestly it does
buy simplicity.

I see. I suppose that answers the 1:1 mapping ABI question I
asked above. I should reread that patch.

I'm not sure what a different model would give. We wouldn't win
in efficiency compared to this patch; I can go into detail on
how there are no extra atomics/locks/kmallocs/etc., the only bit
left is waking up waiting tasks, but that would still need to happen.
I can even optimise / amortise the ubuf refcounting if that would
matter.

Slight aside: we know that MSG_ZEROCOPY is quite inefficient for
small sends. A very rough rule of thumb is that you need sends of around
16KB or larger for it to outperform regular copy. Part of that is the
memory pinning. The other part is the notification handling:
MSG_ERRQUEUE is expensive. I hope that io_uring can not just match, but
improve on MSG_ZEROCOPY, especially for smaller packets.
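
(For illustration only, the rule of thumb amounts to the toy heuristic
below; the crossover point is workload- and NIC-dependent, and 16KB is
just a ballpark.)

#define ZC_THRESHOLD (16 * 1024)  /* ballpark, not a hard number */

/* assumes SO_ZEROCOPY is set on the socket and that the caller reaps
 * the error queue whenever MSG_ZEROCOPY is used */
static ssize_t send_maybe_zc(int fd, const void *buf, size_t len)
{
        return send(fd, buf, len, len >= ZC_THRESHOLD ? MSG_ZEROCOPY : 0);
}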

I have some numbers left over from benchmarking this patchset. Not too
well suited to answer your question, but they still give an idea.
Just a benchmark, single buffer, 100G Broadcom NIC IIRC. All runs are
io_uring based; -z<bool> switches between copy and zerocopy. Zerocopy
uses registered buffers, so there's no page pinning or page table
traversal at runtime. 10s per run is not ideal, but matched
longer runs.
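
(The registered-buffer setup is roughly the liburing sketch below;
pages are pinned once at registration time rather than per send.)

struct iovec iov = { .iov_base = buf, .iov_len = buf_len };
struct io_uring_sqe *sqe;

io_uring_register_buffers(&ring, &iov, 1);  /* pin once, up front */

sqe = io_uring_get_sqe(&ring);
/* buf_index 0 refers to the buffer registered above */
io_uring_prep_send_zc_fixed(sqe, sockfd, buf, buf_len, 0, 0, 0);
io_uring_submit(&ring);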

# 1200 bytes
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s1200 -z0
packets=15004160 (MB=17170), rps=1470996 (MB/s=1683)
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s1200 -z1
packets=10440224 (MB=11947), rps=1023551 (MB/s=1171)

# 4000 bytes
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s4000 -z0
packets=11742688 (MB=44794), rps=1151243 (MB/s=4391)
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s4000 -z1
packets=14144048 (MB=53955), rps=1386671 (MB/s=5289)

# 8000 bytes
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s8000 -z0
packets=6868976 (MB=52406), rps=673429 (MB/s=5137)
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s8000 -z1
packets=10800784 (MB=82403), rps=1058900 (MB/s=8078)


--
Pavel Begunkov



