Re: [PATCH 6/8] io_uring/net: support multishot for send

On 2/26/24 20:12, Jens Axboe wrote:
On 2/26/24 12:21 PM, Pavel Begunkov wrote:
On 2/26/24 19:11, Jens Axboe wrote:
On 2/26/24 8:41 AM, Pavel Begunkov wrote:
On 2/26/24 15:16, Jens Axboe wrote:
On 2/26/24 7:36 AM, Pavel Begunkov wrote:
On 2/26/24 14:27, Jens Axboe wrote:
On 2/26/24 7:02 AM, Dylan Yudaken wrote:
On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe
...
I don't think that's true - if you're doing large streaming, you're
more likely to keep the socket buffer full, whereas for smallish
sends, it's less likely to be full. Testing with the silly proxy
confirms that. And

I don't see any contradiction to what I said. With streaming/large
sends it's more likely to be polled. For small sends and
send-receive-send-... patterns the sock queue is unlikely to be full,
in which case the send is processed inline, and so the feature
doesn't add performance, as you agreed a couple of emails before.

Gotcha, I guess I misread you, we agree that the poll side is more
likely on bigger buffers.

outside of cases where pacing just isn't feasible, it's extra
overhead for cases where you potentially could pace them.

I lost it, what overhead?

Overhead of needing to serialize the sends in the application, which may
include both extra memory needed and overhead in dealing with it.

I think I misread the code. Does it push 1 request for each
send buffer / queue_send() in case of provided bufs?

Anyway, the overhead of serialisation would be negligible.
And it's the same extra memory you keep for the provided buffer
pool, which you can allocate once. Also consider that provided
buffers are fixed size and would be hard to resize without waiting,
so userspace would still need to keep another, userspace
backlog; it can't just drop requests. Or you make the provided
queues extra large, but that's per socket and you'd be wasting
lots of memory.

IOW, I don't think this overhead comes anywhere close to
explaining the 30%+ perf gap.
To me, the main appeal of this is the simplicity.

I'd argue it doesn't seem any simpler than the alternative.

It's certainly simpler for an application to do "add buffer to queue"
and not need to worry about managing sends, than it is to manage a
backlog of only having a single send active.

They still need to manage / re-queue sends. And maybe I
misunderstand the point, but it's only one request inflight
per socket in either case.
serialize sends. Using provided buffers makes this very easy,
as you don't need to care about it at all, and it eliminates
complexity in the application dealing with this.

If I'm correct, the example also serialises sends(?). I don't
think it's that much simpler. You batch, you send. Same with this,
but you batch into a provided buffer and the send is conditional.
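
(As a concrete sketch of that conditional-send model, using liburing's
buffer-ring helpers — those exist today for the receive side; picking
provided buffers on the send side is what this series adds. The struct
conn layout, BGID and BR_ENTRIES below are made up for illustration:)

#include <liburing.h>

#define BGID		0	/* buffer group id, made up */
#define BR_ENTRIES	256	/* buffer ring entries, made up */

/* Hypothetical per-connection state for the sketch */
struct conn {
	struct io_uring_buf_ring *br;	/* registered buffer ring */
	int fd;
	int send_inflight;
};

static void queue_send(struct io_uring *ring, struct conn *c,
		       void *data, unsigned int len, unsigned short bid)
{
	struct io_uring_sqe *sqe;

	/* batch: hand the buffer to the kernel-visible ring */
	io_uring_buf_ring_add(c->br, data, len, bid,
			      io_uring_buf_ring_mask(BR_ENTRIES), 0);
	io_uring_buf_ring_advance(c->br, 1);

	/* conditional send: only arm one if none is inflight */
	if (c->send_inflight)
		return;

	sqe = io_uring_get_sqe(ring);
	/* addr/len are NULL/0; the kernel picks a buffer from the group */
	io_uring_prep_send(sqe, c->fd, NULL, 0, 0);
	sqe->flags |= IOSQE_BUFFER_SELECT;
	sqe->buf_group = BGID;
	c->send_inflight = 1;
}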

Do you mean the proxy example? Just want to be sure we're talking
about

Yes, proxy, the one you referenced in the CV. And FWIW, I don't think
it's a fair comparison without batching followed by multi-iov.

It's not about vectored vs non-vectored IO, though you could of course
need to allocate an arbitrarily sized iovec that you can append to. And
now you need to use sendmsg rather than just send, which has further
overhead on top of send.

That's not nearly enough overhead to explain the difference,
I don't believe; going through the net stack is quite expensive.

What kind of batching? The batching done by the tests is the same,
regardless of whether or not send multishot is used, in the sense that we

You can say that, but I say that it moves into the kernel
the batching that can be implemented in userspace.

wait on the same number of completions. As it's a basic proxy kind of
thing, it'll receive a packet and send a packet. Submission batching is
the same too, we'll submit when we have to.

"If you actually need to poll tx, you send a request and collect
data into iov in userspace in background. When the request
completes you send all that in batch..."

That's how it's in Thrift for example.

In terms of "proxy", the first approximation would be to
do sth like defer_send() for normal requests as well, then

static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
			 void *data, int bid, int len)
{
	...

	/* stash the buffer instead of sending it right away */
	defer_send(data);

	/* fold the whole backlog into one vectored send */
	while ((buf = defer_backlog.get()) != NULL)
		iov[idx++] = buf;
	msghdr->iovlen = idx;
	...
}

the same thing. Yes it has to serialize sends, because otherwise we
can run into the condition described in the patch that adds
provided buffer support for send. But I did bench multishot
separately from there, here's some of it:

10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte
packets. Send ring and send multishot not used:

Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
=====================================================
1000   |    No     |  No   |   437  | 1.22M | 9598M
32     |    No     |  No   |  5856  | 2.87M |  734M

Same test, now turn on send ring:

Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
===========================================================
1000   |    Yes    |  No   |   436  | 1.23M | 9620M | + 0.2%
32     |    Yes    |  No   |  3462  | 4.85M | 1237M | +68.5%

Same test, now turn on send mshot as well:

Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
===========================================================
1000   |    Yes    |  Yes  |   436  | 1.23M | 9620M | + 0.2%
32     |    Yes    |  Yes  |  3125  | 5.37M | 1374M | +87.2%

which does show that there's another win on top for just queueing
these sends and doing a single send to handle them, rather than
needing to prepare a send for each buffer. Part of that may be that
you simply run out of SQEs and then have to submit regardless of
where you are at.
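
(The standard liburing pattern for handling that, as a sketch: flush
the ring when io_uring_get_sqe() runs dry, then retry. Nothing here is
specific to this series:)

/* Grab an SQE, flushing pending submissions if the SQ ring is full */
static struct io_uring_sqe *get_sqe(struct io_uring *ring)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	if (!sqe) {
		/* SQ ring full: submit what we have queued, then retry */
		io_uring_submit(ring);
		sqe = io_uring_get_sqe(ring);
	}
	return sqe;
}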

How many sockets did you test with? It's 1 SQE per sock max

The above is just one, but I've run it with a lot more sockets. Nothing
like thousands, but 64-128.

+87% sounds like a huge difference, and I don't understand where it
comes from, hence the question.

There are several things:

1) Fact is that the app has to serialize sends for the unlikely case
    of sends being reordered because of the condition outlined in the
    patch that enables provided buffer support for send. This is the
    largest win, particularly with smaller packets, as having to
    serialize ruins the send pipeline.

Do those small packets force it to poll?

2) We're posting fewer SQEs. That's the multishot win. Obviously not
    as large, but it does help.
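
(For illustration only, arming such a multishot send might look like
the sketch below. IORING_SEND_MULTISHOT is assumed here by analogy
with IORING_RECV_MULTISHOT; the exact name and placement belong to
this unmerged series, so treat it as a guess:)

/* Arm one multishot send on a socket; it keeps posting completions
 * (flagged IORING_CQE_F_MORE) and picking new provided buffers for
 * as long as it stays armed. IORING_SEND_MULTISHOT is an assumed name. */
struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

io_uring_prep_send(sqe, fd, NULL, 0, 0);
sqe->flags |= IOSQE_BUFFER_SELECT;
sqe->buf_group = BGID;
sqe->ioprio |= IORING_SEND_MULTISHOT;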

People have asked in the past on how to serialize sends, and I've had to
tell them that it isn't really possible. The only option we had was
using drain or links, which aren't ideal nor very flexible. Using
provided buffers finally gives the application a way to do that without
needing to do anything really. Does every application need it? Certainly
not, but for the ones that do, I do think it provides a great
alternative that's better performing than doing single sends at a
time.
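
(For contrast, the old link-based serialisation looks something like
this sketch; it works, but a short or failed send cancels the rest of
the chain, which is part of why links are inflexible here:)

/* Serialize N sends with IOSQE_IO_LINK: each starts only after the
 * previous completes. Any error or short send fails the remainder
 * of the chain, so the app still has to handle resubmission. */
for (int i = 0; i < nr_bufs; i++) {
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_send(sqe, fd, bufs[i], lens[i], 0);
	if (i + 1 < nr_bufs)
		sqe->flags |= IOSQE_IO_LINK;
}
io_uring_submit(ring);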

As per my note on the additional userspace backlog, any real generic
app, and especially libs, would need to do more to support it.

--
Pavel Begunkov



