Re: [PATCH 6/8] io_uring/net: support multishot for send

On 2/26/24 21:27, Jens Axboe wrote:
On 2/26/24 1:51 PM, Pavel Begunkov wrote:
On 2/26/24 20:12, Jens Axboe wrote:
On 2/26/24 12:21 PM, Pavel Begunkov wrote:
On 2/26/24 19:11, Jens Axboe wrote:
On 2/26/24 8:41 AM, Pavel Begunkov wrote:
On 2/26/24 15:16, Jens Axboe wrote:
On 2/26/24 7:36 AM, Pavel Begunkov wrote:
On 2/26/24 14:27, Jens Axboe wrote:
On 2/26/24 7:02 AM, Dylan Yudaken wrote:
On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe
...
I don't think that's true - if you're doing large streaming, you're
more likely to keep the socket buffer full, whereas for smallish
sends, it's less likely to be full. Testing with the silly proxy
confirms that. And

I don't see any contradiction to what I said. With streaming/large
sends it's more likely to be polled. For small sends and
send-receive-send-... patterns the sock queue is unlikely to be full,
in which case the send is processed inline, and so the feature
doesn't add performance, as you agreed a couple of emails before.

Gotcha, I guess I misread you, we agree that the poll side is more
likely on bigger buffers.

outside of cases where pacing just isn't feasible, it's extra
overhead for cases where you potentially could pace.

I lost it, what overhead?

Overhead of needing to serialize the sends in the application, which may
include both extra memory needed and overhead in dealing with it.

I think I misread the code. Does it push 1 request for each
send buffer / queue_send() in case of provided bufs?

Right, that's the way it's currently setup. Per send (per loop), if
you're using provided buffers, it'll do a send per buffer. If using
multishot on top of that, it'll do one send per loop regardless of the
number of buffers.

Ok, then I'd say the performance tests are misleading, at least
the proxy one. For selected buffers, sending tons of requests sounds
unnecessarily expensive, but I don't have numbers to back it. The
normal case must employ the natural batching it has with
serialisation.

It's important to compare programs that are optimised and close
to what is achievable in userspace. You wouldn't compare it with

for (i = 0; i < nr_bytes; i++)
	send(sock, data + i, 1, 0);

and claim that it's N times faster. Or surely we can add an
empty "for" loop to one side and say "users are stupid and
will put garbage here anyway, so fair game", but then it
needs a note in small font "valid only for users who can't
write code up to standard"

static void defer_send()
{
	struct pending_send *ps = malloc(sizeof(*ps));
	...
}

fwiw, maybe malloc is not really expensive there, but
that sounds like a pretty bad practice for a high-performance
benchmark.

Anyway, the overhead of serialisation would be negligible.
And that's the same extra memory you keep for the provided buffer
pool, and you can allocate it once. Also consider that provided
buffers are fixed size and would be hard to resize without waiting,
so userspace would still need to keep another, userspace
backlog; it can't just drop requests. Or you make the provided queues
extra large, but they're per socket and you'd be wasting lots of memory.

IOW, I don't think this overhead gets us anywhere close to
explaining the 30%+ perf gap.

The 32-byte case is obviously somewhat pathological, as you're going to
be much better off having a bunch of these pipelined rather than issued
serially. As you can see from the 1000 byte packets, at that point it
doesn't matter that much. It's mostly about making it simpler at that
point.

To me, the main appeal of this is the simplicity.

I'd argue it doesn't seem any simpler than the alternative.

It's certainly simpler for an application to do "add buffer to queue"
and not need to worry about managing sends, than it is to manage a
backlog of only having a single send active.
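
To make that concrete, with liburing's provided buffer ring helpers the
application side is roughly the following (a sketch, not the actual
proxy code; NR_BUFS and BGID are placeholders):

/* one-time setup: register a buffer ring for group BGID (assumes an
 * already-initialized ring; NR_BUFS must be a power of 2) */
int err;
struct io_uring_buf_ring *br;

br = io_uring_setup_buf_ring(&ring, NR_BUFS, BGID, 0, &err);

/* per buffer: hand it to the kernel and forget about it */
io_uring_buf_ring_add(br, data, len, bid,
		      io_uring_buf_ring_mask(NR_BUFS), 0);
io_uring_buf_ring_advance(br, 1);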

They still need to manage / re-queue sends. And maybe I
misunderstand the point, but it's only one request inflight
per socket in either case.

Sure, but one is a manageable condition, the other one is not. If you
can keep N inflight at the same time and only abort the chain in case of
error/short send, that's a corner case. Versus not knowing when things
get reordered, and hence always needing to serialize.

Manageable or not, you still have to implement all of that, whereas
serialisation is not complex and I doubt it's anywhere near expensive
enough to overturn the picture. It seems that multishot selected
bufs would also need serialisation, and for oneshots managing
multiple requests when you don't know which one sends which buffer
would be complicated in real scenarios.
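
For reference, the serialisation in question is about this much code
(a sketch; the conn fields and the backlog_*/submit_send() helpers are
made up for illustration):

/* at most one send inflight per socket; park the rest in a
 * userspace backlog and kick the next one on completion */
static void queue_send(struct conn *c, void *data, int len)
{
	if (c->send_inflight) {
		backlog_push(&c->backlog, data, len);
		return;
	}
	c->send_inflight = true;
	submit_send(c, data, len);
}

static void send_done(struct conn *c)
{
	struct buf *b = backlog_pop(&c->backlog);

	if (b)
		submit_send(c, b->data, b->len);
	else
		c->send_inflight = false;
}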

What kind of batching? The batching done by the tests is the same,
regardless of whether or not send multishot is used, in the sense that we

You can say that, but I say that it moves into the kernel
batching that can be implemented in userspace.

And then most people get it wrong or just do the basic stuff, and
performance isn't very good. Getting the most out of it can be tricky
and require extensive testing and knowledge building. I'm confident
you'd be able to write an efficient version, but that's not the same as
saying "it's trivial to write an efficient version".

It actually _is_ trivial to write an efficient version for anyone
competent enough and having a spare day for that, which usually
takes the form of a library

I'm a firm believer in not putting into the kernel what can
already be well implemented in userspace, because the next step
could be "Firefox can be implemented in userspace, but it requires
knowledge building, so let's implement it in the kernel". At
least I recall in-kernel nginx / HTTP servers not flying well

wait on the same number of completions. As it's a basic proxy kind of
thing, it'll receive a packet and send a packet. Submission batching is
the same too, we'll submit when we have to.

"If you actually need to poll tx, you send a request and collect
data into iov in userspace in background. When the request
completes you send all that in batch..."

That's how it's in Thrift for example.

In terms of "proxy", the first approximation would be to
do something like defer_send() for normal requests as well, then

static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
			 void *data, int bid, int len)
{
	...

	defer_send(data);

	while ((buf = defer_backlog.get()) != NULL)
		iov[idx++] = buf;

	msghdr->msg_iovlen = idx;
	...
}

Yep, that's the iovec coalescing, and that could certainly be done. And
then you need to size the iov[] so that it's always big enough, OR
submit that send and still deal with managing your own backlog.

Which is as trivial as iov.push_back() or a couple of lines with
realloc, whereas for selected buffers you would likely need to
wait for the previous request[s] to complete, which you cannot
do in place.
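
The realloc side really is just a couple of lines, e.g. (names are
illustrative):

/* grow iov[] on demand instead of sizing it up front */
if (idx == iov_cap) {
	iov_cap = iov_cap ? 2 * iov_cap : 16;
	iov = realloc(iov, iov_cap * sizeof(*iov));
}
iov[idx].iov_base = buf->data;
iov[idx].iov_len = buf->len;
idx++;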

The point is, it seems that once you try to write something
error-proof and ready for real life, it wouldn't look simpler, or
much simpler, than the suggested approach, and then performance
becomes the question.

I don't think we disagree that there are other solutions. I'm saying
that I like this solution. I think it's simple to use for the cases that
can use it, and that's why the patches exist. It fits with the notion of
an async API being able to keep multiple things in flight, rather than a
semi solution where you kind of can, except not for cases X and Y
because of corner cases.
...

--
Pavel Begunkov



