Re: [PATCH 6/8] io_uring/net: support multishot for send

On 2/26/24 1:51 PM, Pavel Begunkov wrote:
> On 2/26/24 20:12, Jens Axboe wrote:
>> On 2/26/24 12:21 PM, Pavel Begunkov wrote:
>>> On 2/26/24 19:11, Jens Axboe wrote:
>>>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe
> ...
>>>> I don't think that's true - if you're doing large streaming, you're
>>>> more likely to keep the socket buffer full, whereas for smallish
>>>> sends, it's less likely to be full. Testing with the silly proxy
>>>> confirms that. And
>>>
>>> I don't see any contradiction to what I said. With streaming/large
>>> sends it's more likely to be polled. For small sends and
>>> send-receive-send-... patterns the sock queue is unlikely to be full,
>>> in which case the send is processed inline, and so the feature
>>> doesn't add performance, as you agreed a couple email before.
>>
>> Gotcha, I guess I misread you, we agree that the poll side is more
>> likely on bigger buffers.
>>
>>>> outside of cases where pacing just isn't feasible, it's extra
>>>> overhead for cases where you potentially could or what.
>>>
>>> I lost it, what overhead?
>>
>> Overhead of needing to serialize the sends in the application, which may
>> include both extra memory needed and overhead in dealing with it.
> 
> I think I misread the code. Does it push 1 request for each
> send buffer / queue_send() in case of provided bufs?

Right, that's the way it's currently set up. Per send (per loop), if
you're using provided buffers, it'll do a send per buffer. If using
multishot on top of that, it'll do one send per loop regardless of the
number of buffers.
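
For reference, the submission side boils down to something like this
(untested sketch using liburing; I'm assuming the flag added by this
series is called IORING_SEND_MULTISHOT and lives in sqe->ioprio like
IORING_RECV_MULTISHOT does, the real name may differ):

#include <liburing.h>

/* Sketch: queue a send that picks its payload from provided buffer
 * group 'gid' at issue time. With multishot set, the single armed SQE
 * keeps sending for as long as the group has buffers. */
static void arm_send(struct io_uring *ring, int fd, int gid, int mshot)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    /* addr/len are ignored, the buffer is picked when the send runs */
    io_uring_prep_send(sqe, fd, NULL, 0, 0);
    sqe->flags |= IOSQE_BUFFER_SELECT;
    sqe->buf_group = gid;
    if (mshot)
        sqe->ioprio |= IORING_SEND_MULTISHOT;   /* assumed name */
}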

> Anyway, the overhead of serialisation would be negligible.
> And that's same extra memory you keep for the provided buffer
> pool, and you can allocate it once. Also consider that provided
> buffers are fixed size and it'd be hard to resize without waiting,
> thus the userspace would still need to have another, userspace
> backlog, it can't just drop requests. Or you make provided queues
> extra large, but it's per socket and you'd be wasting lots of memory.
> 
> IOW, I don't think this overhead could anyhow close us to
> the understanding of the 30%+ perf gap.

The 32-byte case is obviously somewhat pathological, as you're going to
be much better off having a bunch of these pipelined rather than issued
serially. As you can see from the 1000 byte packets, at that size it
doesn't matter that much, and it's mostly about making things simpler.

>>>> To me, the main appeal of this is the simplicity.
>>>
>>> I'd argue it doesn't seem any simpler than the alternative.
>>
>> It's certainly simpler for an application to do "add buffer to queue"
>> and not need to worry about managing sends, than it is to manage a
>> backlog of only having a single send active.
> 
> They still need to manage / re-queue sends. And maybe I
> misunderstand the point, but it's only one request inflight
> per socket in either case.

Sure, but one is a manageable condition, the other one is not. If you
can keep N inflight at the same time and only abort the chain in case of
error/short send, that's a corner case. Versus not knowing when things
get reordered, and hence always needing to serialize.
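
And on the application side, "add buffer to queue" really is just
appending to the registered buffer ring for that socket (sketch;
assumes the ring was registered for this group, e.g. via
io_uring_setup_buf_ring(), and that a send like the one above is
armed):

/* Sketch: hand one filled buffer to the kernel by appending it to the
 * socket's provided buffer ring. The armed (multishot) send picks it
 * up in order, with no per-buffer send bookkeeping in the app. */
static void queue_buf(struct io_uring_buf_ring *br, unsigned ring_entries,
                      void *data, unsigned len, unsigned short bid)
{
    int mask = io_uring_buf_ring_mask(ring_entries);

    io_uring_buf_ring_add(br, data, len, bid, mask, 0);
    io_uring_buf_ring_advance(br, 1);
}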

>>>>>> serialize sends. Using provided buffers makes this very easy,
>>>>>> as you don't need to care about it at all, and it eliminates
>>>>>> complexity in the application dealing with this.
>>>>>
>>>>> If I'm correct the example also serialises sends(?). I don't
>>>>> think it's that simpler. You batch, you send. Same with this, but
>>>>> batch into a provided buffer and the send is conditional.
>>>>
>>>> Do you mean the proxy example? Just want to be sure we're talking
>>>> about
>>>
>>> Yes, proxy, the one you referenced in the CV. And FWIW, I don't think
>>> it's a fair comparison without batching followed by multi-iov.
>>
>> It's not about vectored vs non-vectored IO, though you could of course
>> need to allocate an arbitrarily sized iovec that you can append to. And
>> now you need to use sendmsg rather than just send, which has further
>> overhead on top of send.
> 
> That's not nearly enough overhead to explain the difference,
> I don't believe so, going through the net stack is quite expensive.

See above; for the 32-byte packets, it's not hard to imagine big wins
from shoving many in at once vs doing them piecemeal.

And honestly, I was surprised at how well the stack deals with this on
the networking side! It may have room for improvement, but it's not
nearly as sluggish as I feared.

>> What kind of batching? The batching done by the tests is the same,
>> regardless of whether or not send multishot is used in the sense that we
> 
> You can say that, but I say that it moves into the kernel
> batching that can be implemented in userspace.

And then most people get it wrong or just do the basic stuff, and
performance isn't very good. Getting the most out of it can be tricky
and require extensive testing and knowledge building. I'm confident
you'd be able to write an efficient version, but that's not the same as
saying "it's trivial to write an efficient version".

>> wait on the same number of completions. As it's a basic proxy kind of
>> thing, it'll receive a packet and send a packet. Submission batching is
>> the same too, we'll submit when we have to.
> 
> "If you actually need to poll tx, you send a request and collect
> data into iov in userspace in background. When the request
> completes you send all that in batch..."
> 
> That's how it's in Thrift for example.
> 
> In terms of "proxy", the first approximation would be to
> do sth like defer_send() for normal requests as well, then
> 
> static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
>              void *data, int bid, int len)
> {
>     ...
> 
>     defer_send(data);
> 
>     while (buf = defer_backlog.get()) {
>         iov[idx++] = buf;
>     }
>     msghdr->iovlen = idx;
>     ...
> }

Yep, that's the iovec coalescing, and that could certainly be done. And
then you need to size the iov[] so that it's always big enough, OR
submit that send and still deal with managing your own backlog.
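
Roughly, for the coalescing variant (sketch; the struct conn fields and
the backlog helpers like backlog_pop()/MAX_IOV are made up for
illustration, and iov[] has to be sized big enough or grown on demand):

/* Sketch: drain the userspace backlog into one iovec array and issue a
 * single sendmsg covering the lot. */
static void flush_backlog(struct io_uring *ring, struct conn *c, int fd)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct buf *b;
    int idx = 0;

    while (idx < MAX_IOV && (b = backlog_pop(c)) != NULL) {
        c->iov[idx].iov_base = b->data;
        c->iov[idx].iov_len = b->len;
        idx++;
    }
    memset(&c->msg, 0, sizeof(c->msg));
    c->msg.msg_iov = c->iov;
    c->msg.msg_iovlen = idx;
    io_uring_prep_sendmsg(sqe, fd, &c->msg, 0);
}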

I don't think we disagree that there are other solutions. I'm saying
that I like this solution. I think it's simple to use for the cases that
can use it, and that's why the patches exist. It fits with the notion of
an async API being able to keep multiple things in flight, rather than a
semi-solution where you kind of can, except not for cases X and Y
because of corner cases.

>>>> the same thing. Yes it has to serialize sends, because otherwise we
>>>> can run into the condition described in the patch that adds
>>>> provided buffer support for send. But I did bench multishot
>>>> separately from there, here's some of it:
>>>>
>>>> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte
>>>> packets. Send ring and send multishot not used:
>>>>
>>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
>>>> =====================================================
>>>> 1000   |    No     |  No   |   437  | 1.22M | 9598M
>>>> 32     |    No     |  No   |  5856  | 2.87M |  734M
>>>>
>>>> Same test, now turn on send ring:
>>>>
>>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>>>> ===========================================================
>>>> 1000   |    Yes    |  No   |   436  | 1.23M | 9620M | + 0.2%
>>>> 32     |    Yes    |  No   |  3462  | 4.85M | 1237M | +68.5%
>>>>
>>>> Same test, now turn on send mshot as well:
>>>>
>>>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   | Diff
>>>> ===========================================================
>>>> 1000   |    Yes    |  Yes  |   436  | 1.23M | 9620M | + 0.2%
>>>> 32     |    Yes    |  Yes  |  3125  | 5.37M | 1374M | +87.2%
>>>>
>>>> which does show that there's another win on top for just queueing
>>>> these sends and doing a single send to handle them, rather than
>>>> needing to prepare a send for each buffer. Part of that may be that
>>>> you simply run out of SQEs and then have to submit regardless of
>>>> where you are at.
>>>
>>> How many sockets did you test with? It's 1 SQE per sock max
>>
>> The above is just one, but I've run it with a lot more sockets. Nothing
>> like thousands, but 64-128.
>>
>>> +87% sounds like a huge difference, and I don't understand where it
>>> comes from, hence the question
>>
>> There are several things:
>>
>> 1) Fact is that the app has to serialize sends for the unlikely case
>>     of sends being reordered because of the condition outlined in the
>>     patch that enables provided buffer support for send. This is the
>>     largest win, particularly with smaller packets, as it ruins the
>>     send pipeline.
> 
> Do those small packets force it to poll?

There's no polling in my testing.

>> 2) We're posting fewer SQEs. That's the multishot win. Obviously not
>>     as large, but it does help.
>>
>> People have asked in the past on how to serialize sends, and I've had to
>> tell them that it isn't really possible. The only option we had was
>> using drain or links, which aren't ideal nor very flexible. Using
>> provided buffers finally gives the application a way to do that without
>> needing to do anything really. Does every application need it? Certainly
>> not, but for the ones that do, I do think it provides a great
>> alternative that's better performing than doing single sends at a
>> time.
> 
> As per note on additional userspace backlog, any real generic app
> and especially libs would need to do more to support it.

Sure, if you get a short send or any abort in the chain, you need to
handle it. But things stall/stop at that point and you handle it, and
then you're back up and running. This is vastly different from "oh I
always need to serialize because X or Y may happen, even though it
rarely does or never does in my case".
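
E.g. the completion side only has to special-case the stall (sketch;
the requeue/rearm/recycle helpers are placeholders for whatever
bookkeeping the app already does):

/* Sketch: completion handling for the provided-buffer (multishot)
 * send. The normal path just recycles buffers; only an error or a
 * stopped chain (no IORING_CQE_F_MORE, e.g. short send or out of
 * buffers) needs the remainder requeued and the send re-armed. */
static void handle_send_cqe(struct conn *c, struct io_uring_cqe *cqe)
{
    if (cqe->res < 0) {
        handle_send_error(c, cqe->res);         /* placeholder */
        return;
    }
    if (!(cqe->flags & IORING_CQE_F_MORE)) {
        requeue_remainder(c, cqe->res);         /* placeholder */
        rearm_send(c);                          /* placeholder */
        return;
    }
    recycle_buffers(c, cqe);                    /* normal path */
}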

-- 
Jens Axboe




