Re: [PATCH 6/8] io_uring/net: support multishot for send

On 2/26/24 12:21 PM, Pavel Begunkov wrote:
> On 2/26/24 19:11, Jens Axboe wrote:
>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>> On Mon, Feb 26, 2024 at 1:38 PM Jens Axboe
>>>>>>> <axboe@xxxxxxxxx> wrote:
>>>>>>>> 
>>>>>>>> On 2/26/24 3:47 AM, Dylan Yudaken wrote:
>>>>>>>>> On Sun, Feb 25, 2024 at 12:46 AM Jens Axboe
>>>>>>>>> <axboe@xxxxxxxxx> wrote:
>>>>>>>>>> 
>>>>>>>>>> This works very much like the receive side, except
>>>>>>>>>> for sends. The idea is that an application can fill
>>>>>>>>>> outgoing buffers in a provided buffer group, and
>>>>>>>>>> then arm a single send that will service them all.
>>>>>>>>>> For now this variant just terminates when we are
>>>>>>>>>> out of buffers to send, and hence the application
>>>>>>>>>> needs to re-arm it if IORING_CQE_F_MORE isn't set,
>>>>>>>>>> as per usual for multishot requests.
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> This feels to me a lot like just using OP_SEND with
>>>>>>>>> MSG_WAITALL as described, unless I'm missing
>>>>>>>>> something?
>>>>>>>> 
>>>>>>>> How so? MSG_WAITALL is "send X amount of data, and if
>>>>>>>> it's a short send, try again" where multishot is "send
>>>>>>>> data from this buffer group, and keep sending data
>>>>>>>> until it's empty". Hence it's the mirror of multishot 
>>>>>>>> on the receive side. Unless I'm misunderstanding you
>>>>>>>> somehow, not sure it'd be smart to add special meaning
>>>>>>>> to MSG_WAITALL with provided buffers.
>>>>>>>> 
>>>>>>> 
>>>>>>> _If_ you have the data upfront these are very similar,
>>>>>>> and only differ in that the multishot approach will give
>>>>>>> you more granular progress updates. My point was that
>>>>>>> this might not be a valuable API to people for only this
>>>>>>> use case.
>>>>>> 
>>>>>> Not sure I agree, it feels like attributing a different
>>>>>> meaning to MSG_WAITALL if you use a provided buffer vs if
>>>>>> you don't. And that to me would seem to be confusing.
>>>>>> Particularly when we have multishot on the receive side,
>>>>>> and this is identical, just for sends. Receives will keep 
>>>>>> receiving as long as there are buffers in the provided
>>>>>> group to receive into, and sends will keep sending for the
>>>>>> same condition. Either one will terminate if we run out of
>>>>>> buffers.
>>>>>> 
>>>>>> If you make MSG_WAITALL be that for provided buffers +
>>>>>> send, then that behaves differently than MSG_WAITALL with
>>>>>> receive, and MSG_WAITALL with send _without_ provided
>>>>>> buffers. I don't think overloading an existing flag for
>>>>>> this purpose is a good idea, particularly when we already
>>>>>> have the existing semantics for multishot on the receive
>>>>>> side.
>>>>> 
>>>>> I'm actually with Dylan on that and wonder where the perf
>>>>> win could come from. Let's assume TCP, sends are usually
>>>>> completed in the same syscall, otherwise your pacing is just
>>>>> bad. Thrift, for example, collects sends and packs into one
>>>>> multi iov request during a loop iteration. If the req
>>>>> completes immediately then the userspace just wouldn't have
>>>>> time to push more buffers by definition (assuming single
>>>>> threading).
>>>> 
>>>> The problem only occurs when they don't complete inline, and
>>>> now you get reordering. The application could of course attempt
>>>> to do proper pacing and see if it can avoid that condition. If
>>>> not, it now needs to
>>> 
>>> Ok, I admit that there are more than valid cases when artificial
>>> pacing is not an option, which is why I also laid out the polling
>>> case. Let's also say that this limits potential perf wins to
>>> streaming and very large transfers (like files), not "lots of
>>> relatively small request-response" kinds of apps.
>> 
>> I don't think that's true - if you're doing large streaming, you're
>> more likely to keep the socket buffer full, whereas for smallish
>> sends, it's less likely to be full. Testing with the silly proxy
>> confirms that. And
> 
> I don't see any contradiction to what I said. With streaming/large 
> sends it's more likely to be polled. For small sends and 
> send-receive-send-... patterns the sock queue is unlikely to be full,
> in which case the send is processed inline, and so the feature
> doesn't add performance, as you agreed a couple of emails before.

Gotcha, I guess I misread you, we agree that the poll side is more
likely on bigger buffers.

>> outside of cases where pacing just isn't feasible, it's extra
>> overhead for cases where you potentially could pace.
> 
> I lost it, what overhead?

The overhead of needing to serialize sends in the application, which
may mean both extra memory and the bookkeeping to deal with it.

>> To me, the main appeal of this is the simplicity.
> 
> I'd argue it doesn't seem any simpler than the alternative.

It's certainly simpler for an application to just "add buffer to
queue" and not need to worry about managing sends, than it is to
manage a backlog while only ever having a single send active.
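
As a rough sketch of what that flow looks like for the application,
using liburing and the IORING_SEND_MULTISHOT naming from this series
(not in any released kernel or liburing, so treat the names as
illustrative):

#include <liburing.h>

#define BGID    0       /* provided buffer group for this socket */
#define NBUFS   8

/* "add buffer to queue": hand one outgoing buffer to the kernel */
static void queue_send_buf(struct io_uring_buf_ring *br, void *data,
                           unsigned int len, unsigned short bid)
{
        io_uring_buf_ring_add(br, data, len, bid,
                              io_uring_buf_ring_mask(NBUFS), 0);
        io_uring_buf_ring_advance(br, 1);
}

/* one armed send then services every buffer in group BGID, in order */
static void arm_multishot_send(struct io_uring *ring, int sockfd)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_send(sqe, sockfd, NULL, 0, 0);
        sqe->flags |= IOSQE_BUFFER_SELECT;
        sqe->buf_group = BGID;
        sqe->ioprio |= IORING_SEND_MULTISHOT;   /* name per this series */
}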

>>>> serialize sends. Using provided buffers makes this very easy,
>>>> as you don't need to care about it at all, and it eliminates
>>>> complexity in the application dealing with this.
>>> 
>>> If I'm correct the example also serialises sends(?). I don't
>>> think it's that much simpler. You batch, you send. Same with this,
>>> but you batch into a provided buffer and the send is conditional.
>> 
>> Do you mean the proxy example? Just want to be sure we're talking
>> about
> 
> Yes, proxy, the one you referenced in the CV. And FWIW, I don't think
> it's a fair comparison without batching followed by multi-iov.

It's not about vectored vs non-vectored IO, though with that approach
you would of course need to allocate an arbitrarily sized iovec that
you can append to. And then you need to use sendmsg rather than just
send, which has further overhead on top of send.

What kind of batching? The batching done by the tests is the same,
regardless of whether or not send multishot is used, in the sense that
we wait on the same number of completions. As it's a basic proxy kind
of thing, it'll receive a packet and send a packet. Submission
batching is the same too, we'll submit when we have to.
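
For reference, the iovec-append alternative being discussed would look
roughly like the below. A sketch only; the batch collection itself is
hypothetical application code, and the msghdr has to stay valid until
the completion arrives:

#include <liburing.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* flush a userspace-collected batch with one vectored send */
static void flush_pending(struct io_uring *ring, int sockfd,
                          struct msghdr *msg, struct iovec *iov,
                          unsigned int nr_pending)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        memset(msg, 0, sizeof(*msg));
        msg->msg_iov = iov;
        msg->msg_iovlen = nr_pending;
        /* sendmsg rather than send, which is where the extra cost is */
        io_uring_prep_sendmsg(sqe, sockfd, msg, 0);
}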

>> the same thing. Yes it has to serialize sends, because otherwise we
>> can run into the condition described in the patch that adds
>> provided buffer support for send. But I did bench multishot
>> separately from there, here's some of it:
>> 
>> 10G network, 3 hosts, 1 acting as a mirror proxy shuffling N-byte
>> packets. Send ring and send multishot not used:
>> 
>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw
>> =====================================================
>> 1000   |    No     |  No   |   437  | 1.22M | 9598M
>> 32     |    No     |  No   |  5856  | 2.87M |  734M
>> 
>> Same test, now turn on send ring:
>> 
>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   |  Diff
>> ===========================================================
>> 1000   |    Yes    |  No   |   436  | 1.23M | 9620M | + 0.2%
>> 32     |    Yes    |  No   |  3462  | 4.85M | 1237M | +68.5%
>> 
>> Same test, now turn on send mshot as well:
>> 
>> Pkt sz | Send ring | mshot |  usec  |  QPS  |  Bw   |  Diff
>> ===========================================================
>> 1000   |    Yes    |  Yes  |   436  | 1.23M | 9620M | + 0.2%
>> 32     |    Yes    |  Yes  |  3125  | 5.37M | 1374M | +87.2%
>> 
>> which does show that there's another win on top for just queueing
>> these sends and doing a single send to handle them, rather than
>> needing to prepare a send for each buffer. Part of that may be that
>> you simply run out of SQEs and then have to submit regardless of
>> where you are at.
> 
> How many sockets did you test with? It's 1 SQE per sock max

The above is just one, but I've run it with a lot more sockets.
Nothing like thousands, but 64-128.

> +87% sounds like a huge difference, and I don't understand where it
> comes from, hence the question

There are several things:

1) Fact is that the app has to serialize sends for the unlikely case
   of sends being reordered because of the condition outlined in the
   patch that enables provided buffer support for send. This is the
   largest win, particularly with smaller packets, as it ruins the
   send pipeline.

2) We're posting fewer SQEs. That's the multishot win. Obviously not
   as large, but it does help.

People have asked in the past how to serialize sends, and I've had to
tell them that it isn't really possible. The only options we had were
drain or links, which aren't ideal nor very flexible. Using provided
buffers finally gives the application a way to do that without really
needing to do anything. Does every application need it? Certainly not,
but for the ones that do, I do think it provides a great alternative
that performs better than doing one send at a time.
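
For comparison, the link-based workaround looks something like the
below: each send only starts once the previous one has completed, so
they can't reorder, but an error in one cancels the remainder of the
chain. Rough sketch using stock liburing:

#include <liburing.h>
#include <sys/uio.h>

static int queue_linked_sends(struct io_uring *ring, int sockfd,
                              struct iovec *bufs, unsigned int nr)
{
        for (unsigned int i = 0; i < nr; i++) {
                struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

                if (!sqe)
                        return -1;
                io_uring_prep_send(sqe, sockfd, bufs[i].iov_base,
                                   bufs[i].iov_len, 0);
                /* link to the next send; the last one ends the chain */
                if (i + 1 < nr)
                        sqe->flags |= IOSQE_IO_LINK;
        }
        return io_uring_submit(ring);
}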

>>> Another downside is that you need a provided queue per socket,
>>> which sounds pretty expensive for apps with 100s if not 1000s of
>>> sockets.
>> 
>> That's certainly true. But either you need backlog per socket
>> anyway in the app, or you only send single buffers anyway (in a
>> typical request/response kind of fashion) between receives and you
>> don't need it at all.
> 
> That's pinning pages and mapping them, which surely is not bad, but
> with everything else equal malloc()/stack alloc is much nicer in
> terms of resources. (Not talking about CPU setup overhead.)

Sure, it's not free in terms of memory either. As mentioned several
times, the main win is on efficiency and in reducing complexity, and
both of those are pretty nice imho.

>>>>> If you actually need to poll tx, you send a request and
>>>>> collect data into iov in userspace in background. When the
>>>>> request completes you send all that in batch. You can
>>>>> probably find a niche example when batch=1 in this case, but
>>>>> I don't think anyone would care.
>>>>> 
>>>>> The example doesn't use multi-iov, and also still has to 
>>>>> serialise requests, which naturally serialises buffer
>>>>> consumption w/o provided bufs.
>>>> 
>>>> IMHO there's no reason NOT to have both a send with provided
>>>> buffers and a multishot send. The alternative would be to have
>>>> send-N, where you pass in N. But I don't see much point to that
>>>> over "just drain the whole pending list". The obvious use case
>>>> is definitely send multishot, but
>>> 
>>> Not sure I follow, but in all the cases I was contemplating, you
>>> send everything you have at the moment.
>>> 
>>>> what would the reasoning be to prohibit pacing by explicitly
>>>> disallowing only doing a single buffer (or a partial queue)? As
>>>> mentioned earlier, I like keeping the symmetry with the receive
>>>> side for multishot, and not make it any different unless
>>>> there's a reason to.
>>> 
>>> They are different: buffer content is kernel (rx) vs userspace
>>> (tx) provided, and the provided queue / group is per socket vs
>>> shared. Wakeups for multishots as per below. It's not like it's a
>>> one line change, so IMHO it needs to bring some benefits.
>> 
>> Are you talking about provided buffers, or multishot specifically?
>> I
> 
> I assumed that either of them would retry until the queue is
> exhausted; at least that sounds more efficient and is what's used in
> all the comments.

That is what it does, it'll keep sending until it runs out of buffers
(or hits an error, short send, whatever).

>> think both are standalone pretty much as simple as they can be. And
>> if the argument is "just have send with provided buffers be
>> multishot by default",
> 
> It's not, rx and tx are different, e.g. true tx multishot doesn't 
> seem to be possible because of that.

In the sense that the rx/poll trigger of "data is now available" isn't
feasible on the send side, yeah, they are not exact mirrors of each
other. But they are as close as they can be. If there was, or ever will
be, an efficient way to re-trigger a multishot send, that would
certainly be a doable and easy addition on top of this. It really only
changes the termination point: if you run out of buffers, you just go
to whatever arming method would be suitable for that. But since the
reason for recv multishot is to avoid hammering on the locking on the
poll side, I'm not convinced that having a perpetual multishot send
would make a lot more sense than simply doing another one when needed.
If you're socket buffer bound on multishot send, then the perpetual
poll trigger works and is useful.
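
Concretely, the termination and re-arm logic on the completion side
would look something like this sketch, with arm_multishot_send() being
the same hypothetical helper as sketched earlier:

#include <liburing.h>

static void handle_send_cqe(struct io_uring *ring, int sockfd,
                            struct io_uring_cqe *cqe)
{
        /* MORE set: the request is still armed, more CQEs will come */
        if (cqe->flags & IORING_CQE_F_MORE)
                return;

        /* terminated with an error or a short send */
        if (cqe->res < 0)
                return;

        /* out of provided buffers; re-arm once there's more to send */
        arm_multishot_send(ring, sockfd);
        io_uring_submit(ring);
}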

>> then that single patch is basically the two patches combined. 
>> There's no simplification there. Outside of a strong argument for
>> why it would never make sense to do single shot send with provided
>> buffers, I really don't want to combine them into one single
>> action.
> 
> In the current form it does make more sense to have multishot
> optionally.

I obviously agree on that too, kept them separate in the v4 posting as
well.

-- 
Jens Axboe
