Re: [PATCH 6/8] io_uring/net: support multishot for send

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 2/28/24 5:39 AM, Pavel Begunkov wrote:
> On 2/26/24 21:27, Jens Axboe wrote:
>> On 2/26/24 1:51 PM, Pavel Begunkov wrote:
>>> On 2/26/24 20:12, Jens Axboe wrote:
>>>> On 2/26/24 12:21 PM, Pavel Begunkov wrote:
>>>>> On 2/26/24 19:11, Jens Axboe wrote:
>>>>>> On 2/26/24 8:41 AM, Pavel Begunkov wrote:
>>>>>>> On 2/26/24 15:16, Jens Axboe wrote:
>>>>>>>> On 2/26/24 7:36 AM, Pavel Begunkov wrote:
>>>>>>>>> On 2/26/24 14:27, Jens Axboe wrote:
>>>>>>>>>> On 2/26/24 7:02 AM, Dylan Yudaken wrote:
>>>>>>>>>>> On Mon, Feb 26, 2024 at 1:38?PM Jens Axboe
>>> ...
>>>>>> I don't think that's true - if you're doing large streaming, you're
>>>>>> more likely to keep the socket buffer full, whereas for smallish
>>>>>> sends, it's less likely to be full. Testing with the silly proxy
>>>>>> confirms that. And
>>>>>
>>>>> I don't see any contradiction to what I said. With streaming/large
>>>>> sends it's more likely to be polled. For small sends and
>>>>> send-receive-send-... patterns the sock queue is unlikely to be full,
>>>>> in which case the send is processed inline, and so the feature
>>>>> doesn't add performance, as you agreed a couple email before.
>>>>
>>>> Gotcha, I guess I misread you, we agree that the poll side is more
>>>> likely on bigger buffers.
>>>>
>>>>>> outside of cases where pacing just isn't feasible, it's extra
>>>>>> overhead for cases where you potentially could or what.
>>>>>
>>>>> I lost it, what overhead?
>>>>
>>>> Overhead of needing to serialize the sends in the application, which may
>>>> include both extra memory needed and overhead in dealing with it.
>>>
>>> I think I misread the code. Does it push 1 request for each
>>> send buffer / queue_send() in case of provided bufs?
>>
>> Right, that's the way it's currently setup. Per send (per loop), if
>> you're using provided buffers, it'll do a send per buffer. If using
>> multishot on top of that, it'll do one send per loop regardless of the
>> number of buffers.
> 
> Ok, then I'd say the performance tests are misleading, at least
> the proxy one. For selected buffers, sending tons of requests sounds
> unnecessarily expensive, but I don't have numbers to back it. The
> normal case must employ the natural batching it has with
> serialisation.
> 
> It's important comparing programs that are optimised and close
> to what is achievable in userspace. You wouldn't compare it with
> 
> while (i = 0; i < nr_bytes; i++)
>     send(data+i, bytes=1);
> 
> and claim that it's N times faster. Or surely we can add an
> empty "for" loop to one side and say "users are stupid and
> will put garbage here anyway, so fair game", but then it
> needs a note in small font "valid only for users who can't
> write code up to standard"
> 
> static void defer_send() {
>     ps = malloc(sizeof(*ps));
> }
> 
> fwiw, maybe malloc is not really expensive there, but
> that sounds like a pretty bad practice for a hi perf
> benchmark.

It's comparing using send using a manually managed backlog, or using
send where you don't have to do that. I think that's pretty valid by
itself. If you don't do that, you need to use sendmsg and append iovecs.
Which is obviously also a very valid use case and would avoid the
backlog, at the cost of sendmsg being more expensive than send. If done
right, the sendmsg+append would be a lot closer to send with multishot.

When I have some time I can do add the append case, or feel free to do
that yourself, and I can run some testing with that too.

>>> Anyway, the overhead of serialisation would be negligent.
>>> And that's same extra memory you keep for the provided buffer
>>> pool, and you can allocate it once. Also consider that provided
>>> buffers are fixed size and it'd be hard to resize without waiting,
>>> thus the userspace would still need to have another, userspace
>>> backlog, it can't just drop requests. Or you make provided queues
>>> extra large, but it's per socket and you'd wasting lots of memory.
>>>
>>> IOW, I don't think this overhead could anyhow close us to
>>> the understanding of the 30%+ perf gap.
>>
>> The 32-byte case is obviously somewhat pathological, as you're going to
>> be much better off having a bunch of these pipelined rather than issued
>> serially. As you can see from the 1000 byte packets, at that point it
>> doesn't matter that much. It's mostly about making it simpler at that
>> point.
>>
>>>>>> To me, the main appeal of this is the simplicity.
>>>>>
>>>>> I'd argue it doesn't seem any simpler than the alternative.
>>>>
>>>> It's certainly simpler for an application to do "add buffer to queue"
>>>> and not need to worry about managing sends, than it is to manage a
>>>> backlog of only having a single send active.
>>>
>>> They still need to manage / re-queue sends. And maybe I
>>> misunderstand the point, but it's only one request inflight
>>> per socket in either case.
>>
>> Sure, but one is a manageable condition, the other one is not. If you
>> can keep N inflight at the same time and only abort the chain in case of
>> error/short send, that's a corner case. Versus not knowing when things
>> get reordered, and hence always needing to serialize.
> 
> Manageable or not, you still have to implement all that, whereas
> serialisation is not complex and I doubt it's anywhere expensive
> enough to overturn the picture. It seems that multishot selected
> bufs would also need serialisation, and for oneshots managing
> multiple requests when you don't know which one sends what buffer
> would be complicated in real scenarios.

What "all that"? It's pretty trivial when you have a normal abort
condition to handle it. For the sendmsg case, you can append multiple
iovecs, but you can still only have one sendmsg inflight. If you start
prepping the next one before the previous one has successfully
completed, you have the same issue again.

Only serialization you need is to only have one inflight, which is in
your best interest anyway as it would be more wasteful to submit several
which both empty that particular queue.

>>>> What kind of batching? The batching done by the tests are the same,
>>>> regardless of whether or not send multishot is used in the sense that we
>>>
>>> You can say that, but I say that it moves into the kernel
>>> batching that can be implemented in userspace.
>>
>> And then most people get it wrong or just do the basic stuff, and
>> performance isn't very good. Getting the most out of it can be tricky
>> and require extensive testing and knowledge building. I'm confident
>> you'd be able to write an efficient version, but that's not the same as
>> saying "it's trivial to write an efficient version".
> 
> It actually _is_ trivial to write an efficient version for anyone
> competent enough and having a spare day for that, which is usually
> in a form of a library

If using sendmsg.

> I'm a firm believer of not putting into the kernel what can
> already be well implemented in userspace, because the next step
> could be "firefox can be implemented in userspace, but it requires
> knowledge building, so let's implement it in the kernel". At
> least I recall nginx / HTTP servers not flying well

Sorry but that's nonsense, we add kernel features all the time that
could be done (just less efficiently) in userspace. And the firefox
comment is crazy hyperbole, not relevant here at all.

>>> In terms of "proxy", the first approximation would be to
>>> do sth like defer_send() for normal requests as well, then
>>>
>>> static void __queue_send(struct io_uring *ring, struct conn *c, int fd,
>>>               void *data, int bid, int len)
>>> {
>>>      ...
>>>
>>>      defer_send(data);
>>>
>>>      while (buf = defer_backlog.get()) {
>>>          iov[idx++] = buf;
>>>      }
>>>      msghdr->iovlen = idx;
>>>      ...
>>> }
>>
>> Yep, that's the iovec coalescing, and that could certainly be done. And
>> then you need to size the iov[] so that it's always big enough, OR
>> submit that send and still deal with managing your own backlog.
> 
> Which is as trivial as iov.push_back() or a couple of lines with
> realloc, whereas for selected buffers you would likely need to
> wait for the previous request[s] to complete, which you cannot
> do in place.

Go implement it and we can benchmark. It won't solve the sendmsg being
slower than send situation in general, but it should certainly get a lot
closer in terms of using sendmsg and serializing per send issue.

> The point is, it seems that when you'd try to write something
> error proof and real life ready, it wouldn't look simpler or
> much simpler than the approach suggested, then the performance
> is the question.

The condition for sendmsg with append and send with multishot handling
is basically the same, you need to deal checking the send result and
restarting your send from there. Obviously in practice this will
never/rarely happen unless the connection is aborted anyway. If you use
send multishot with MSG_WAITALL, then if the buffer fills you'll just
restart when it has space, nothing to handle there.

-- 
Jens Axboe





[Index of Archives]     [Linux Samsung SoC]     [Linux Rockchip SoC]     [Linux Actions SoC]     [Linux for Synopsys ARC Processors]     [Linux NFS]     [Linux NILFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]


  Powered by Linux