Re: io_uring-only sendmsg + recvmsg zerocopy

On 11/11/2020 00:07, Victor Stewart wrote:
> On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
>>> we'd be looking at approx +100% throughput each on the send and recv
>>> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>>>
>>> these would be io_uring-only operations given the sendmsg completion
>>> logic described below. want to get some consensus that this design
>>> could/would be acceptable for merging before I begin writing the code.
>>>
>>> the problem with zerocopy send is the asynchronous ACK from the NIC
>>> confirming transmission. and you can't just block on a syscall until
>>> then. MSG_ZEROCOPY tackled this by putting the ACK on the
>>> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
>>> completion (once from sendmsg when the send is enqueued, and again
>>> once the NIC ACKs the transmission), plus costly userspace
>>> bookkeeping.
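For context, that double completion looks roughly like this, per
Documentation/networking/msg_zerocopy.rst (SO_ZEROCOPY setup, error
handling and notification batching trimmed):

#include <sys/socket.h>
#include <linux/errqueue.h>

static void send_zc(int fd, const void *buf, size_t len)
{
    /* completion #1: returns once the send is enqueued; buf must
     * stay untouched until the error-queue notification arrives */
    send(fd, buf, len, MSG_ZEROCOPY);
}

static void wait_zc_ack(int fd)
{
    char control[128];
    struct msghdr msg = { .msg_control = control,
                          .msg_controllen = sizeof(control) };
    struct cmsghdr *cm;
    struct sock_extended_err *serr;

    /* completion #2: the "ACK" arrives on the error queue */
    if (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1)
        return;
    cm = CMSG_FIRSTHDR(&msg);
    if (!cm)
        return;
    serr = (struct sock_extended_err *)CMSG_DATA(cm);
    if (serr->ee_errno == 0 && serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY)
        /* buffers for sends serr->ee_info..serr->ee_data
         * may now be reused */ ;
}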
>>>
>>> so what i propose instead is to exploit the asynchrony of io_uring.
>>>
>>> you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
>>> sometime later receive the completion event on the ring’s completion
>>> queue (either failure or success once ACKed by the NIC). one unified
>>> completion flow.
>>
>> I thought about it after your other email. It makes sense for message-
>> oriented protocols but may not for streams. That's because a user
>> may want to call
>>
>> send();
>> send();
>>
>> And expect right ordering, and that's where waiting for an ACK may add a lot
>> of latency, so returning from the call here is a notification that "it's
>> accounted, you may send more and order will be preserved".
>>
>> And since ACKs may come a long time after, you may put a lot of code and
>> stuff between send()s and still suffer the latency (and so potentially a
>> throughput drop).
>>
>> As for me, it sounds sensible as an optional feature, and should work well
>> for some use cases. But for others it may be good to have 2 notifications
>> (1. ready for the next send(), 2. ready to recycle the buf), e.g. 2 CQEs,
>> though that wouldn't work without a bit of io_uring patching.
>>
> 
> we could make it datagram only, like check the socket was created with

no need, streams can also benefit from it.

> SOCK_DGRAM and fail otherwise... if it requires too many io_uring
> changes / possible regressions to accommodate a 2-CQE mode.

May be easier to do via two requests with the second receiving
errors (yeah, msg_control again).
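Roughly, with the existing liburing prep helpers (untested sketch; it
assumes the MSG_ZEROCOPY flag passes through the io_uring sendmsg path
unchanged, and the msg_control parsing is the same error-queue dance
as above):

#include <liburing.h>
#include <sys/socket.h>

static void queue_zc_send(struct io_uring *ring, int sock,
                          struct msghdr *data, struct msghdr *err)
{
    struct io_uring_sqe *sqe;

    /* CQE #1: the send is accounted, ordering preserved, submit more */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_sendmsg(sqe, sock, data, MSG_ZEROCOPY);
    sqe->user_data = 1;

    /* CQE #2: drains the error queue, i.e. the buffer is reusable */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_recvmsg(sqe, sock, err, MSG_ERRQUEUE);
    sqe->user_data = 2;

    io_uring_submit(ring);
}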

>>> we can somehow tag the socket as registered to io_uring, then when the
>>
>> I'd rather tag a request
> 
> as long as the NIC is able to find / call back into the ring about the
> transmission ACK, whatever the path of least resistance is works best.
> 
>>
>>> NIC ACKs, instead of finding the socket's error queue and putting the
>>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
>>> instance the socket is registered to and call into an io_uring
>>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
>>> onto the completion queue.
>>>
>>> the "recvmsg zerocopy" is straightforward enough. mimicking
>>> TCP_ZEROCOPY_RECEIVE, i'll go into specifics next time.
>>
>> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
>> maps skbuffs into userspace, and in general, unless there is a better
>> suited protocol (e.g. infiniband with richer src/dst tagging) or a very
>> very smart NIC, "true zerocopy" is not possible without breaking
>> multiplexing.
>>
>> For registered buffers you still need to copy the skbuff, at least because
>> of security implications.
> 
> we can actually just force those buffers to be mmap-ed, and then when
> packets arrive use vm_insert_page or remap_pfn_range to change the
> physical pages backing the virtual memory pages submitted for reading
> via msg_iov. so it's transparent to userspace but still zerocopy.
> (might require the user to notify io_uring when reading is
> completed... but no matter).

Yes, with io_uring zerocopy-recv may be done better than
TCP_ZEROCOPY_RECEIVE, but
1) it's still a remap. Yes, zerocopy, but not ideal.
2) it won't work with registered buffers, which are basically a set
of pinned pages that have a userspace mapping. After such a remap
that mapping wouldn't be in sync, and that gets messy.
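For comparison, TCP_ZEROCOPY_RECEIVE drives its remap from userspace
by mmap()ing a window over the socket and then asking the kernel to
fill it, roughly:

#include <sys/mman.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/tcp.h>    /* struct tcp_zerocopy_receive */

static void zc_recv(int fd, size_t chunk)
{
    void *addr = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);
    struct tcp_zerocopy_receive zc = {
        .address = (__u64)(unsigned long)addr,
        .length  = chunk,
    };
    socklen_t zc_len = sizeof(zc);

    /* received pages get remapped into [addr, addr + zc.length);
     * zc.recv_skip_hint bytes, if any, must be recv()ed normally */
    getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len);
}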

>>> the other big concern is the lifecycle of the persistent memory
>>> buffers in the case of nefarious actors. but since we already have
>>> buffer registration for O_DIRECT, I assume those mechanics already
>>
>> just buffer registration, not specifically for O_DIRECT
>>
>>> address those issues and can just be repurposed?
>>
>> Depending on how long it could be stuck in the net stack, we might need
>> to be able to cancel those requests. That may be a problem.
> 
> I spoke about this idea with Willem the other day and he mentioned...
> 
> "As long as the mappings aren't unwound on process exit. But then you

The pages won't be unpinned until all/related requests are gone, but
for that, on exit, io_uring waits for them to complete. That's one of the
reasons why either requests should be cancellable or short-lived and
somewhat predictably time-bound.

> open up to malicious applications that purposely register ranges and
> then exit. The basics are straightforward to implement, but it's not
> that easy to arrive at something robust."
> 
>>
>>>
>>> and so with those persistent memory buffers, you'd only pay the cost
>>> of pinning the memory into the kernel once upon registration, before
>>> you even start your server listening... thus "free". versus pinning
>>> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
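That one-off pinning is just the existing registration call; a
(hypothetical) IORING_OP_SENDMSG_ZEROCOPY would then refer to the
buffers by index instead of re-pinning per send, e.g.:

#include <liburing.h>
#include <sys/mman.h>

/* pin the payload buffers exactly once, before the server starts
 * listening; requests later reference them by buf_index, with no
 * per-send page-pinning cost */
static int setup_buffers(struct io_uring *ring, size_t size)
{
    struct iovec iov = {
        .iov_base = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0),
        .iov_len  = size,
    };

    return io_uring_register_buffers(ring, &iov, 1);
}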
-- 
Pavel Begunkov


