On 11/11/2020 00:07, Victor Stewart wrote:
> On Tue, Nov 10, 2020 at 11:26 PM Pavel Begunkov <asml.silence@xxxxxxxxx> wrote:
>>> we'd be looking at approx +100% throughput each on the send and recv
>>> paths (per TCP_ZEROCOPY_RECEIVE benchmarks).
>>>
>>> these would be io_uring only operations given the sendmsg completion
>>> logic described below. want to get some consensus that this design
>>> could/would be acceptable for merging before I begin writing the code.
>>>
>>> the problem with zerocopy send is the asynchronous ACK from the NIC
>>> confirming transmission. and you can't just block on a syscall til
>>> then. MSG_ZEROCOPY tackled this by putting the ACK on the
>>> MSG_ERRQUEUE. but that logic is very disjointed and requires a double
>>> completion (once from sendmsg when the send is enqueued, and again
>>> once the NIC ACKs the transmission), and requires costly userspace
>>> bookkeeping.
>>>
>>> so what i propose instead is to exploit the asynchrony of io_uring.
>>>
>>> you'd submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then
>>> sometime later receive the completion event on the ring's completion
>>> queue (either failure or success once ACK-ed by the NIC). 1 unified
>>> completion flow.
>>
>> I thought about it after your other email. It makes sense for message
>> oriented protocols but may not for streams. That's because a user may
>> want to call
>>
>> send();
>> send();
>>
>> And expect right ordering, and that's where waiting for the ACK may
>> add a lot of latency. So returning from the call here is a
>> notification that "it's accounted, you may send more and order will
>> be preserved".
>>
>> And since ACKs may come long after, you may put a lot of code and
>> stuff between send()s and still suffer the latency (and so
>> potentially a throughput drop).
>>
>> As for me, it sounds sensible as an optional feature, and should work
>> well for some use cases. But for others it may be good to have 2
>> notifications (1. ready for the next send(), 2. ready to recycle the
>> buf). E.g. 2 CQEs, which wouldn't work without a bit of io_uring
>> patching.
>>
>
> we could make it datagram only, like check the socket was created with

no need, streams can also benefit from it.

> SOCK_DGRAM and fail otherwise... if it requires too much io_uring
> changes / possible regression to accommodate a 2 cqe mode.

May be easier to do via two requests, with the second receiving errors
(yeah, msg_control again).
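To make the proposed single-completion flow concrete, a rough userspace
sketch. It assumes the IORING_OP_SENDMSG_ZEROCOPY opcode existed (it
does not today, so the sqe opcode is patched by hand) and reuses
liburing's existing sendmsg prep helper; error handling is trimmed:

/* sketch: single-CQE zerocopy send, per the proposal above */
#include <errno.h>
#include <liburing.h>
#include <sys/socket.h>

int send_zc(struct io_uring *ring, int sockfd, struct msghdr *msg)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        struct io_uring_cqe *cqe;
        int ret;

        if (!sqe)
                return -EBUSY;

        io_uring_prep_sendmsg(sqe, sockfd, msg, 0);
        sqe->opcode = IORING_OP_SENDMSG_ZEROCOPY; /* hypothetical opcode */
        io_uring_submit(ring);

        /* one CQE, posted only once the NIC has ACK-ed the transmission;
         * no MSG_ERRQUEUE, no second completion to track in userspace */
        ret = io_uring_wait_cqe(ring, &cqe);
        if (ret < 0)
                return ret;
        ret = cqe->res;                 /* bytes sent, or -errno */
        io_uring_cqe_seen(ring, cqe);
        return ret;                     /* buffer is reusable from here on */
}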
>>> we can somehow tag the socket as registered to io_uring, then when the
>>
>> I'd rather tag a request
>
> as long as the NIC is able to find / callback the ring about the
> transmission ACK, whatever the path of least resistance is, is best.
>
>>
>>> NIC ACKs, instead of finding the socket's error queue and putting the
>>> completion there like MSG_ZEROCOPY, the kernel would find the io_uring
>>> instance the socket is registered to and call into an io_uring
>>> sendmsg_zerocopy_completion function. Then the cqe would get pushed
>>> onto the completion queue.
>>>
>>> the "recvmsg zerocopy" is straightforward enough. mimicking
>>> TCP_ZEROCOPY_RECEIVE, i'll go into specifics next time.
>>
>> Receive side is inherently messed up. IIRC, TCP_ZEROCOPY_RECEIVE just
>> maps skbuffs into userspace, and in general, unless there is a better
>> suited protocol (e.g. infiniband with richer src/dst tagging) or a
>> very very smart NIC, "true zerocopy" is not possible without breaking
>> multiplexing.
>>
>> For registered buffers you still need to copy skbuffs, at least
>> because of security implications.
>
> we can actually just force those buffers to be mmap-ed, and then when
> packets arrive use vm_insert_page or remap_pfn_range to change the
> physical pages backing the virtual memory pages submitted for reading
> via msg_iov. so it's transparent to userspace but still zerocopy.
> (might require the user to notify io_uring when reading is
> completed... but no matter).

Yes, with io_uring zerocopy-recv may be done better than
TCP_ZEROCOPY_RECEIVE, but 1) it's still a remap. Yes, zerocopy, but not
ideal. And 2) it won't work with registered buffers, which are
basically a set of pinned pages that have a userspace mapping. After
such a remap that mapping wouldn't be in sync, and that gets messy.

>>> the other big concern is the lifecycle of the persistent memory
>>> buffers in the case of nefarious actors. but since we already have
>>> buffer registration for O_DIRECT, I assume those mechanics already
>>
>> just buffer registration, not specifically for O_DIRECT
>>
>>> address those issues and can just be repurposed?
>>
>> Depending on how long it could be stuck in the net stack, we might
>> need to be able to cancel those requests. That may be a problem.
>
> I spoke about this idea with Willem the other day and he mentioned...
>
> "As long as the mappings aren't unwound on process exit. But then you

The pages won't be unpinned until all related requests are gone, but
for that, on exit io_uring waits for them to complete. That's one of
the reasons why requests should either be cancellable, or short-lived
and somewhat predictably time-bound.

> open up to malicious applications that purposely register ranges and
> then exit. The basics are straightforward to implement, but it's not
> that easy to arrive at something robust."
>
>>
>>>
>>> and so with those persistent memory buffers, you'd only pay the cost
>>> of pinning the memory into the kernel once upon registration, before
>>> you even start your server listening... thus "free". versus pinning
>>> per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
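For reference, the registration step that last paragraph describes is
the existing io_uring_register_buffers() API. A minimal sketch of
paying the pinning cost once, up front (buffer size and alignment are
arbitrary here):

/* pin a buffer exactly once, before the server starts serving */
#include <liburing.h>
#include <stdlib.h>

int setup_fixed_buf(struct io_uring *ring, struct iovec *iov, size_t len)
{
        if (posix_memalign(&iov->iov_base, 4096, len))
                return -1;
        iov->iov_len = len;

        /* the pages are pinned here, once; later *_FIXED requests that
         * reference buf_index 0 skip the per-call pinning cost */
        return io_uring_register_buffers(ring, iov, 1);
}

--
Pavel Begunkov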