here’s the design i’m flirting with for a "recvmsg and sendmsg zerocopy" with persistent buffers patch. we'd be looking at approx +100% throughput each on the send and recv paths (per TCP_ZEROCOPY_RECEIVE benchmarks). these would be io_uring-only operations, given the sendmsg completion logic described below. i want to get some consensus that this design could/would be acceptable for merging before i begin writing the code.

the problem with zerocopy send is the asynchronous ACK from the NIC confirming transmission: the buffer can't be reused until then, and you can’t just block in a syscall until it arrives. MSG_ZEROCOPY tackled this by putting the ACK on the socket's error queue (read back with MSG_ERRQUEUE). but that logic is very disjointed: it requires a double completion (once from sendmsg when the send is enqueued, and again when the NIC ACKs the transmission), plus costly userspace bookkeeping.

so what i propose instead is to exploit the asynchrony of io_uring. you’d submit the IORING_OP_SENDMSG_ZEROCOPY operation, and then sometime later receive the completion event on the ring’s completion queue (either failure, or success once ACK-ed by the NIC). 1 unified completion flow. we can somehow tag the socket as registered to io_uring; then when the NIC ACKs, instead of finding the socket's error queue and putting the completion there like MSG_ZEROCOPY does, the kernel would find the io_uring instance the socket is registered to and call into an io_uring sendmsg_zerocopy_completion function, which would push the cqe onto the completion queue.

the "recvmsg zerocopy" side is straightforward enough, mimicking TCP_ZEROCOPY_RECEIVE. i'll go into specifics next time.

the other big concern is the lifecycle of the persistent memory buffers in the case of nefarious actors. but since we already have buffer registration for O_DIRECT, i assume those mechanics already address those issues and can just be repurposed?
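to make the unified completion flow concrete, here's a tiny userspace toy model of it. none of this is kernel or liburing code: every `toy_*` name, the struct layouts and the ring size are made up for illustration, and `toy_sendmsg_zerocopy_completion` is only a stand-in for the proposed hook. it just shows the shape of the idea: the socket carries a pointer to the ring it's registered to, and the ACK path pushes a single cqe there instead of touching an error queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CQ_ENTRIES 8 /* toy ring size, must be a power of two */
_Static_assert((CQ_ENTRIES & (CQ_ENTRIES - 1)) == 0, "CQ_ENTRIES must be 2^n");

/* hypothetical cqe: echoes the submitter's tag plus a result code */
struct toy_cqe {
    uint64_t user_data;
    int32_t  res;
};

/* hypothetical completion ring with free-running head/tail counters */
struct toy_ring {
    struct toy_cqe cq[CQ_ENTRIES];
    unsigned head; /* consumer position */
    unsigned tail; /* producer position */
};

/* hypothetical socket, tagged with the ring it is registered to */
struct toy_sock {
    struct toy_ring *ring; /* NULL when the socket is not registered */
};

/* hypothetical in-flight zerocopy send, waiting for the NIC ACK */
struct toy_send {
    struct toy_sock *sk;
    uint64_t user_data;
    int32_t  len;
};

/* push one cqe; returns 0 on success, -1 if the ring is full */
static int toy_cq_push(struct toy_ring *r, uint64_t user_data, int32_t res)
{
    assert(r != NULL);
    if (r->tail - r->head == CQ_ENTRIES)
        return -1;
    struct toy_cqe *cqe = &r->cq[r->tail & (CQ_ENTRIES - 1)];
    cqe->user_data = user_data;
    cqe->res = res;
    r->tail++;
    return 0;
}

/*
 * stand-in for the proposed completion hook: called once, when the
 * (simulated) NIC ACK or a failure arrives. err is a positive errno
 * value, or 0 for success. returns -1 if the socket has no ring or the
 * ring is full.
 */
static int toy_sendmsg_zerocopy_completion(const struct toy_send *s, int err)
{
    if (s->sk->ring == NULL)
        return -1;
    return toy_cq_push(s->sk->ring, s->user_data, err ? -err : s->len);
}

/* pop one cqe into *out; returns 1 if one was available, 0 if empty */
static int toy_cq_pop(struct toy_ring *r, struct toy_cqe *out)
{
    if (r->head == r->tail)
        return 0;
    *out = r->cq[r->head & (CQ_ENTRIES - 1)];
    r->head++;
    return 1;
}
```

usage would be: fill in a `toy_send` with the socket, a tag and a length, call `toy_sendmsg_zerocopy_completion(&send, 0)` when the simulated ACK arrives, then `toy_cq_pop()` returns exactly one cqe carrying that tag. the app sees one event per send and needs no separate bookkeeping to match notifications to sends.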
with those persistent memory buffers, you'd only pay the cost of pinning the memory into the kernel once, at registration, before your server even starts listening... thus "free". versus pinning per sendmsg like with MSG_ZEROCOPY... thus "true zerocopy".
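a rough toy model of the "pin once" idea plus the lifecycle concern, again with entirely hypothetical `toy_*` names and no real kernel or liburing calls. the assumptions here are mine: pinning is counted once per registration, each send is bounds-checked against its registered buffer, and an unregister that races with in-flight sends is deferred until the last one completes. whether the existing registered-buffer mechanics already give that last guarantee is exactly the open question above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BUFS 4 /* toy table size */

/* hypothetical registered buffer */
struct toy_buf {
    char  *base;
    size_t len;
    int    pinned;        /* 1 while the pages would be pinned */
    int    inflight;      /* sends still referencing this buffer */
    int    unregistering; /* unpin deferred until inflight drops to 0 */
};

/* hypothetical per-ring buffer table */
struct toy_buf_table {
    struct toy_buf bufs[MAX_BUFS];
    unsigned long  pin_calls; /* how many times the pinning cost was paid */
};

/* register (and "pin") a buffer at slot idx; 0 on success, -1 on error */
static int toy_register(struct toy_buf_table *t, int idx, char *base, size_t len)
{
    if (idx < 0 || idx >= MAX_BUFS || t->bufs[idx].pinned)
        return -1;
    t->bufs[idx] = (struct toy_buf){ base, len, 1, 0, 0 };
    t->pin_calls++; /* the only place pinning is paid for */
    return 0;
}

/* start a send from [off, off+len) of buffer idx; no pinning happens here */
static int toy_send_begin(struct toy_buf_table *t, int idx, size_t off, size_t len)
{
    if (idx < 0 || idx >= MAX_BUFS)
        return -1;
    struct toy_buf *b = &t->bufs[idx];
    if (!b->pinned || b->unregistering)
        return -1;
    if (off > b->len || len > b->len - off) /* overflow-safe range check */
        return -1;
    b->inflight++;
    return 0;
}

/* the send's completion arrived: drop its reference, finish a deferred unpin */
static void toy_send_complete(struct toy_buf_table *t, int idx)
{
    struct toy_buf *b = &t->bufs[idx];
    assert(b->inflight > 0);
    b->inflight--;
    if (b->unregistering && b->inflight == 0) {
        b->pinned = 0;
        b->unregistering = 0;
    }
}

/* unregister: unpin now if idle, otherwise defer until sends drain */
static void toy_unregister(struct toy_buf_table *t, int idx)
{
    struct toy_buf *b = &t->bufs[idx];
    if (!b->pinned)
        return;
    if (b->inflight == 0)
        b->pinned = 0;
    else
        b->unregistering = 1;
}
```

in this model a thousand sends from one registered buffer leave `pin_calls` at 1, and a buffer that gets unregistered while a send is still in flight stays pinned until `toy_send_complete()` runs for it.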