On 9/5/22 4:09 PM, Pavel Begunkov wrote:
> Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
> ---
>
> Doc writing is not my strongest side, comments are welcome.
>
> man/io_uring_enter.2 | 44 ++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 44 insertions(+)
>
> diff --git a/man/io_uring_enter.2 b/man/io_uring_enter.2
> index 1a9311e..7fd275c 100644
> --- a/man/io_uring_enter.2
> +++ b/man/io_uring_enter.2
> @@ -1059,6 +1059,50 @@ value being passed in. This request type can be used to either just wake or
> interrupt anyone waiting for completions on the target ring, or it can be used
> to pass messages via the two fields. Available since 5.18.
>
> +.TP
> +.B IORING_OP_SEND_ZC
> +Issue the zerocopy equivalent of a
> +.BR send(2)
> +system call. It's similar to IORING_OP_SEND, but when the
> +.I flags
> +field of the
> +.I "struct io_uring_cqe"
> +contains IORING_CQE_F_MORE, the userspace should expect a second cqe, a.k.a.
> +notification, and until then it should not modify data in the buffer. The
> +notification will have the same
> +.I user_data
> +as the first one and its
> +.I flags
> +field will contain the
> +.I IORING_CQE_F_NOTIF
> +flag. It's guaranteed that IORING_CQE_F_MORE is set IFF the result is
> +non-negative.
> +.I fd
> +must be set to the socket file descriptor,
> +.I addr
> +must contain a pointer to the buffer,
> +.I len
> +denotes the length of the buffer to send, and
> +.I msg_flags
> +holds the flags associated with the system call. When
> +.I addr2
> +is non-zero it points to the address of the target with
> +.I addr_len
> +specifying its size, turning the request into a
> +.BR sendto(2)
> +system call equivalent.
> +
> +.B IORING_OP_SEND_ZC
> +tries to avoid making intermediate data copies but still may fall back to
> +copying. Furthermore, zerocopy is not always faster, especially when the
> +per-request payload size is small. The two completion model is needed because
> +the kernel might hold on to buffers for a long time, e.g. waiting for a TCP ACK,
> +and having a separate cqe for request completions allows the userspace to push
> +more data without extra delays. Note, notifications don't guarantee that the
> +data has been or will ever be received by the other endpoint.

I'd probably reorder this a bit to introduce it with the fact that it's
like SEND, but zero-copy. Then explain the mechanics of how MORE is set
for the 2 stage completion notification if zc is done. I can shuffle it
around a bit if you want me to - just let me know!

> +Available since 5.20.

Should be 6.0 here.

--
Jens Axboe
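
For reference, the two-stage completion that the quoted text describes can be sketched roughly as below. This is only an illustration of the flag handling, not liburing code: struct cqe_view, struct zc_send_state and zc_handle_cqe() are hypothetical names, and the two flag values are assumed to match IORING_CQE_F_MORE and IORING_CQE_F_NOTIF in the uapi header. A real application would include <linux/io_uring.h> and read these fields from struct io_uring_cqe.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the uapi values; defined locally so the sketch
 * builds without kernel headers. */
#define CQE_F_MORE  (1U << 1)   /* stands in for IORING_CQE_F_MORE  */
#define CQE_F_NOTIF (1U << 3)   /* stands in for IORING_CQE_F_NOTIF */

/* Hypothetical stand-in for the struct io_uring_cqe fields used here. */
struct cqe_view {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

/* Hypothetical per-request bookkeeping kept by the application. */
struct zc_send_state {
    bool    request_done;  /* first (request) cqe has been seen */
    bool    buffer_busy;   /* kernel may still reference the buffer */
    int32_t result;        /* res from the request cqe */
};

/* Feed every cqe carrying this request's user_data through here. */
static void zc_handle_cqe(struct zc_send_state *st, const struct cqe_view *cqe)
{
    if (cqe->flags & CQE_F_NOTIF) {
        /* Notification cqe: the buffer may be modified or reused now. */
        st->buffer_busy = false;
        return;
    }

    /* Request cqe: record the send result. */
    st->request_done = true;
    st->result = cqe->res;

    /* F_MORE means a notification will follow, so the buffer must stay
     * untouched until it arrives. Without F_MORE no second cqe is coming. */
    st->buffer_busy = (cqe->flags & CQE_F_MORE) != 0;
}
```

The sketch keys off the MORE flag rather than the sign of res, so the application can queue more data as soon as request_done is set while leaving the buffer alone until buffer_busy clears.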