Pavel Begunkov wrote:
> On 4/5/24 21:04, Oliver Crumrine wrote:
> > Pavel Begunkov wrote:
> >> On 4/4/24 23:17, Oliver Crumrine wrote:
> >>> In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
> >>> specifically disabled REQ_F_CQE_SKIP, as (at least from my
> >>> understanding) the userspace program wouldn't receive the
> >>> IORING_CQE_F_MORE flag in the result value.
> >>
> >> No. IORING_CQE_F_MORE means there will be another CQE from this
> >> request, so a single CQE without IORING_CQE_F_MORE is trivially
> >> fine.
> >>
> >> The problem is the semantics, because by suppressing the first
> >> CQE you're losing the result value. You might rely on WAITALL
> > That's already happening with io_send.
>
> Right, and it's still annoying and hard to use

Another solution might be a counter that stores how many CQEs with
REQ_F_CQE_SKIP have been processed. Before exiting, userspace could
call a function like:

io_wait_completions(int completions)

which would wait until everything is done, and then userspace could
peek the completion ring.

>
> >> as other sends and "fail" (in terms of io_uring) the request
> >> in case of a partial send posting 2 CQEs, but that's not a great
> >> way and it's getting userspace complicated pretty easily.
> >>
> >> In short, it was left out for later because there is a
> >> better way to implement it, but it should be done carefully
> > Maybe we could put the return values in the notifs? That would be a
> > discrepancy between io_send and io_send_zc, though.
>
> Yes. And yes, having a custom flavour is not good. It'd only
> be well usable if apart from returning the actual result
> it also guarantees there will be one and only one CQE, then
> the userspace doesn't have to do the dancing with counting
> and checking F_MORE.
> In fact, I outlined before how a generic
> solution may look like:
>
> https://github.com/axboe/liburing/issues/824
>
> The only interesting part, IMHO, is to be able to merge the
> main completion with its notification. Below is an old stash
> rebased onto for-6.10. The only thing missing is relinking,
> but maybe we don't even care about it. I need to cover it
> well with tests.

The patch looks pretty good. The only potential issue is that you store
the res of the normal CQE into the notif CQE. This overwrites the
IORING_CQE_F_NOTIF with IORING_CQE_F_MORE. This means that the notif
would indicate to userspace that there will be another CQE, which there
won't be.

>
>
> commit ca5e4fb6d105b5dfdf3768d46ce01529b7bb88c5
> Author: Pavel Begunkov <asml.silence@xxxxxxxxx>
> Date:   Sat Apr 6 15:46:38 2024 +0100
>
>     io_uring/net: introduce single CQE send zc mode
>
>     IORING_OP_SEND[MSG]_ZC requests are posting two completions, one to
>     notify that the data was queued, and later a second, usually referred
>     as "notification", to let the user know that the buffer used can be
>     reused/freed. In some cases the user might not care about the main
>     completion and would be content getting only the notification, which
>     would allow to simplify the userspace.
>
>     One example is when after a send the user would be waiting for the other
>     end to get the message and reply back not pushing any more data in the
>     meantime. Another case is unreliable protocols like UDP, which do not
>     require a confirmation from the other end before dropping buffers, and
>     so the notifications are usually posted shortly after the send request
>     is queued.
>
>     Add a flag merging completions into a single CQE. cqe->res will store
>     the send's result as usual, and it will have IORING_CQE_F_NOTIF set if
>     the buffer was potentially used. Timewise, it would be posted at the
>     moment when the notification would have been originally completed.
>
>     Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx>
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 7bd10201a02b..e2b528c341c9 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -356,6 +356,7 @@ enum io_uring_op {
>  #define IORING_RECV_MULTISHOT		(1U << 1)
>  #define IORING_RECVSEND_FIXED_BUF	(1U << 2)
>  #define IORING_SEND_ZC_REPORT_USAGE	(1U << 3)
> +#define IORING_SEND_ZC_COMBINE_CQE	(1U << 4)
>
>  /*
>   * cqe.res for IORING_CQE_F_NOTIF if
> diff --git a/io_uring/net.c b/io_uring/net.c
> index a74287692071..052f030ab8f8 100644
> --- a/io_uring/net.c
> +++ b/io_uring/net.c
> @@ -992,7 +992,19 @@ void io_send_zc_cleanup(struct io_kiocb *req)
>  	}
>  }
>
> -#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | IORING_RECVSEND_FIXED_BUF)
> +static inline void io_sendzc_adjust_res(struct io_kiocb *req)
> +{
> +	struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
> +
> +	if (sr->flags & IORING_SEND_ZC_COMBINE_CQE) {
> +		sr->notif->cqe.res = req->cqe.res;
> +		req->flags |= REQ_F_CQE_SKIP;
> +	}
> +}
> +
> +#define IO_ZC_FLAGS_COMMON (IORING_RECVSEND_POLL_FIRST | \
> +			    IORING_RECVSEND_FIXED_BUF | \
> +			    IORING_SEND_ZC_COMBINE_CQE)
>  #define IO_ZC_FLAGS_VALID (IO_ZC_FLAGS_COMMON | IORING_SEND_ZC_REPORT_USAGE)
>
>  int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
> @@ -1022,6 +1034,8 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  	if (zc->flags & ~IO_ZC_FLAGS_VALID)
>  		return -EINVAL;
>  	if (zc->flags & IORING_SEND_ZC_REPORT_USAGE) {
> +		if (zc->flags & IORING_SEND_ZC_COMBINE_CQE)
> +			return -EINVAL;
>  		io_notif_set_extended(notif);
>  		io_notif_to_data(notif)->zc_report = true;
>  	}
> @@ -1197,6 +1211,9 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
>  	else if (zc->done_io)
>  		ret = zc->done_io;
>
> +	io_req_set_res(req, ret, IORING_CQE_F_MORE);
> +	io_sendzc_adjust_res(req);
> +
>  	/*
>  	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
>  	 * flushing notif to io_send_zc_cleanup()
> @@ -1205,7 +1222,6 @@ int io_send_zc(struct io_kiocb *req, unsigned int issue_flags)
>  		io_notif_flush(zc->notif);
>  		io_req_msg_cleanup(req, 0);
>  	}
> -	io_req_set_res(req, ret, IORING_CQE_F_MORE);
>  	return IOU_OK;
>  }
>
>  	else if (sr->done_io)
>  		ret = sr->done_io;
>
> +	io_req_set_res(req, ret, IORING_CQE_F_MORE);
> +	io_sendzc_adjust_res(req);
> +
>  	/*
>  	 * If we're in io-wq we can't rely on tw ordering guarantees, defer
>  	 * flushing notif to io_send_zc_cleanup()
> @@ -1266,7 +1285,6 @@ int io_sendmsg_zc(struct io_kiocb *req, unsigned int issue_flags)
>  		io_notif_flush(sr->notif);
>  		io_req_msg_cleanup(req, 0);
>  	}
> -	io_req_set_res(req, ret, IORING_CQE_F_MORE);
>  	return IOU_OK;
>  }
>
> @@ -1278,8 +1296,10 @@ void io_sendrecv_fail(struct io_kiocb *req)
>  		req->cqe.res = sr->done_io;
>
>  	if ((req->flags & REQ_F_NEED_CLEANUP) &&
> -	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
> +	    (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC)) {
>  		req->cqe.flags |= IORING_CQE_F_MORE;
> +		io_sendzc_adjust_res(req);
> +	}
>  }
>
>  int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>
>
> --
> Pavel Begunkov
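
For reference, here is a rough sketch of the userspace "dancing with
counting and checking F_MORE" discussed above, next to what the proposed
single-CQE mode might allow. It is only an illustration under stated
assumptions: struct sketch_cqe, struct zc_send_state and
zc_send_handle_cqe() are hypothetical names, not liburing or kernel API.
The two flag bits use the same values as IORING_CQE_F_MORE and
IORING_CQE_F_NOTIF in the uapi header. The combined branch assumes the
behaviour described in the commit message (one CQE with F_NOTIF set and
the send result in res), which is a proposal, not something to rely on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit values as IORING_CQE_F_MORE and IORING_CQE_F_NOTIF in
 * include/uapi/linux/io_uring.h. */
#define CQE_F_MORE  (1U << 1)
#define CQE_F_NOTIF (1U << 3)

/* Hypothetical stand-in for the relevant fields of struct io_uring_cqe. */
struct sketch_cqe {
	uint64_t user_data;
	int32_t res;
	uint32_t flags;
};

/* Hypothetical per-request bookkeeping kept by the application. */
struct zc_send_state {
	bool combined;     /* submitted with the proposed combine flag */
	bool have_result;  /* send result (bytes or -errno) is known */
	bool buffer_free;  /* buffer may be reused or freed */
	int32_t result;
};

/*
 * Feed one CQE into the state. Returns true once both the send result
 * is known and the buffer can be reused, i.e. the request is finished.
 */
static bool zc_send_handle_cqe(struct zc_send_state *st,
			       const struct sketch_cqe *cqe)
{
	if (cqe->flags & CQE_F_NOTIF) {
		/* Notification: the kernel is done with the buffer. */
		st->buffer_free = true;
		if (st->combined) {
			/* Assumed proposed mode: the only CQE also
			 * carries the send result in res. */
			st->result = cqe->res;
			st->have_result = true;
		}
	} else {
		/* Main completion: res is the send result. */
		st->result = cqe->res;
		st->have_result = true;
		/* Without F_MORE no notification will follow. */
		if (!(cqe->flags & CQE_F_MORE))
			st->buffer_free = true;
	}
	return st->have_result && st->buffer_free;
}
```

With today's two-CQE scheme the application has to keep this state around
until both CQEs have been seen. In the assumed combined mode the first CQE
it sees for the request already finishes it.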