On 3/13/24 2:26 PM, Pavel Begunkov wrote:
> On 3/13/24 20:25, Jens Axboe wrote:
>> On 3/12/24 3:44 PM, David Wei wrote:
>>> Add an io_uring opcode OP_RECV_ZC for doing ZC reads from a socket that
>>> is set up for ZC Rx. The request reads skbs from a socket. Completions
>>> are posted into the main CQ for each page frag read.
>>>
>>> Big CQEs (CQE32) are required as the OP_RECV_ZC specific metadata (ZC
>>> region, offset, len) are stored in the extended 16 bytes as a
>>> struct io_uring_rbuf_cqe.
>>>
>>> For now there is no limit as to how much work each OP_RECV_ZC request
>>> does. It will attempt to drain a socket of all available data.
>>>
>>> Multishot requests are also supported. The first time an io_recvzc
>>> request completes, EAGAIN is returned, which arms an async poll. Then, in
>>> subsequent runs in task work, IOU_ISSUE_SKIP_COMPLETE is returned to
>>> continue async polling.
>>
>> I'd probably drop that last paragraph, this is how all multishot
>> requests work and is an implementation detail that need not go in the
>> commit message. Probably suffices just to say it supports multishot.
>>
>>> @@ -695,7 +701,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>   	unsigned int cflags;
>>>   	cflags = io_put_kbuf(req, issue_flags);
>>> -	if (msg->msg_inq && msg->msg_inq != -1)
>>> +	if (msg && msg->msg_inq && msg->msg_inq != -1)
>>>   		cflags |= IORING_CQE_F_SOCK_NONEMPTY;
>>>   	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
>>> @@ -723,7 +729,7 @@ static inline bool io_recv_finish(struct io_kiocb *req, int *ret,
>>>   			goto enobufs;
>>>   		/* Known not-empty or unknown state, retry */
>>> -		if (cflags & IORING_CQE_F_SOCK_NONEMPTY || msg->msg_inq == -1) {
>>> +		if (cflags & IORING_CQE_F_SOCK_NONEMPTY || (msg && msg->msg_inq == -1)) {
>>>   			if (sr->nr_multishot_loops++ < MULTISHOT_MAX_RETRY)
>>>   				return false;
>>>   			/* mshot retries exceeded, force a requeue */
>>
>> Maybe refactor this a bit so that you don't need to add these NULL
>> checks? That seems pretty fragile, hard to read, and should be doable
>> without extra checks.
>
> That chunk can be completely thrown away; we're not using
> io_recv_finish() here anymore
>
>
>>> @@ -1053,6 +1058,85 @@ struct io_zc_rx_ifq *io_zc_verify_sock(struct io_kiocb *req,
>>>   	return ifq;
>>>   }
>>> +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>>> +{
>>> +	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> +
>>> +	/* non-iopoll defer_taskrun only */
>>> +	if (!req->ctx->task_complete)
>>> +		return -EINVAL;
>>
>> What's the reasoning behind this?
>
> CQ locking, see the comment a couple lines below
>
>
>>> +	struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc);
>>> +	struct io_zc_rx_ifq *ifq;
>>> +	struct socket *sock;
>>> +	int ret;
>>> +
>>> +	/*
>>> +	 * We're posting CQEs deeper in the stack, and to avoid taking CQ locks
>>> +	 * we serialise by having only the master thread modifying the CQ with
>>> +	 * DEFER_TASkRUN checked earlier and forbidding executing it from io-wq.
>>> +	 * That's similar to io_check_multishot() for multishot CQEs.
>>> +	 */
>
> This one ^^, though it doesn't read well, I should reword it for
> next time.
>
>>> +	if (issue_flags & IO_URING_F_IOWQ)
>>> +		return -EAGAIN;
>>> +	if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_NONBLOCK)))
>>> +		return -EAGAIN;
>>
>> If rebased on the current tree, does this go away?
>
> It's just a little behind, so it doesn't have that change yet.
>

-- 
Jens Axboe
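
A condensed, illustrative sketch of the serialisation scheme discussed above, stitched together from the quoted patch fragments (the task_complete check in io_recvzc_prep() and the io-wq guard in the issue path); this is not the final upstream code, just the two checks with the reasoning spelled out:

    /*
     * OP_RECV_ZC posts CQEs from deep inside the network stack. To avoid
     * taking CQ locks there, CQ updates are serialised to a single task:
     * DEFER_TASKRUN is required at prep time, and execution from io-wq
     * workers is refused at issue time (similar in spirit to
     * io_check_multishot() for multishot CQEs).
     */

    /* prep time: only IORING_SETUP_DEFER_TASKRUN rings qualify */
    if (!req->ctx->task_complete)
        return -EINVAL;

    /* issue time: never run from an io-wq worker */
    if (issue_flags & IO_URING_F_IOWQ)
        return -EAGAIN;
    if (WARN_ON_ONCE(!(issue_flags & IO_URING_F_NONBLOCK)))
        return -EAGAIN;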