On 6/13/24 02:45, Ming Lei wrote:
On Mon, Jun 10, 2024 at 03:53:51AM +0100, Pavel Begunkov wrote:
On 5/11/24 01:12, Ming Lei wrote:
An SQE group is defined as a chain of SQEs starting with the first SQE that
has IOSQE_SQE_GROUP set and ending with the first subsequent SQE that doesn't
have it set; it is similar to a chain of linked SQEs.
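To illustrate, a minimal userspace sketch of one such group (liburing-style;
IOSQE_SQE_GROUP is the flag proposed by this series, and the nops only stand
in for real member operations):

#include <liburing.h>

/* submit one group: a leader plus two members */
static void submit_one_group(struct io_uring *ring)
{
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);		/* leader: flag set, opens the group */
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_SQE_GROUP;

	sqe = io_uring_get_sqe(ring);		/* member: flag still set */
	io_uring_prep_nop(sqe);
	sqe->flags |= IOSQE_SQE_GROUP;

	sqe = io_uring_get_sqe(ring);		/* last member: flag cleared, closes the group */
	io_uring_prep_nop(sqe);

	io_uring_submit(ring);
}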
The main concern stays the same: it adds overhead to nearly every
single hot function I can think of, as well as lots of
complexity.
Almost every sqe group change is guarded by REQ_F_SQE_GROUP, so it isn't
clear to me what the added overhead is.
Yes, and there are a dozen such checks in the hot path.
Another minor issue is REQ_F_INFLIGHT: as explained before,
cancellation has to be able to find all REQ_F_INFLIGHT
requests. Requests you add to a group can have that flag
but are not discoverable by core io_uring code.
OK, we can deal with it by setting REQ_F_INFLIGHT on the leader if the
flag is set for any member, since all members are guaranteed to
be drained when the leader is completed. Will do it in V4.
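For the record, a rough sketch of that propagation (hypothetical helper, to be
called for each member while the group is assembled; io_req_track_inflight()
is the existing io_uring.c helper that sets the flag and bumps the inflight
counter):

static void io_group_propagate_inflight(struct io_kiocb *lead,
					struct io_kiocb *member)
{
	/* make the inflight member discoverable through its leader */
	if (member->flags & REQ_F_INFLIGHT)
		io_req_track_inflight(lead);
}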
Or fail if you see one, that's also fine. REQ_F_INFLIGHT is
only set for POLL requests polling another io_uring.
Another note, I'll be looking deeper into this patch, there
is too much of random tossing around of requests / refcounting
and other dependencies, as well as odd intertwinings with
other parts.
The only thing wrt. request refcounts is for io-wq, since a request
reference is grabbed when the req is handled in io-wq context, and the
group leader needs to be completed after all members are done. That is
the only special change wrt. request refcounting.
I rather mean refcounting the group leader, even if it's not
atomic.
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 7a6b190c7da7..62311b0f0e0b 100644
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index c184c9a312df..b87c5452de43 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
...
}
}
+static inline bool need_queue_group_members(struct io_kiocb *req)
+{
+ return req_is_group_leader(req) && req->grp_link;
+}
+
+/* Can only be called after this request is issued */
+static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
+{
+ if (req->flags & REQ_F_SQE_GROUP) {
+ if (req_is_group_leader(req))
+ return req;
+ return req->grp_link;
I'm missing something: it seems io_group_sqe() adds all
requests of a group into a singly linked list via ->grp_link,
but here we return it as the leader. Confused.
->grp_link stores the singly linked member list for the group leader, and
the same field stores the group leader pointer for group member requests.
We can add a union field to make the code more readable.
Will do that in V4.
So you're repurposing it in io_queue_group_members(). Since
it has a different meaning at different stages of execution,
it warrants a comment (unless there is one I missed).
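E.g. something along these lines (hypothetical field names) would spell out
both roles:

	union {
		/* group leader: head of the singly linked member list */
		struct io_kiocb		*grp_head;
		/* group member, once the group is queued: its leader */
		struct io_kiocb		*grp_leader;
	};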
+ }
+ return NULL;
+}
+
+void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
+{
+ struct io_kiocb *member = req->grp_link;
+
+ while (member) {
+ struct io_kiocb *next = member->grp_link;
+
+ if (ignore_cqes)
+ member->flags |= REQ_F_CQE_SKIP;
+ if (!(member->flags & REQ_F_FAIL)) {
+ req_set_fail(member);
+ io_req_set_res(member, -ECANCELED, 0);
+ }
+ member = next;
+ }
+}
+
+void io_queue_group_members(struct io_kiocb *req, bool async)
+{
+ struct io_kiocb *member = req->grp_link;
+
+ if (!member)
+ return;
+
+ while (member) {
+ struct io_kiocb *next = member->grp_link;
+
+ member->grp_link = req;
+ if (async)
+ member->flags |= REQ_F_FORCE_ASYNC;
+
+ if (unlikely(member->flags & REQ_F_FAIL)) {
+ io_req_task_queue_fail(member, member->cqe.res);
+ } else if (member->flags & REQ_F_FORCE_ASYNC) {
+ io_req_task_queue(member);
+ } else {
+ io_queue_sqe(member);
io_req_queue_tw_complete() please, just like links deal
with it, so it's executed in a well known context without
jumping ahead of other requests.
+ }
+ member = next;
+ }
+ req->grp_link = NULL;
+}
+
+static inline bool __io_complete_group_req(struct io_kiocb *req,
+ struct io_kiocb *lead)
+{
+ WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP));
+
+ if (WARN_ON_ONCE(lead->grp_refs <= 0))
+ return false;
+
+ /*
+ * Mark the linked leader as failed if any member failed, so
+ * the remaining link chain can be terminated
+ */
+ if (unlikely((req->flags & REQ_F_FAIL) &&
+ ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
+ req_set_fail(lead);
+ return !--lead->grp_refs;
+}
+
+/* Complete group request and collect completed leader for freeing */
+static inline void io_complete_group_req(struct io_kiocb *req,
+ struct io_wq_work_list *grp_list)
+{
+ struct io_kiocb *lead = get_group_leader(req);
+
+ if (__io_complete_group_req(req, lead)) {
+ req->flags &= ~REQ_F_SQE_GROUP;
+ lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
+ if (!(lead->flags & REQ_F_CQE_SKIP))
+ io_req_commit_cqe(lead, lead->ctx->lockless_cq);
+
+ if (req != lead) {
+ /*
+ * Add the leader to the free list if it isn't there yet;
+ * otherwise clear the group flag so it is freed in the
+ * current batch
+ */
+ if (!(lead->flags & REQ_F_SQE_GROUP))
+ wq_list_add_tail(&lead->comp_list, grp_list);
+ else
+ lead->flags &= ~REQ_F_SQE_GROUP;
+ }
+ } else if (req != lead) {
+ req->flags &= ~REQ_F_SQE_GROUP;
+ } else {
+ /*
+ * Leader's group flag clearing is delayed until it is
+ * removed from free list
+ */
+ }
+}
+
static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -1427,6 +1545,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
comp_list);
if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
+ /*
+ * The group leader may be removed twice; don't free it
+ * if the group flag isn't cleared yet, i.e. when some
+ * members haven't completed yet
+ */
+ if (req->flags & REQ_F_SQE_GROUP) {
+ node = req->comp_list.next;
+ req->flags &= ~REQ_F_SQE_GROUP;
+ continue;
+ }
+
if (req->flags & REQ_F_REFCOUNT) {
node = req->comp_list.next;
if (!req_ref_put_and_test(req))
@@ -1459,6 +1588,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
__must_hold(&ctx->uring_lock)
{
struct io_submit_state *state = &ctx->submit_state;
+ struct io_wq_work_list grp_list = {NULL};
struct io_wq_work_node *node;
__io_cq_lock(ctx);
@@ -1468,9 +1598,15 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
if (!(req->flags & REQ_F_CQE_SKIP))
io_req_commit_cqe(req, ctx->lockless_cq);
+
+ if (req->flags & REQ_F_SQE_GROUP)
Same note about hot path
+ io_complete_group_req(req, &grp_list);
}
__io_cq_unlock_post(ctx);
+ if (!wq_list_empty(&grp_list))
+ __wq_list_splice(&grp_list, state->compl_reqs.first);
What's the point of splicing it here instead of doing all
that under REQ_F_SQE_GROUP above?
As mentioned, a group leader can't be completed until all members are
done, so any leaders in the current list have to be moved to this
local list for deferred completion. That should be the only tricky
part of the whole sqe group implementation.
+
if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
io_free_batch_list(ctx, state->compl_reqs.first);
INIT_WQ_LIST(&state->compl_reqs);
...
@@ -1863,6 +2012,8 @@ void io_wq_submit_work(struct io_wq_work *work)
}
}
+ if (need_queue_group_members(req))
+ io_queue_group_members(req, true);
do {
ret = io_issue_sqe(req, issue_flags);
if (ret != -EAGAIN)
@@ -1977,6 +2128,9 @@ static inline void io_queue_sqe(struct io_kiocb *req)
*/
if (unlikely(ret))
io_queue_async(req, ret);
+
+ if (need_queue_group_members(req))
+ io_queue_group_members(req, false);
Request ownership is considered to be handed off at this
point and requests should not be touched. Only with ret==0 from
io_issue_sqe is it still ours, but again it's handed off somewhere
by io_queue_async().
Yes, you are right.
And it has been fixed in my local tree:
@@ -2154,8 +2154,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
*/
if (unlikely(ret))
io_queue_async(req, ret);
-
- if (need_queue_group_members(req))
+ else if (need_queue_group_members(req))
io_queue_group_members(req, false);
}
In the else branch you don't own the request anymore
and shouldn't be poking into it.
It looks like you're trying to do io_queue_group_members()
where previously the request would get completed. It's not
the right place, and apart from the whack-a-moled
io_wq_submit_work(), io_poll_issue() is also missed.
__io_submit_flush_completions() / io_free_batch_list() seems
more appropriate, and you already have a chunk with a
GROUP check in there handling the leader appearing
twice.
}
static void io_queue_sqe_fallback(struct io_kiocb *req)
...
@@ -2232,7 +2443,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
const struct io_uring_sqe *sqe)
__must_hold(&ctx->uring_lock)
{
- struct io_submit_link *link = &ctx->submit_state.link;
+ struct io_submit_state *state = &ctx->submit_state;
int ret;
ret = io_init_req(ctx, req, sqe);
@@ -2241,9 +2452,17 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
trace_io_uring_submit_req(req);
- if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
- REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
- req = io_link_sqe(link, req);
+ if (unlikely(state->group.head ||
A note rather to myself and for the future: all these checks,
including links and groups, can be folded under one common if.
Sorry, I don't quite get the idea; can you provide an example?
To be clear, I'm not suggesting that you do it.
Simplifying:
init_req() {
	if (req->flags & (GROUP|LINK))
		ctx->assembling = true;
}

io_submit_sqe() {
	init_req();

	if (ctx->assembling) {
		check_groups/links();
		if (done)
			ctx->assembling = false;
	}
}
We need different logic for group and link, and the group has to be
handled first, before linking, since only the group leader
can be linked.
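For illustration, the rough shape of that ordering in io_submit_sqe() (a
sketch only; io_group_sqe() is the helper added by this patch, and its exact
signature plus the condition details are guessed here):

	/* group assembly first: only the assembled group leader may be linked */
	if (unlikely(state->group.head || (req->flags & REQ_F_SQE_GROUP))) {
		req = io_group_sqe(&state->group, req);
		if (!req)
			return 0;	/* member queued behind its leader */
	}

	if (unlikely(state->link.head || (req->flags & (IO_REQ_LINK_FLAGS |
					REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
		req = io_link_sqe(&state->link, req);
		if (!req)
			return 0;
	}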
--
Pavel Begunkov