On 8/8/24 17:24, Ming Lei wrote:
SQE group is defined as a chain of SQEs starting with the first SQE that
has IOSQE_SQE_GROUP set and ending with the first subsequent SQE that
doesn't have it set, similar to a chain of linked SQEs.
Unlike linked SQEs, where each SQE is issued only after the previous one
completes, all SQEs in one group can be submitted in parallel. To keep the
initial implementation simple, all members are queued only after the leader
completes; this may change later so that the leader and members are issued
concurrently.
The first SQE is the group leader, and the other SQEs are group members. The
whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN taken from the
group leader; these two flags can't be set on group members.
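For illustration, a minimal user-space sketch of submitting such a group
could look like the function below. It assumes IOSQE_SQE_GROUP ends up
defined as the remaining sqe->flags bit (see the note at the end of this
cover letter); the fds, buffers and helper name are placeholders, and error
handling is omitted:

	#include <liburing.h>

	#ifndef IOSQE_SQE_GROUP
	/* assumption: the last free bit of sqe->flags */
	#define IOSQE_SQE_GROUP	(1U << 7)
	#endif

	/* one leader plus two members, issued as a single group of reads */
	static int submit_read_group(struct io_uring *ring, int fd0, int fd1,
				     int fd2, void *b0, void *b1, void *b2,
				     unsigned len)
	{
		struct io_uring_sqe *sqe;

		/* group leader: first SQE with IOSQE_SQE_GROUP set */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd0, b0, len, 0);
		sqe->flags |= IOSQE_SQE_GROUP;

		/* member: keeps IOSQE_SQE_GROUP set */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd1, b1, len, 0);
		sqe->flags |= IOSQE_SQE_GROUP;

		/* last member: first SQE without the flag, terminates the group */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd2, b2, len, 0);

		return io_uring_submit(ring);
	}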
When the group is part of a link chain, the group isn't submitted until the
previous SQE or group has completed, and the following SQE or group can't be
started until this group has completed. Failure of any group member fails the
group leader, so the link chain can be terminated.
When IOSQE_IO_DRAIN is set on the group leader, all requests in this group
and all previously submitted requests are drained. Since IOSQE_IO_DRAIN can
be set on the group leader only, IO_DRAIN is respected by always completing
the group leader as the last request in the group. This also makes it
natural, from the application's viewpoint, for the leader's CQE to be posted
last.
Working together with IOSQE_IO_LINK, SQE groups provide a flexible way to
support N:M dependencies, for example:
- group A is chained with group B
- group A has N SQEs
- group B has M SQEs
then the M SQEs in group B depend on the N SQEs in group A (see the sketch
below).
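As a rough sketch (same assumptions and placeholders as above, error handling
omitted), a 2:2 chain of two reads followed by two writes could be set up
like this, with the whole of group B waiting for the whole of group A:

	static int submit_2x2_chain(struct io_uring *ring, int fd_src0,
				    int fd_src1, int fd_dst0, int fd_dst1,
				    void *buf0, void *buf1, unsigned len)
	{
		struct io_uring_sqe *sqe;

		/* group A (reads); the leader's IOSQE_IO_LINK chains group B
		 * after the whole of group A */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd_src0, buf0, len, 0);
		sqe->flags |= IOSQE_SQE_GROUP | IOSQE_IO_LINK;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, fd_src1, buf1, len, 0);
		/* no IOSQE_SQE_GROUP here: this SQE terminates group A */

		/* group B (writes); started only after group A has completed */
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_write(sqe, fd_dst0, buf0, len, 0);
		sqe->flags |= IOSQE_SQE_GROUP;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_write(sqe, fd_dst1, buf1, len, 0);
		/* no flags: last member of group B, end of the chain */

		return io_uring_submit(ring);
	}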
N:M dependencies enable some interesting use cases in an efficient way:
1) read from multiple files, then write the read data into a single file
2) read from a single file, then write the read data into multiple files
3) write the same data into multiple files, then read it back from those
files and verify that the correct data was written
Also, IOSQE_SQE_GROUP takes the last bit in sqe->flags, but sqe->flags can
still be extended behind a uring context flag, for example by using __pad3
for non-uring_cmd opcodes and part of uring_cmd_flags for the uring_cmd
opcode.
Suggested-by: Kevin Wolf <kwolf@xxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
---
include/linux/io_uring_types.h | 18 +++
...
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index f112e9fa90d8..45a292445b18 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
...
@@ -875,6 +877,116 @@ static __always_inline void io_req_commit_cqe(struct io_ring_ctx *ctx,
}
}
...
+static void io_queue_group_members(struct io_kiocb *req)
+{
+ struct io_kiocb *member = req->grp_link;
+
+ if (!member)
+ return;
+
+ req->grp_link = NULL;
+ while (member) {
+ struct io_kiocb *next = member->grp_link;
+
+ member->grp_leader = req;
+ if (unlikely(member->flags & REQ_F_FAIL)) {
+ io_req_task_queue_fail(member, member->cqe.res);
+ } else if (unlikely(req->flags & REQ_F_FAIL)) {
+ io_req_task_queue_fail(member, -ECANCELED);
+ } else {
+ io_req_task_queue(member);
+ }
+ member = next;
+ }
+}
+
+static inline bool __io_complete_group_member(struct io_kiocb *req,
+ struct io_kiocb *lead)
+{
I think it'd be better if you inlined this function; as it stands, it only
obfuscates things.
+ if (WARN_ON_ONCE(lead->grp_refs <= 0))
+ return false;
+
+ req->flags &= ~REQ_F_SQE_GROUP;
I'm getting completely lost on when and why it clears and then sets back
REQ_F_SQE_GROUP and REQ_F_SQE_GROUP_LEADER. Is there any rule?
+ /*
+ * Set linked leader as failed if any member is failed, so
+ * the remained link chain can be terminated
+ */
+ if (unlikely((req->flags & REQ_F_FAIL) &&
+ ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
+ req_set_fail(lead);
if (req->flags & REQ_F_FAIL)
req_set_fail(lead);
REQ_F_FAIL is not specific to links; if a request fails, we need
to mark it as such.
+ return !--lead->grp_refs;
+}
+
+static inline bool leader_is_the_last(struct io_kiocb *lead)
+{
+ return lead->grp_refs == 1 && (lead->flags & REQ_F_SQE_GROUP);
+}
+
+static void io_complete_group_member(struct io_kiocb *req)
+{
+ struct io_kiocb *lead = get_group_leader(req);
+
+ if (WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP)))
+ return;
+
+ /* member CQE needs to be posted first */
+ if (!(req->flags & REQ_F_CQE_SKIP))
+ io_req_commit_cqe(req->ctx, req);
+
+ if (__io_complete_group_member(req, lead)) {
+ /*
+ * SQE_GROUP flag is kept for the last member, so the leader
+ * can be retrieved & freed from this last member
+ */
+ req->flags |= REQ_F_SQE_GROUP;
+ if (!(lead->flags & REQ_F_CQE_SKIP))
+ io_req_commit_cqe(lead->ctx, lead);
+ } else if (leader_is_the_last(lead)) {
+ /* leader will degenerate to plain req if it is the last */
+ lead->flags &= ~(REQ_F_SQE_GROUP | REQ_F_SQE_GROUP_LEADER);
What is this chunk about?
+ }
+}
+
+static void io_complete_group_leader(struct io_kiocb *req)
+{
+ WARN_ON_ONCE(req->grp_refs <= 0);
+ req->flags &= ~REQ_F_SQE_GROUP;
+ req->grp_refs -= 1;
+ WARN_ON_ONCE(req->grp_refs == 0);
Why not combine these WARN_ON_ONCE into one?
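E.g. a single (untested) check before the decrement would cover both
conditions:

	WARN_ON_ONCE(req->grp_refs <= 1);
	req->flags &= ~REQ_F_SQE_GROUP;
	req->grp_refs -= 1;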
+
+ /* TODO: queue members with leader in parallel */
No TODOs, please.
+ if (req->grp_link)
+ io_queue_group_members(req);
+}
This runs under the spinlock; we really don't want to do too much here,
like potentially queueing a ton of task work items.
io_queue_group_members() can move into io_free_batch_list().
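Roughly (an untested sketch, assuming the leader still carries ->grp_link
by the time io_free_batch_list() runs, which is after the CQ lock has been
dropped), the group-leader branch there could do the queueing itself:

	/* leader is freed via the last member; queue its members here,
	 * outside the completion lock */
	if (req_is_group_leader(req)) {
		io_queue_group_members(req);
		node = req->comp_list.next;
		continue;
	}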
+
static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
{
struct io_ring_ctx *ctx = req->ctx;
@@ -890,7 +1002,8 @@ static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
* Handle special CQ sync cases via task_work. DEFER_TASKRUN requires
* the submitter task context, IOPOLL protects with uring_lock.
*/
- if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)) {
+ if (ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL) ||
+ req_is_group_leader(req)) {
req->io_task_work.func = io_req_task_complete;
io_req_task_work_add(req);
return;
@@ -1388,11 +1501,33 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
comp_list);
if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
+ if (req->flags & (REQ_F_SQE_GROUP |
+ REQ_F_SQE_GROUP_LEADER)) {
+ struct io_kiocb *leader;
+
+ /* Leader is freed via the last member */
+ if (req_is_group_leader(req)) {
+ node = req->comp_list.next;
+ continue;
+ }
+
+ /*
+ * Only the last member keeps GROUP flag,
+ * free leader and this member together
+ */
+ leader = get_group_leader(req);
+ leader->flags &= ~REQ_F_SQE_GROUP_LEADER;
+ req->flags &= ~REQ_F_SQE_GROUP;
+ wq_stack_add_head(&leader->comp_list,
+ &req->comp_list);
That's quite hacky, but at least we can replace it with
task work if it gets in the way later on.
+ }
+
if (req->flags & REQ_F_REFCOUNT) {
node = req->comp_list.next;
if (!req_ref_put_and_test(req))
continue;
}
+
if ((req->flags & REQ_F_POLLED) && req->apoll) {
struct async_poll *apoll = req->apoll;
@@ -1427,8 +1562,19 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
struct io_kiocb *req = container_of(node, struct io_kiocb,
comp_list);
- if (!(req->flags & REQ_F_CQE_SKIP))
- io_req_commit_cqe(ctx, req);
+ if (unlikely(req->flags & (REQ_F_CQE_SKIP | REQ_F_SQE_GROUP))) {
+ if (req->flags & REQ_F_SQE_GROUP) {
+ if (req_is_group_leader(req))
+ io_complete_group_leader(req);
+ else
+ io_complete_group_member(req);
+ continue;
+ }
+
+ if (req->flags & REQ_F_CQE_SKIP)
+ continue;
+ }
+ io_req_commit_cqe(ctx, req);
}
__io_cq_unlock_post(ctx);
...
@@ -2101,6 +2251,62 @@ static int io_init_req(struct io_ring_ctx *ctx, struct io_kiocb *req,
return def->prep(req, sqe);
}
+static struct io_kiocb *io_group_sqe(struct io_submit_link *group,
+ struct io_kiocb *req)
+{
+ /*
+ * Group chain is similar with link chain: starts with 1st sqe with
+ * REQ_F_SQE_GROUP, and ends with the 1st sqe without REQ_F_SQE_GROUP
+ */
+ if (group->head) {
+ struct io_kiocb *lead = group->head;
+
+ /* members can't be in link chain, can't be drained */
+ if (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_IO_DRAIN))
+ req_fail_link_node(lead, -EINVAL);
That should fail the entire link (if any) as well.
I have even more doubts that we want to mix links and groups at all. Apart
from nuances like this one, which would be quite hard to track, the semantics
of IOSQE_CQE_SKIP_SUCCESS are unclear, and it also doesn't work with
IORING_OP_LINK_TIMEOUT.
+
+ lead->grp_refs += 1;
+ group->last->grp_link = req;
+ group->last = req;
+
+ if (req->flags & REQ_F_SQE_GROUP)
+ return NULL;
+
+ req->grp_link = NULL;
+ req->flags |= REQ_F_SQE_GROUP;
+ group->head = NULL;
+ if (lead->flags & REQ_F_FAIL) {
+ io_queue_sqe_fallback(lead);
Say the group was in the middle of a link: it'll complete that group and
continue assembling / executing the link, when it should have failed it
and honoured the request order.
+ return NULL;
+ }
+ return lead;
+ } else if (req->flags & REQ_F_SQE_GROUP) {
+ group->head = req;
+ group->last = req;
+ req->grp_refs = 1;
+ req->flags |= REQ_F_SQE_GROUP_LEADER;
+ return NULL;
+ } else {
We shouldn't be able to get here.
if (WARN_ONCE(!(req->flags & GROUP)))
...
group->head = req;
...
+ return req;
+ }
+}
+
+static __cold struct io_kiocb *io_submit_fail_group(
+ struct io_submit_link *link, struct io_kiocb *req)
+{
+ struct io_kiocb *lead = link->head;
+
+ /*
+ * Instead of failing eagerly, continue assembling the group link
+ * if applicable and mark the leader with REQ_F_FAIL. The group
+ * flushing code should find the flag and handle the rest
+ */
+ if (lead && (lead->flags & IO_REQ_LINK_FLAGS) && !(lead->flags & REQ_F_FAIL))
+ req_fail_link_node(lead, -ECANCELED);
Same as above: you don't need to check for IO_REQ_LINK_FLAGS.
io_queue_sqe_fallback()
+
+ return io_group_sqe(link, req);
+}
+
--
Pavel Begunkov