On 7/25/24 11:33, Ming Lei wrote:
> On Wed, Jul 24, 2024 at 02:41:38PM +0100, Pavel Begunkov wrote:
>> On 7/23/24 15:34, Ming Lei wrote:
>> ...
>>> But grp_refs is dropped after the io-wq request reference drops to
>>> zero, so both the io-wq and non-io-wq code paths can be unified
>>> wrt. dealing with grp_refs; meanwhile it needn't be updated
>>> in an extra (io-wq) context.

>> Let's try to describe how it can work. First, I'm only describing
>> the dep mode for simplicity. And for the argument's sake we can say
>> that all CQEs are posted via io_submit_flush_completions.
>>
>> io_req_complete_post() {
>> 	if (flags & GROUP) {
>> 		/* defer to task_work; the CQE will then be posted from
>> 		 * io_submit_flush_completions() like everything else */
>> 		req->io_task_work.func = io_req_task_complete;
>> 		io_req_task_work_add(req);
>> 		return;
>> 	}
>> 	...
>> }
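
For reference, that redirect is all it takes because io_req_task_complete()
feeds the request into the deferred completion path. A sketch from memory
of recent kernels, details vary by version:

io_req_task_complete(req, tw) {
	/* queue the request onto ctx->submit_state.compl_reqs */
	io_req_complete_defer(req);
}
/* ...and compl_reqs is exactly the list that
 * io_submit_flush_completions() walks */
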
> OK.
>
> io_wq_free_work() still needs to change to not deal with
> the next link & ignoring skip_cqe, because group handling
> (cqe posting, link advance) is completely moved into
> io_submit_flush_completions().

No, it doesn't need to know about all that. It has never been
guaranteed that io_req_complete_post() will be the one completing
the request; io_submit_flush_completions() can always happen.

struct io_wq_work *io_wq_free_work(struct io_wq_work *work)
{
	...
	if (req_ref_put_and_test(req)) {
		/* all references are gone: only now do we advance
		 * the link and free the request */
		nxt = io_req_find_next(req);
		io_free_req(req);
	}
	return nxt ? &nxt->work : NULL;
}

We queue linked requests only when all refs are dropped, and
the group handling in my snippet is done before we drop the
owner's reference.
IOW, you won't hit io_free_req() in io_wq_free_work() for a
leader unless all members in its group got completed and
the leader already went through the code dropping those shared
ublk buffers.
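
To spell that ordering out, the leader's final completion runs roughly
like this (a sketch; io_group_drop_bufs() is a made-up name for the
buffer teardown step):

/* executed from io_submit_flush_completions() once grp_refs hit 0 */
io_group_drop_bufs(leader);	/* give the shared ublk buffers back */
req_ref_put(leader);		/* drop the owner reference */
/* only now can req_ref_put_and_test() in io_wq_free_work() observe
 * zero, advance the link and call io_free_req() */
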
>> You can do it this way, nobody would ever care, and it shouldn't
>> affect performance. Otherwise everything down below can probably
>> be extended to io_req_complete_post().
>>
>> To avoid confusion in terminology, what I call a member below doesn't
>> include a leader. IOW, a group leader request is not a member.
>>
>> At the init we have:
>>
>> grp_refs = nr_members; /* doesn't include the leader */
>>
>> Let's also say that the group leader always goes
>> through io_submit_flush_completions() twice, just as it is
>> with your patches.
>>
>> 1) The first time we see the leader in io_submit_flush_completions()
>> is when it's done with resource preparation. For example, it was
>> doing some IO into a buffer, and now is ready to give that buffer
>> with data to group members. At this point it should queue up all group
>> members. And we also drop 1 grp_ref. There will also be no
>> io_issue_sqe() for it anymore.

> Ok, then it is just the case with dependency.

>> 2) Members are executed and completed, in io_submit_flush_completions()
>> they drop 1 grp_leader->grp_ref reference each.
>>
>> 3) When all members complete, leader's grp_ref becomes 0. Here
>> the leader is queued for io_submit_flush_completions() a second time,
>> at which point it drops ublk buffers and such and gets destroyed.
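
To make the counting concrete, here's a small userspace model of steps
1)-3) that compiles and runs. One assumption of mine: grp_refs starts at
nr_members + 1 so that the extra drop in step 1) balances out; the init
line above counts members only.

#include <assert.h>
#include <stdio.h>

struct group_leader {
	int grp_refs;
	int torn_down;
};

static void put_grp_ref(struct group_leader *ldr)
{
	if (--ldr->grp_refs == 0) {
		/* step 3): the leader gets requeued for a second
		 * flush pass, drops ublk buffers, gets destroyed */
		ldr->torn_down = 1;
	}
}

int main(void)
{
	const int nr_members = 3;
	struct group_leader ldr = {
		/* +1 is my assumption: the leader's own prep
		 * reference, dropped in step 1) */
		.grp_refs = nr_members + 1,
	};

	/* step 1): prep is done, members are queued, leader drops 1 */
	put_grp_ref(&ldr);

	/* step 2): each member drops one grp_ref when it completes */
	for (int i = 0; i < nr_members; i++) {
		assert(!ldr.torn_down);	/* teardown never runs early */
		put_grp_ref(&ldr);
	}

	assert(ldr.torn_down);	/* ...and only after the last member */
	printf("leader torn down after all %d members\n", nr_members);
	return 0;
}
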
>> You can post a CQE in 1) and then set CQE_SKIP. Can also be fitted
>> into 3). A pseudo code for when we post it in step 1):

> This way should work, but it confuses the application because
> the leader is completed before all members:
>
> - the leader usually provides group-wide resources
> - members consume these resources
> - the leader is supposed to be completed after all consumers
>   (members) are done
>
> Given it is UAPI, we have to be careful with CQE posting order.

That's exactly what I was saying about the uapi previously:
I don't believe we want to invert the natural CQE order.

Regardless, the order can be inverted like this:

__io_flush_completions() {
	...
	if (req->flags & (SKIP_CQE|GROUP)) {
		if (req->flags & SKIP_CQE)
			continue;
		/* post the leader's CQE only when we see it for the
		 * second time in flush_completions, i.e. after all
		 * members have posted theirs */
		if (is_leader(req) && !(req->flags & ALREADY_SEEN))
			continue;
	}
	post_cqe(req);
}

io_free_batch_list() {
	if (req->flags & GROUP) {
		if (req_is_member(req)) {
			req->grp_leader->grp_refs--;
			if (req->grp_leader->grp_refs == 0) {
				/* last member: requeue the leader for a
				 * second pass through flush_completions */
				req->grp_leader->io_task_work.func = io_req_task_complete;
				io_req_task_work_add(req->grp_leader);
				// can be done better....
			}
			goto free_req;
		}
		WARN_ON_ONCE(!req_is_leader(req));
		if (!(req->flags & ALREADY_SEEN)) {
			// the CQE wasn't posted on this first pass,
			// the second flush will post it once grp_refs
			// hit 0
			req->flags |= ALREADY_SEEN;
			// Don't free the req, we're leaving it alive for now.
			// req->ref/REQ_F_REFCOUNT will be put next time we get here.
			return; // or continue
		}
		clean_up_request_resources(); // calls back into ublk
		// and now free the leader
	}
free_req:
	// the rest of io_free_batch_list()
	if (req->flags & REQ_F_REFCOUNT) {
		req_drop_ref();
		....
	}
	...
}
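
Putting the two functions together, the leader's two passes line up
like this:

/*
 * flush pass 1:  post_cqe() skipped (leader, !ALREADY_SEEN);
 *                io_free_batch_list() sets ALREADY_SEEN and keeps
 *                the request alive.
 * members run:   each drops grp_refs in io_free_batch_list(); the
 *                last one requeues the leader via task_work.
 * flush pass 2:  post_cqe() now fires for the leader (ALREADY_SEEN),
 *                after every member CQE; io_free_batch_list() drops
 *                the ublk resources and frees the request.
 */
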
--
Pavel Begunkov