Re: [PATCH V3 5/9] io_uring: support SQE group

On Sun, Jun 16, 2024 at 08:13:26PM +0100, Pavel Begunkov wrote:
> On 6/13/24 02:45, Ming Lei wrote:
> > On Mon, Jun 10, 2024 at 03:53:51AM +0100, Pavel Begunkov wrote:
> > > On 5/11/24 01:12, Ming Lei wrote:
> > > > SQE group is defined as one chain of SQEs starting with the first SQE that
> > > > has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
> > > > doesn't have it set, and it is similar with chain of linked SQEs.
> > > 
> > > The main concern stays same, it adds overhead nearly to every
> > > single hot function I can think of, as well as lots of
> > > complexity.
> > 
> > Almost every sqe group change is covered by REQ_F_SQE_GROUP, so I am
> > not clear what the added overhead is.
> 
> Yes, and there is a dozen of such in the hot path.

req->flags should already be L1-cached in all these hot paths, and the
check is basically zero cost, so SQE_GROUP shouldn't add extra cost to
the existing io_uring core code path.

> 
> > > Another minor issue is REQ_F_INFLIGHT, as explained before,
> > > cancellation has to be able to find all REQ_F_INFLIGHT
> > > requests. Requests you add to a group can have that flag
> > > but are not discoverable by core io_uring code.
> > 
> > OK, we can deal with it by setting leader as REQ_F_INFLIGHT if the
> > flag is set for any member, since all members are guaranteed to
> > be drained when leader is completed. Will do it in V4.
> 
> Or fail if see one, that's also fine. REQ_F_INFLIGHT is
> only set for POLL requests polling another io_uring.

It is also set for read-write/tee/splice ops on a normal file, so it
doesn't look safe to fail here.

> 
> > > Another note, I'll be looking deeper into this patch, there
> > > is too much of random tossing around of requests / refcounting
> > > and other dependencies, as well as odd intertwinings with
> > > other parts.
> > 
> > The only thing wrt. request refcount is for io-wq, since request
> > reference is grabbed when the req is handled in io-wq context, and
> > group leader need to be completed after all members are done. That
> > is all special change wrt. request refcounting.
> 
> I rather mean refcounting the group leader, even if it's not
> atomic.

If you mean reusing req->refs for refcounting the group leader, that may
not work, because a member can complete from io-wq while the leader may
not. Meanwhile, a dedicated ->grp_refs actually simplifies things a lot.
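
To make the lifetime rule concrete, here is a minimal sketch of how I
think about ->grp_refs (helper names are made up for illustration, and
I'm assuming the leader starts with one reference per member plus one
for itself):

        /*
         * Sketch only, not the exact patch code: the leader owns the
         * group lifetime via ->grp_refs, independent of req->refs,
         * which io-wq may still grab for the individual request.
         */
        static void io_group_init_refs(struct io_kiocb *lead, int nr_members)
        {
                /* one reference per member plus one for the leader itself */
                lead->grp_refs = nr_members + 1;
        }

        /* dropped once when each member completes and once for the leader */
        static bool io_group_put_ref(struct io_kiocb *lead)
        {
                /* true when the leader itself may finally be completed */
                return !--lead->grp_refs;
        }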

> 
> > > > diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> > > > index 7a6b190c7da7..62311b0f0e0b 100644
> > > > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> > > > index c184c9a312df..b87c5452de43 100644
> > > > --- a/io_uring/io_uring.c
> > > > +++ b/io_uring/io_uring.c
> ...
> > > >    	}
> > > >    }
> > > > +static inline bool need_queue_group_members(struct io_kiocb *req)
> > > > +{
> > > > +	return req_is_group_leader(req) && req->grp_link;
> > > > +}
> > > > +
> > > > +/* Can only be called after this request is issued */
> > > > +static inline struct io_kiocb *get_group_leader(struct io_kiocb *req)
> > > > +{
> > > > +	if (req->flags & REQ_F_SQE_GROUP) {
> > > > +		if (req_is_group_leader(req))
> > > > +			return req;
> > > > +		return req->grp_link;
> > > 
> > > I'm missing something, it seems io_group_sqe() adding all
> > > requests of a group into a singly linked list via ->grp_link,
> > > but here we return it as a leader. Confused.
> > 
> > ->grp_link stores the singly linked list for group leader, and
> > the same field stores the group leader pointer for group member requests.
> > For later, we can add one union field to make code more readable.
> > Will do that in V4.
> 
> So you're repurposing it in io_queue_group_members(). Since
> it has different meaning at different stages of execution,
> it warrants a comment (unless there is one I missed).

OK, I will either add a comment or another union field for it.
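
Something like the following is what I have in mind for V4, just as a
sketch (the member-side field name is tentative):

        /* in struct io_kiocb: same storage, two roles */
        union {
                /* group leader: head of the singly linked member list */
                struct io_kiocb         *grp_link;
                /* group member: points back to the leader once queued */
                struct io_kiocb         *grp_leader;
        };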

> 
> > > > +	}
> > > > +	return NULL;
> > > > +}
> > > > +
> > > > +void io_cancel_group_members(struct io_kiocb *req, bool ignore_cqes)
> > > > +{
> > > > +	struct io_kiocb *member = req->grp_link;
> > > > +
> > > > +	while (member) {
> > > > +		struct io_kiocb *next = member->grp_link;
> > > > +
> > > > +		if (ignore_cqes)
> > > > +			member->flags |= REQ_F_CQE_SKIP;
> > > > +		if (!(member->flags & REQ_F_FAIL)) {
> > > > +			req_set_fail(member);
> > > > +			io_req_set_res(member, -ECANCELED, 0);
> > > > +		}
> > > > +		member = next;
> > > > +	}
> > > > +}
> > > > +
> > > > +void io_queue_group_members(struct io_kiocb *req, bool async)
> > > > +{
> > > > +	struct io_kiocb *member = req->grp_link;
> > > > +
> > > > +	if (!member)
> > > > +		return;
> > > > +
> > > > +	while (member) {
> > > > +		struct io_kiocb *next = member->grp_link;
> > > > +
> > > > +		member->grp_link = req;
> > > > +		if (async)
> > > > +			member->flags |= REQ_F_FORCE_ASYNC;
> > > > +
> > > > +		if (unlikely(member->flags & REQ_F_FAIL)) {
> > > > +			io_req_task_queue_fail(member, member->cqe.res);
> > > > +		} else if (member->flags & REQ_F_FORCE_ASYNC) {
> > > > +			io_req_task_queue(member);
> > > > +		} else {
> > > > +			io_queue_sqe(member);
> 
> io_req_queue_tw_complete() please, just like links deal
> with it, so it's executed in a well known context without
> jumping ahead of other requests.

For plain SQE_GROUP, members don't need to be deferred until the leader
completes before being queued; deferring them that way can hurt
performance.

> 
> > > > +		}
> > > > +		member = next;
> > > > +	}
> > > > +	req->grp_link = NULL;
> > > > +}
> > > > +
> > > > +static inline bool __io_complete_group_req(struct io_kiocb *req,
> > > > +			     struct io_kiocb *lead)
> > > > +{
> > > > +	WARN_ON_ONCE(!(req->flags & REQ_F_SQE_GROUP));
> > > > +
> > > > +	if (WARN_ON_ONCE(lead->grp_refs <= 0))
> > > > +		return false;
> > > > +
> > > > +	/*
> > > > +	 * Set linked leader as failed if any member is failed, so
> > > > +	 * the remained link chain can be terminated
> > > > +	 */
> > > > +	if (unlikely((req->flags & REQ_F_FAIL) &&
> > > > +		     ((lead->flags & IO_REQ_LINK_FLAGS) && lead->link)))
> > > > +		req_set_fail(lead);
> > > > +	return !--lead->grp_refs;
> > > > +}
> > > > +
> > > > +/* Complete group request and collect completed leader for freeing */
> > > > +static inline void io_complete_group_req(struct io_kiocb *req,
> > > > +		struct io_wq_work_list *grp_list)
> > > > +{
> > > > +	struct io_kiocb *lead = get_group_leader(req);
> > > > +
> > > > +	if (__io_complete_group_req(req, lead)) {
> > > > +		req->flags &= ~REQ_F_SQE_GROUP;
> > > > +		lead->flags &= ~REQ_F_SQE_GROUP_LEADER;
> > > > +		if (!(lead->flags & REQ_F_CQE_SKIP))
> > > > +			io_req_commit_cqe(lead, lead->ctx->lockless_cq);
> > > > +
> > > > +		if (req != lead) {
> > > > +			/*
> > > > +			 * Add leader to free list if it isn't there
> > > > +			 * otherwise clearing group flag for freeing it
> > > > +			 * in current batch
> > > > +			 */
> > > > +			if (!(lead->flags & REQ_F_SQE_GROUP))
> > > > +				wq_list_add_tail(&lead->comp_list, grp_list);
> > > > +			else
> > > > +				lead->flags &= ~REQ_F_SQE_GROUP;
> > > > +		}
> > > > +	} else if (req != lead) {
> > > > +		req->flags &= ~REQ_F_SQE_GROUP;
> > > > +	} else {
> > > > +		/*
> > > > +		 * Leader's group flag clearing is delayed until it is
> > > > +		 * removed from free list
> > > > +		 */
> > > > +	}
> > > > +}
> > > > +
> > > >    static void io_req_complete_post(struct io_kiocb *req, unsigned issue_flags)
> > > >    {
> > > >    	struct io_ring_ctx *ctx = req->ctx;
> > > > @@ -1427,6 +1545,17 @@ static void io_free_batch_list(struct io_ring_ctx *ctx,
> > > >    						    comp_list);
> > > >    		if (unlikely(req->flags & IO_REQ_CLEAN_SLOW_FLAGS)) {
> > > > +			/*
> > > > +			 * Group leader may be removed twice, don't free it
> > > > +			 * if group flag isn't cleared, when some members
> > > > +			 * aren't completed yet
> > > > +			 */
> > > > +			if (req->flags & REQ_F_SQE_GROUP) {
> > > > +				node = req->comp_list.next;
> > > > +				req->flags &= ~REQ_F_SQE_GROUP;
> > > > +				continue;
> > > > +			}
> > > > +
> > > >    			if (req->flags & REQ_F_REFCOUNT) {
> > > >    				node = req->comp_list.next;
> > > >    				if (!req_ref_put_and_test(req))
> > > > @@ -1459,6 +1588,7 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> > > >    	__must_hold(&ctx->uring_lock)
> > > >    {
> > > >    	struct io_submit_state *state = &ctx->submit_state;
> > > > +	struct io_wq_work_list grp_list = {NULL};
> > > >    	struct io_wq_work_node *node;
> > > >    	__io_cq_lock(ctx);
> > > > @@ -1468,9 +1598,15 @@ void __io_submit_flush_completions(struct io_ring_ctx *ctx)
> > > >    		if (!(req->flags & REQ_F_CQE_SKIP))
> > > >    			io_req_commit_cqe(req, ctx->lockless_cq);
> > > > +
> > > > +		if (req->flags & REQ_F_SQE_GROUP)
> > > 
> > > Same note about hot path
> > > 
> > > > +			io_complete_group_req(req, &grp_list);
> > > >    	}
> > > >    	__io_cq_unlock_post(ctx);
> > > > +	if (!wq_list_empty(&grp_list))
> > > > +		__wq_list_splice(&grp_list, state->compl_reqs.first);
> > > 
> > > What's the point of splicing it here insted of doing all
> > > that under REQ_F_SQE_GROUP above?
> > 
> > As mentioned, group leader can't be completed until all members are
> > done, so any leaders in the current list have to be moved to this
> > local list for deferred completion. That should be the only tricky
> > part of the whole sqe group implementation.
> > 
> > > 
> > > > +
> > > >    	if (!wq_list_empty(&ctx->submit_state.compl_reqs)) {
> > > >    		io_free_batch_list(ctx, state->compl_reqs.first);
> > > >    		INIT_WQ_LIST(&state->compl_reqs);
> ...
> > > > @@ -1863,6 +2012,8 @@ void io_wq_submit_work(struct io_wq_work *work)
> > > >    		}
> > > >    	}
> > > > +	if (need_queue_group_members(req))
> > > > +		io_queue_group_members(req, true);
> > > >    	do {
> > > >    		ret = io_issue_sqe(req, issue_flags);
> > > >    		if (ret != -EAGAIN)
> > > > @@ -1977,6 +2128,9 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> > > >    	 */
> > > >    	if (unlikely(ret))
> > > >    		io_queue_async(req, ret);
> > > > +
> > > > +	if (need_queue_group_members(req))
> > > > +		io_queue_group_members(req, false);
> > > 
> > > Request ownership is considered to be handed further at this
> > > point and requests should not be touched. Only ret==0 from
> > > io_issue_sqe it's still ours, but again it's handed somewhere
> > > by io_queue_async().
> > 
> > Yes, you are right.
> > 
> > And it has been fixed in my local tree:
> > 
> > @@ -2154,8 +2154,7 @@ static inline void io_queue_sqe(struct io_kiocb *req)
> >           */
> >          if (unlikely(ret))
> >                  io_queue_async(req, ret);
> > -
> > -       if (need_queue_group_members(req))
> > +       else if (need_queue_group_members(req))
> >                  io_queue_group_members(req, false);
> >   }
> 
> In the else branch you don't own the request anymore
> and shouldn't be poking into it.

In theory, yes, but currently no request is freed before io_queue_sqe()
returns, and this needs to be commented carefully.

> 
> It looks like you're trying to do io_queue_group_members()
> when previously the request would get completed. It's not

That is only true for REQ_F_SQE_GROUP_DEP; there is no such dependency
for plain SQE_GROUP.

> the right place, and apart from whack'a'moled
> io_wq_submit_work() there is also io_poll_issue() missed.
> 
> Seems __io_submit_flush_completions() / io_free_batch_list()
> would be more appropriate, and you already have a chunk with
> GROUP check in there handling the leader appearing in there
> twice.

As mentioned, we need to queue members together with the leader if
there isn't a dependency among them.
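
Roughly the issue-side behaviour I mean, as a sketch only
(REQ_F_SQE_GROUP_DEP is the dependency flag mentioned above, and the
exact call site may still move):

        /*
         * Sketch: plain groups issue members right after the leader is
         * issued; only groups with a declared dependency defer members
         * until the leader has completed.
         */
        if (req_is_group_leader(req) && req->grp_link) {
                if (req->flags & REQ_F_SQE_GROUP_DEP)
                        return;         /* members queued on leader completion */
                io_queue_group_members(req, false);
        }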

> 
> 
> > > >    }
> > > >    static void io_queue_sqe_fallback(struct io_kiocb *req)
> ...
> > > > @@ -2232,7 +2443,7 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > >    			 const struct io_uring_sqe *sqe)
> > > >    	__must_hold(&ctx->uring_lock)
> > > >    {
> > > > -	struct io_submit_link *link = &ctx->submit_state.link;
> > > > +	struct io_submit_state *state = &ctx->submit_state;
> > > >    	int ret;
> > > >    	ret = io_init_req(ctx, req, sqe);
> > > > @@ -2241,9 +2452,17 @@ static inline int io_submit_sqe(struct io_ring_ctx *ctx, struct io_kiocb *req,
> > > >    	trace_io_uring_submit_req(req);
> > > > -	if (unlikely(link->head || (req->flags & (IO_REQ_LINK_FLAGS |
> > > > -				    REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
> > > > -		req = io_link_sqe(link, req);
> > > > +	if (unlikely(state->group.head ||
> > > 
> > > A note rather to myself and for the future, all theese checks
> > > including links and groups can be folded under one common if.
> > 
> > Sorry, I may not get the idea, can you provide one example?
> 
> To be clear, not suggesting you doing it.
> 
> Simplifying:
> 
> init_req() {
> 	if (req->flags & GROUP|LINK) {
> 		ctx->assembling;
> 	}
> }
> 
> io_submit_sqe() {
> 	init_req();
> 
> 	if (ctx->assembling) {
> 		check_groups/links();
> 		if (done);
> 			ctx->assembling = false;
> 	}
> }

OK, I can work toward this approach. It just replaces the checks on
group.head/link.head and the link/group flags with ->assembling, at the
cost of setting ctx->assembling.
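
Concretely, something along these lines (a sketch only; ->assembling
would be a new submit_state field, and io_handle_link_group() /
assembly_done() are made-up names for whatever the combined link/group
handling ends up being called):

        /* io_init_req() tail: the only extra hot-path cost is one store */
        if (req->flags & (IO_REQ_LINK_FLAGS | REQ_F_SQE_GROUP))
                ctx->submit_state.assembling = true;

        /*
         * io_submit_sqe(): one cold branch instead of checking
         * link->head, group.head and the per-request flags separately.
         */
        if (unlikely(ctx->submit_state.assembling)) {
                req = io_handle_link_group(state, req);
                if (assembly_done(state))
                        ctx->submit_state.assembling = false;
                if (!req)
                        return 0;
        }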



Thanks,
Ming




