Re: [PATCH 5/9] io_uring: support SQE group

Ming Lei <ming.lei@xxxxxxxxxx> · Fri, 26 Apr 2024 15:53:41 +0800

On Thu, Apr 25, 2024 at 11:27:32AM +0200, Kevin Wolf wrote:
> Am 24.04.2024 um 03:39 hat Ming Lei geschrieben:
> > On Tue, Apr 23, 2024 at 03:08:55PM +0200, Kevin Wolf wrote:
> > > Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> > > > On 4/7/24 7:03 PM, Ming Lei wrote:
> > > > > SQE group is defined as one chain of SQEs starting with the first sqe that
> > > > > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > > > > doesn't have it set, and it is similar with chain of linked sqes.
> > > > > 
> > > > > The 1st SQE is group leader, and the other SQEs are group member. The group
> > > > > leader is always freed after all members are completed. Group members
> > > > > aren't submitted until the group leader is completed, and there isn't any
> > > > > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > > > > members, same with IOSQE_IO_DRAIN.
> > > > > 
> > > > > Typically the group leader provides or makes resource, and the other members
> > > > > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > > > > read data from source file into fixed buffer, the other SQEs write data from
> > > > > the same buffer into other destination files. SQE group provides very
> > > > > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > > > > submitted in single syscall, no need to submit fs read SQE first, and wait
> > > > > until read SQE is completed, 2) no need to link all write SQEs together, then
> > > > > write SQEs can be submitted to files concurrently. Meantime application is
> > > > > simplified a lot in this way.
> > > > > 
> > > > > Another use case is to for supporting generic device zero copy:
> > > > > 
> > > > > - the lead SQE is for providing device buffer, which is owned by device or
> > > > >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> > > > >   application or panic
> > > > > 
> > > > > - member SQEs reads or writes concurrently against the buffer provided by lead
> > > > >   SQE
> > > > 
> > > > In concept, this looks very similar to "sqe bundles" that I played with
> > > > in the past:
> > > > 
> > > > https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> > > > 
> > > > Didn't look too closely yet at the implementation, but in spirit it's
> > > > about the same in that the first entry is processed first, and there's
> > > > no ordering implied between the test of the members of the bundle /
> > > > group.
> > > 
> > > When I first read this patch, I wondered if it wouldn't make sense to
> > > allow linking a group with subsequent requests, e.g. first having a few
> > > requests that run in parallel and once all of them have completed
> > > continue with the next linked one sequentially.
> > > 
> > > For SQE bundles, you reused the LINK flag, which doesn't easily allow
> > > this. Ming's patch uses a new flag for groups, so the interface would be
> > > more obvious, you simply set the LINK flag on the last member of the
> > > group (or on the leader, doesn't really matter). Of course, this doesn't
> > > mean it has to be implemented now, but there is a clear way forward if
> > > it's wanted.
> > 
> > Reusing LINK for bundle breaks existed link chains (BUNDLE linked to
> > existed link chain), so I think it may not work.
> 
> You can always extend things *somehow*, but it wouldn't fit very
> naturally. That's why I feel your approach on this detail is a little
> better.

Linking group in traditionally way is real use case, please see
ublk-nbd's zero copy implementation.

https://github.com/ublk-org/ublksrv/blob/group-provide-buf/nbd/tgt_nbd.cpp

> 
> > The link rule is explicit for sqe group:
> > 
> > - only group leader can set link flag, which is applied on the whole
> > group: the next sqe in the link chain won't be started until the
> > previous linked sqe group is completed
> > 
> > - link flag can't be set for group members
> > 
> > Also sqe group doesn't limit async for both group leader and member.
> > 
> > sqe group vs link & async is covered in the last liburing test code.
> 
> Oh right, I didn't actually notice that you already implement what I
> proposed!
> 
> I was expecting the flag on the last SQE and I saw in the code that this
> isn't allowed, but I completely missed your comment that explicitly
> states that it's the group leader that gets the link flag. Of course,
> this is just as good.
> 
> > > The part that looks a bit arbitrary in Ming's patch is that the group
> > > leader is always completed before the rest starts. It makes perfect
> > > sense in the context that this series is really after (enabling zero
> > > copy for ublk), but it doesn't really allow the case you mention in the
> > > SQE bundle commit message, running everything in parallel and getting a
> > > single CQE for the whole group.
> > 
> > I think it should be easy to cover bundle in this way, such as add one
> > new op IORING_OP_BUNDLE as Jens did, and implement the single CQE for
> > whole group/bundle.
> 
> This requires an extra SQE compared to just creating the group with
> flags, but I suppose this is not a big problem. An alternative might be
> sending the CQE for the group leader only after the whole group has
> completed if we're okay with userspace never knowing when the leader
> itself completed.
> 
> However, assuming an IORING_OP_BUNDLE command, if this command only
> completes after the whole group, doesn't that conflict with the

Here the completion means committing CQE to userspace ring.

> principle that all other commands are only started after the first one
> has completed?

I meant IORING_OP_BUNDLE is the group leader, and the first one is the
the leader.

The member requests won't be started until the leader is completed, and
here the completion means that the request is completed from subsystem
(FS, netowork, ...), so there isn't conflict, but yes, we need to
describe the whole ideas/terms more carefully.

> 
> Maybe we shouldn't wait for the whole group leader request to complete,
> but just give the group leader a chance to prepare the group before all
> requests in the group (including the leader itself) are run in parallel.
> Maybe io_issue_sqe() could just start the rest of the group somewhere
> after calling def->issue() for the leader. Then you can't prepare the
> group buffer asynchronously, but I don't think this is needed, right?

That isn't true, if leader request is one network RX, we need to wait
until the recv is done, then the following member requests can be
started for consuming the received data.

Same example with the multiple copy one in last patch.

> 
> Your example with one read followed by multiple writes would then have
> to be written slightly differently: First the read outside of the group,
> linked to a group of writes. I honestly think this makes more sense as
> an interface, too, because then links are for sequential things and
> groups are (only) for parallel things. This feels clearer than having
> both a sequential and a parallel element in groups.

Group also implements 1:N dependency, in which N members depends on single
group leader, meantime there isn't any dependency among each members. That
is something the current io_uring is missing.

> 
> > > I suppose you could hack around the sequential nature of the first
> > > request by using an extra NOP as the group leader - which isn't any
> > > worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
> > > the group completion would still be missing. (Of course, removing the
> > > sequential first operation would mean that ublk wouldn't have the buffer
> > > ready any more when the other requests try to use it, so that would
> > > defeat the purpose of the series...)
> > > 
> > > I wonder if we can still combine both approaches and create some
> > > generally useful infrastructure and not something where it's visible
> > > that it was designed mostly for ublk's special case and other use cases
> > > just happened to be enabled as a side effect.
> > 
> > sqe group is actually one generic interface, please see the multiple
> > copy( copy one file to multiple destinations in single syscall for one
> > range) example in the last patch
> 
> Yes, that's an example that happens to work well with the model that you
> derived from ublk.

Not only for ublk and device zero copy, it also have the multiple copy example.

> 
> If you have the opposite case, reading a buffer that is spread across
> multiple files and then writing it to one target (i.e. first step
> parallel, second step sequential), you can't represent this well
> currently. You could work around it by having a NOP leader, but that's
> not very elegant.

Yeah, linking the group(nop & reads) with the following write does
work for the above copy case, :-)

> 
> This asymmetry suggests that it's not the perfect interface yet.

1:N dependency requires the asymmetry, an nothing in this world is perfect, :-)
But we can try to make it better.

> 
> If the whole group runs in parallel instead, including the leader, then
> both examples become symmetrical. You have a group for the parallel I/O
> and a linked single request for the other operation.
> 
> Or if both steps are parallel, you can just have two linked groups.

I think sqe group can be extended to this way easily by one new flag if
there is such real use case. We still can use leader's link flag for
same purpose, an only advance the linking chain until the whole group
is done. 

Then sqe group just degrades to single link group without 1:N dependency
covered and leader is just for providing group link flag, looks it can be
well defined and documented, and could be useful too, IMO.

> 
> > and it can support generic device zero copy: any device internal
> > buffer can be linked with io_uring operations in this way, which can't
> > be done by traditional splice/pipe.
> 
> Is this actually implemented or is it just a potential direction for the
> future?

It is potential direction since sqe group & provide buffer provides one
generic framework to export device internal buffer for further consuming
in zero copy & non-mmap way.

Thanks,
Ming