Re: [PATCH 5/9] io_uring: support SQE group

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
> On 4/7/24 7:03 PM, Ming Lei wrote:
> > SQE group is defined as one chain of SQEs starting with the first sqe that
> > has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
> > doesn't have it set, and it is similar with chain of linked sqes.
> > 
> > The 1st SQE is group leader, and the other SQEs are group member. The group
> > leader is always freed after all members are completed. Group members
> > aren't submitted until the group leader is completed, and there isn't any
> > dependency among group members, and IOSQE_IO_LINK can't be set for group
> > members, same with IOSQE_IO_DRAIN.
> > 
> > Typically the group leader provides or makes resource, and the other members
> > consume the resource, such as scenario of multiple backup, the 1st SQE is to
> > read data from source file into fixed buffer, the other SQEs write data from
> > the same buffer into other destination files. SQE group provides very
> > efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
> > submitted in single syscall, no need to submit fs read SQE first, and wait
> > until read SQE is completed, 2) no need to link all write SQEs together, then
> > write SQEs can be submitted to files concurrently. Meantime application is
> > simplified a lot in this way.
> > 
> > Another use case is to for supporting generic device zero copy:
> > 
> > - the lead SQE is for providing device buffer, which is owned by device or
> >   kernel, can't be cross userspace, otherwise easy to cause leak for devil
> >   application or panic
> > 
> > - member SQEs reads or writes concurrently against the buffer provided by lead
> >   SQE
> 
> In concept, this looks very similar to "sqe bundles" that I played with
> in the past:
> 
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle
> 
> Didn't look too closely yet at the implementation, but in spirit it's
> about the same in that the first entry is processed first, and there's
> no ordering implied between the test of the members of the bundle /
> group.

When I first read this patch, I wondered if it wouldn't make sense to
allow linking a group with subsequent requests, e.g. first having a few
requests that run in parallel and once all of them have completed
continue with the next linked one sequentially.

For SQE bundles, you reused the LINK flag, which doesn't easily allow
this. Ming's patch uses a new flag for groups, so the interface would be
more obvious, you simply set the LINK flag on the last member of the
group (or on the leader, doesn't really matter). Of course, this doesn't
mean it has to be implemented now, but there is a clear way forward if
it's wanted.

The part that looks a bit arbitrary in Ming's patch is that the group
leader is always completed before the rest starts. It makes perfect
sense in the context that this series is really after (enabling zero
copy for ublk), but it doesn't really allow the case you mention in the
SQE bundle commit message, running everything in parallel and getting a
single CQE for the whole group.

I suppose you could hack around the sequential nature of the first
request by using an extra NOP as the group leader - which isn't any
worse than having an IORING_OP_BUNDLE really, just looks a bit odd - but
the group completion would still be missing. (Of course, removing the
sequential first operation would mean that ublk wouldn't have the buffer
ready any more when the other requests try to use it, so that would
defeat the purpose of the series...)

I wonder if we can still combine both approaches and create some
generally useful infrastructure and not something where it's visible
that it was designed mostly for ublk's special case and other use cases
just happened to be enabled as a side effect.

Kevin





[Index of Archives]     [Linux RAID]     [Linux SCSI]     [Linux ATA RAID]     [IDE]     [Linux Wireless]     [Linux Kernel]     [ATH6KL]     [Linux Bluetooth]     [Linux Netdev]     [Kernel Newbies]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Device Mapper]

  Powered by Linux