Re: [PATCH 5/9] io_uring: support SQE group

Pavel Begunkov <asml.silence@xxxxxxxxx> · Tue, 30 Apr 2024 13:27:10 +0100

On 4/30/24 04:03, Ming Lei wrote:
On Mon, Apr 29, 2024 at 04:32:35PM +0100, Pavel Begunkov wrote:
On 4/23/24 14:08, Kevin Wolf wrote:
Am 22.04.2024 um 20:27 hat Jens Axboe geschrieben:
On 4/7/24 7:03 PM, Ming Lei wrote:
SQE group is defined as one chain of SQEs starting with the first sqe that
has IOSQE_EXT_SQE_GROUP set, and ending with the first subsequent sqe that
doesn't have it set, and it is similar with chain of linked sqes.

The 1st SQE is group leader, and the other SQEs are group member. The group
leader is always freed after all members are completed. Group members
aren't submitted until the group leader is completed, and there isn't any
dependency among group members, and IOSQE_IO_LINK can't be set for group
members, same with IOSQE_IO_DRAIN.

Typically the group leader provides or makes resource, and the other members
consume the resource, such as scenario of multiple backup, the 1st SQE is to
read data from source file into fixed buffer, the other SQEs write data from
the same buffer into other destination files. SQE group provides very
efficient way to complete this task: 1) fs write SQEs and fs read SQE can be
submitted in single syscall, no need to submit fs read SQE first, and wait
until read SQE is completed, 2) no need to link all write SQEs together, then
write SQEs can be submitted to files concurrently. Meantime application is
simplified a lot in this way.

Another use case is to for supporting generic device zero copy:

- the lead SQE is for providing device buffer, which is owned by device or
    kernel, can't be cross userspace, otherwise easy to cause leak for devil
    application or panic

- member SQEs reads or writes concurrently against the buffer provided by lead
    SQE

In concept, this looks very similar to "sqe bundles" that I played with
in the past:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-bundle

Didn't look too closely yet at the implementation, but in spirit it's
about the same in that the first entry is processed first, and there's
no ordering implied between the test of the members of the bundle /
group.

When I first read this patch, I wondered if it wouldn't make sense to
allow linking a group with subsequent requests, e.g. first having a few
requests that run in parallel and once all of them have completed
continue with the next linked one sequentially.

For SQE bundles, you reused the LINK flag, which doesn't easily allow
this. Ming's patch uses a new flag for groups, so the interface would be
more obvious, you simply set the LINK flag on the last member of the
group (or on the leader, doesn't really matter). Of course, this doesn't
mean it has to be implemented now, but there is a clear way forward if
it's wanted.

Putting zc aside, links, graphs, groups, it all sounds interesting in
concept but let's not fool anyone, all the different ordering
relationships between requests proved to be a bad idea.

As Jens mentioned, sqe group is very similar with bundle:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-bundle

which is really something io_uring is missing.

One could've said same about links, retrospectively I argue that it
was a mistake, so I pretty much doubt arguments like "io_uring is
missing it". Another thing is that zero copy, which is not possible
to implement by returning to the userspace.

I can complaint for long, error handling is miserable, user handling
resubmitting a part of a link is horrible, the concept of errors is
hard coded (time to appreciate "beautifulness" of IOSQE_IO_HARDLINK
and the MSG_WAITALL workaround). The handling and workarounds are
leaking into generic paths, e.g. we can't init files when it's the most
convenient. For cancellation we're walking links, which need more care
than just looking at a request (is cancellation by user_data of a
"linked" to a group request even supported?). The list goes on

Only the group leader is linked, if the group leader is canceled, all
requests in the whole group will be canceled.

But yes, cancelling by user_data for group members can't be supported,
and it can be documented clearly, since user still can cancel the whole
group with group leader's user_data.

Which means it'd break the case REQ_F_INFLIGHT covers, and you need
to disallow linking REQ_F_INFLIGHT marked requests.

And what does it achieve? The infra has matured since early days,
it saves user-kernel transitions at best but not context switching
overhead, and not even that if you do wait(1) and happen to catch
middle CQEs. And it disables LAZY_WAKE, so CQ side batching with
timers and what not is effectively useless with links.

Not only the context switch, it supports 1:N or N:M dependency which

I completely missed, how N:M is supported? That starting to sound
terrifying.

is missing in io_uring, but also makes async application easier to write by
saving extra context switches, which just adds extra intermediate states for
application.

You're still executing requests (i.e. ->issue) primarily from the
submitter task context, they would still fly back to the task and
wake it up. You may save something by completing all of them
together via that refcounting, but you might just as well try to
batch CQ, which is a more generic issue. It's not clear what
context switches you save then.

As for simplicity, using the link example and considering error
handling, it only complicates it. In case of an error you need to
figure out a middle req failed, collect all failed CQEs linked to
it and automatically cancelled (unless SKIP_COMPLETE is used), and
then resubmit the failed. That's great your reads are idempotent
and presumably you don't have to resubmit half a link, but in the
grand picture of things it's rather one of use cases where a generic
feature can be used.

So, please, please! instead of trying to invent a new uber scheme
of request linking, which surely wouldn't step on same problems
over and over again, and would definitely be destined to overshadow
all previous attempts and finally conquer the world, let's rather
focus on minimasing the damage from this patchset's zero copy if
it's going to be taken.

One key problem for zero copy is lifetime of the kernel buffer, which
can't cross OPs, that is why sqe group is introduced, for aligning
kernel buffer lifetime with the group.

Right, which is why I'm saying if we're leaving groups with zero
copy, let's rather try to make them simple and not intrusive as
much as possible, instead of creating an unsupportable overarching
beast out of it, which would fail as a generic feature.

--
Pavel Begunkov