Re: [PATCH V3 5/9] io_uring: support SQE group

Pavel Begunkov <asml.silence@xxxxxxxxx> · Sun, 16 Jun 2024 19:14:37 +0100

On 6/11/24 14:32, Ming Lei wrote:
On Mon, Jun 10, 2024 at 02:55:22AM +0100, Pavel Begunkov wrote:
On 5/21/24 03:58, Ming Lei wrote:
On Sat, May 11, 2024 at 08:12:08AM +0800, Ming Lei wrote:
SQE group is defined as one chain of SQEs starting with the first SQE that
has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
doesn't have it set, and it is similar to a chain of linked SQEs.

Unlike linked SQEs, where each sqe is issued after the previous one is
completed, all SQEs in one group are submitted in parallel, so there isn't
any dependency among SQEs in one group.
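
The group boundary rule above can be sketched in plain C, without touching
the kernel or liburing. This is only an illustration under stated
assumptions: SKETCH_IOSQE_SQE_GROUP and sketch_assign_leaders() are
hypothetical names, and the bit value merely follows the statement that the
flag takes the last bit of the 8-bit sqe->flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: stand-in for the proposed IOSQE_SQE_GROUP (last bit of the
 * 8-bit sqe->flags). Not taken from any released header. */
#define SKETCH_IOSQE_SQE_GROUP (1U << 7)

/*
 * Hypothetical helper: for each SQE, record the index of its group leader.
 * A group starts at the first SQE with the group flag set and ends at the
 * first subsequent SQE without it (that SQE still belongs to the group).
 * An SQE outside any group is treated as its own leader.
 * Returns the number of submission units (groups plus standalone SQEs).
 */
size_t sketch_assign_leaders(const uint8_t *flags, size_t n, size_t *leader)
{
    size_t units = 0;
    size_t i = 0;

    assert(flags != NULL && leader != NULL);

    while (i < n) {
        size_t head = i;

        /* Consume every SQE that keeps the group open... */
        while (i < n && (flags[i] & SKETCH_IOSQE_SQE_GROUP))
            leader[i++] = head;

        /* ...plus the first SQE without the flag, which closes it. */
        if (i < n)
            leader[i++] = head;

        units++;
    }
    return units;
}
```

For flags {G, G, 0, 0, G, 0} this yields three units: SQEs 0-2 form one
group led by SQE 0, SQE 3 stands alone, and SQEs 4-5 form a group led by
SQE 4.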

The 1st SQE is the group leader, and the other SQEs are group members. The
whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN from the group
leader, and the two flags are ignored for group members.

When the group is in one link chain, this group isn't submitted until the
previous SQE or group is completed. And the following SQE or group can't
be started if this group isn't completed. Failure from any group member will
fail the group leader, then the link chain can be terminated.

When IOSQE_IO_DRAIN is set for group leader, all requests in this group and
previous requests submitted are drained. Given IOSQE_IO_DRAIN can be set for
group leader only, we respect IO_DRAIN by always completing group leader as
the last one in the group.

Working together with IOSQE_IO_LINK, SQE group provides a flexible way to
support N:M dependency, such as:

- group A is chained with group B together
- group A has N SQEs
- group B has M SQEs

then M SQEs in group B depend on N SQEs in group A.
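
As an illustration of the N:M layout just described, the per-SQE flags
could be filled in as below. This is a sketch under assumptions, not code
from the patch: the SKETCH_* names and sketch_layout_n_to_m() are
hypothetical, the link bit is believed to match IOSQE_IO_LINK in the uapi
header, and the group bit is only assumed from "the last bit in
sqe->flags".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Believed to match IOSQE_IO_LINK in the io_uring uapi header. */
#define SKETCH_IOSQE_IO_LINK   (1U << 2)
/* Assumption: stand-in for the proposed IOSQE_SQE_GROUP. */
#define SKETCH_IOSQE_SQE_GROUP (1U << 7)

/*
 * Hypothetical helper: lay out flags for group A (n SQEs) chained to
 * group B (m SQEs). The caller provides room for n + m entries.
 * Returns the number of entries written.
 */
size_t sketch_layout_n_to_m(uint8_t *flags, size_t n, size_t m)
{
    size_t k = 0;

    assert(flags != NULL && n >= 1 && m >= 1);

    for (size_t i = 0; i < n; i++) {
        uint8_t f = 0;

        /* All but the last SQE of group A keep the group open. */
        if (i + 1 < n)
            f |= (uint8_t)SKETCH_IOSQE_SQE_GROUP;
        /* Only the leader's link flag counts for the whole group. */
        if (i == 0)
            f |= (uint8_t)SKETCH_IOSQE_IO_LINK;
        flags[k++] = f;
    }

    for (size_t i = 0; i < m; i++) {
        /* Group B ends the chain here, so its leader has no link flag. */
        flags[k++] = (i + 1 < m) ? (uint8_t)SKETCH_IOSQE_SQE_GROUP : 0;
    }
    return k;
}
```

With n = 3 and m = 2 the result is {GROUP|LINK, GROUP, 0, GROUP, 0}: the
three SQEs of group A run in parallel, and the two SQEs of group B start
only after all of group A has completed.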

N:M dependency can support some interesting use cases in an efficient way:

1) read from multiple files, then write the read data into single file

2) read from single file, and write the read data into multiple files

3) write same data into multiple files, and read data from multiple files and
compare if correct data is written

Also IOSQE_SQE_GROUP takes the last bit in sqe->flags, but we can still
extend sqe->flags with one uring context flag, such as using __pad3 for
non-uring_cmd OPs and part of uring_cmd_flags for uring_cmd OP.

Suggested-by: Kevin Wolf <kwolf@xxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>

BTW, I wrote one link-grp-cp.c liburing/example which is based on sqe group.
It keeps QD unchanged and just re-organizes IOs in the following way:

- each group has 4 READ IOs, linked to one single write IO for writing
    the read data in the sqe group to the destination file

IIUC it's comparing 1 large write request with 4 small, and

It is actually reasonable from the storage device viewpoint: concurrent
small READs are often faster than a single big READ, but concurrent small
writes are usually slower.

It is, but that doesn't make the comparison apples to apples.
Even for what I described, though it's better (same number
of syscalls but better parallelism as you don't block the next
batch of reads by writes), you can argue it's not a
completely fair comparison either since it needs a different
number of buffers, etc.

it's not exactly anything close to fair. And you can do the same
in userspace (without links). And having control in userspace

No, you can't do it with single syscall.

That's called you _can_ do it. And syscalls are not everything:
context switching turned out to be a bigger problem, and executing
links does exactly that.

--
Pavel Begunkov