Re: [PATCH V3 5/9] io_uring: support SQE group

On 5/21/24 03:58, Ming Lei wrote:
On Sat, May 11, 2024 at 08:12:08AM +0800, Ming Lei wrote:
SQE group is defined as one chain of SQEs starting with the first SQE that
has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
doesn't have it set; it is similar to a chain of linked SQEs.

Unlike linked SQEs, where each SQE is issued only after the previous one
completes, all SQEs in one group are submitted in parallel, so there isn't
any dependency among SQEs in one group.

The 1st SQE is the group leader, and the other SQEs are group members. The
whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN taken from the
group leader, and the two flags are ignored for group members.
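
(For illustration only: a minimal liburing-style sketch of how one group
might be assembled. IOSQE_SQE_GROUP comes from this series and isn't in
released headers, so the fallback value below is only an assumption based
on it taking the last bit in sqe->flags; queue_group() is a made-up helper
and error handling is omitted.)

#include <liburing.h>

/* flag added by this series; value assumed from "last bit in sqe->flags" */
#ifndef IOSQE_SQE_GROUP
#define IOSQE_SQE_GROUP (1U << 7)
#endif

static void queue_group(struct io_uring *ring, int fd, char *buf,
			unsigned chunk)
{
	struct io_uring_sqe *sqe;

	/* group leader: IOSQE_SQE_GROUP set */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf, chunk, 0);
	io_uring_sqe_set_flags(sqe, IOSQE_SQE_GROUP);

	/* group member: flag still set */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf + chunk, chunk, chunk);
	io_uring_sqe_set_flags(sqe, IOSQE_SQE_GROUP);

	/* first SQE without the flag: last member, ends the group */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_read(sqe, fd, buf + 2 * chunk, chunk, 2 * chunk);
	io_uring_sqe_set_flags(sqe, 0);
}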

When the group is in a link chain, the group isn't submitted until the
previous SQE or group is completed, and the following SQE or group can't
be started until this group is completed. Failure of any group member
fails the group leader, and then the link chain can be terminated.

When IOSQE_IO_DRAIN is set for the group leader, all requests in this group
and all previously submitted requests are drained. Given that IOSQE_IO_DRAIN
can be set for the group leader only, we respect IO_DRAIN by always completing
the group leader as the last request in the group.

Working together with IOSQE_IO_LINK, SQE group provides a flexible way to
support N:M dependencies, such as:

- group A is chained to group B
- group A has N SQEs
- group B has M SQEs

then M SQEs in group B depend on N SQEs in group A.

N:M dependencies support some interesting use cases in an efficient way:

1) read from multiple files, then write the read data into a single file

2) read from a single file, then write the read data into multiple files

3) write the same data into multiple files, then read it back from those
files and verify that the correct data was written
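
(Again for illustration only, a sketch of case 1) above, using the same
assumed IOSQE_SQE_GROUP value as in the earlier sketch: the group leader
carries IOSQE_IO_LINK for the whole group, so the single write is started
only after all N reads have completed. queue_group_copy() and its
parameters are made up, error handling is omitted, and nr_src >= 2 is
assumed so the write doesn't get pulled into the group.)

static void queue_group_copy(struct io_uring *ring, int src_fds[],
			     int nr_src, int dst_fd, char *buf,
			     unsigned chunk)
{
	struct io_uring_sqe *sqe;
	int i;

	/* group A: N reads from multiple files, issued in parallel */
	for (i = 0; i < nr_src; i++) {
		unsigned flags = 0;

		sqe = io_uring_get_sqe(ring);
		io_uring_prep_read(sqe, src_fds[i], buf + i * chunk,
				   chunk, 0);
		if (i == 0)
			/* leader: group flag plus the shared link flag */
			flags = IOSQE_SQE_GROUP | IOSQE_IO_LINK;
		else if (i != nr_src - 1)
			/* member: keeps the group flag */
			flags = IOSQE_SQE_GROUP;
		/* the last read clears the flag, which ends the group */
		io_uring_sqe_set_flags(sqe, flags);
	}

	/* issued only after the whole group above has completed */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_write(sqe, dst_fd, buf, nr_src * chunk, 0);
	io_uring_sqe_set_flags(sqe, 0);
}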

Also, IOSQE_SQE_GROUP takes the last bit in sqe->flags, but we can still
extend sqe->flags with a uring context flag, such as using __pad3 for
non-uring_cmd OPs and part of uring_cmd_flags for the uring_cmd OP.

Suggested-by: Kevin Wolf <kwolf@xxxxxxxxxx>
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>

BTW, I wrote one link-grp-cp.c liburing example which is based on SQE group;
it keeps QD unchanged and just reorganizes the IOs in the following way:

- each group has 4 READ IOs, linked to one single WRITE IO that writes
   the read data from the group to the destination file

IIUC it's comparing 1 large write request with 4 small ones, and
that's not exactly anything close to fair. And you can do the same
in userspace (without links). And having control in userspace
you can do more fun tricks, like interleaving the writes for one
batch with the reads from the next batch.


- the 1st 12 groups have (4 + 1) IOs, and the last group has (3 + 1)
   IOs


Run the example to copy between two block devices (from virtio-blk to
virtio-scsi in my test VM):

1) buffered copy:
- perf is improved by 5%

2) direct IO mode
- perf is improved by 27%


[1] link-grp-cp.c example

https://github.com/ming1/liburing/commits/sqe_group_v2/


[2] one bug fix (top commit) against V3

https://github.com/ming1/linux/commits/io_uring_sqe_group_v3/



Thanks,
Ming


--
Pavel Begunkov



