On 10/29/24 01:50, Ming Lei wrote:
On Mon, Oct 28, 2024 at 06:12:34PM -0600, Jens Axboe wrote:
On 10/25/24 6:22 AM, Ming Lei wrote:
SQE group is defined as one chain of SQEs starting with the first SQE that
has IOSQE_SQE_GROUP set, and ending with the first subsequent SQE that
doesn't have it set. It is similar to a chain of linked SQEs.
Unlike linked SQEs, where each SQE is issued only after the previous one
has completed, all SQEs in one group can be submitted in parallel. To keep
the initial implementation simple, all members are queued after the leader
has completed; this may change in the future so that the leader and
members are issued concurrently.
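The grouping rule above can be sketched in plain C, outside the kernel. This is only an illustration: the value of IOSQE_SQE_GROUP (bit 7) is an assumption based on "the last bit in sqe->flags", and struct sqe_span and split_sqe_groups() are hypothetical names, not part of the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: the new flag occupies bit 7, the last bit of the 8-bit sqe->flags. */
#define IOSQE_SQE_GROUP (1U << 7)

/* Hypothetical description of one submission unit. */
struct sqe_span {
    size_t leader; /* index of the first SQE of the unit (the group leader) */
    size_t count;  /* leader plus members; 1 means a plain, ungrouped SQE */
};

/*
 * Walk an array of sqe->flags values and split it into groups.
 * A group starts at the first SQE with IOSQE_SQE_GROUP set and ends with
 * the first subsequent SQE that does not have it set (that SQE is still
 * part of the group). 'out' must have room for n entries.
 */
static size_t split_sqe_groups(const uint8_t *flags, size_t n,
                               struct sqe_span *out)
{
    size_t nr = 0, i = 0;

    while (i < n) {
        size_t start = i;

        /* consume every SQE that carries the group flag ... */
        while (i < n && (flags[i] & IOSQE_SQE_GROUP))
            i++;
        /* ... plus the first one without it, which closes the unit */
        if (i < n)
            i++;

        out[nr].leader = start;
        out[nr].count = i - start;
        nr++;
    }
    return nr;
}
```

With flags {GROUP, GROUP, 0, 0, GROUP, 0} this yields a group of three SQEs, one plain SQE, and a group of two.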
The first SQE is the group leader, and the other SQEs are group members.
The whole group shares a single IOSQE_IO_LINK and IOSQE_IO_DRAIN taken from
the group leader, and the two flags can't be set on group members. For the
sake of simplicity, IORING_OP_LINK_TIMEOUT is disallowed for SQE groups for
now.
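A minimal sketch of the flag rule, assuming the same bit-7 value for IOSQE_SQE_GROUP as the only invented constant (IOSQE_IO_DRAIN and IOSQE_IO_LINK use their existing uapi values). group_flags_valid() is a hypothetical helper; how the kernel actually reports a violation is not stated here, so the sketch only returns true or false.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define IOSQE_IO_DRAIN  (1U << 1) /* existing uapi value */
#define IOSQE_IO_LINK   (1U << 2) /* existing uapi value */
#define IOSQE_SQE_GROUP (1U << 7) /* assumption: last bit of sqe->flags */

/*
 * flags[0] belongs to the group leader, flags[1..count-1] to the members.
 * The leader may carry IOSQE_IO_LINK and IOSQE_IO_DRAIN on behalf of the
 * whole group; members must not set either flag.
 */
static bool group_flags_valid(const uint8_t *flags, size_t count)
{
    for (size_t i = 1; i < count; i++) {
        if (flags[i] & (IOSQE_IO_LINK | IOSQE_IO_DRAIN))
            return false;
    }
    return true;
}
```

The IORING_OP_LINK_TIMEOUT restriction is left out of the sketch because it depends on the opcode rather than on sqe->flags.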
When the group is part of a link chain, the group isn't submitted until the
previous SQE or group has completed, and the following SQE or group can't
be started until this group has completed. Failure of any group member
fails the group leader, so that the link chain can be terminated.
When IOSQE_IO_DRAIN is set for the group leader, all requests in this group
and all previously submitted requests are drained. Given that IOSQE_IO_DRAIN
can be set for the group leader only, we respect IO_DRAIN by always
completing the group leader as the last one in the group. It is also
natural, from the application's viewpoint, to post the leader's CQE last.
Working together with IOSQE_IO_LINK, SQE groups provide a flexible way to
support N:M dependencies, such as:
- group A is chained to group B
- group A has N SQEs
- group B has M SQEs
then the M SQEs in group B depend on the N SQEs in group A.
N:M dependencies can support some interesting use cases efficiently:
1) read from multiple files, then write the read data into single file
2) read from single file, and write the read data into multiple files
3) write the same data into multiple files, then read it back from those
files and check that the correct data was written
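As an illustration of the N:M case, the sketch below only computes the sqe->flags values for group A (N SQEs) linked to group B (M SQEs); it does not touch a ring. IOSQE_SQE_GROUP at bit 7 is an assumption, and fill_group_flags() and fill_n_to_m() are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define IOSQE_IO_LINK   (1U << 2) /* existing uapi value */
#define IOSQE_SQE_GROUP (1U << 7) /* assumption: last bit of sqe->flags */

/*
 * Fill flags for one group of 'count' SQEs. Every SQE except the last
 * carries IOSQE_SQE_GROUP; the last one closes the group. If 'link_next'
 * is non-zero, the leader carries IOSQE_IO_LINK for the whole group.
 */
static size_t fill_group_flags(uint8_t *flags, size_t count, int link_next)
{
    for (size_t i = 0; i < count; i++)
        flags[i] = (i + 1 < count) ? IOSQE_SQE_GROUP : 0;
    if (count && link_next)
        flags[0] |= IOSQE_IO_LINK;
    return count;
}

/* Group A (n SQEs) linked to group B (m SQEs): B starts after A completes. */
static size_t fill_n_to_m(uint8_t *flags, size_t n, size_t m)
{
    size_t used = fill_group_flags(flags, n, 1);

    used += fill_group_flags(flags + used, m, 0);
    return used;
}
```

For N = 3 and M = 2 the resulting flags are {GROUP|LINK, GROUP, 0, GROUP, 0}: only the leader of group A carries the link, in line with the rule that members cannot set IOSQE_IO_LINK.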
Also, IOSQE_SQE_GROUP takes the last bit in sqe->flags, but we can still
extend sqe->flags with an io_uring context flag, for example by using
__pad3 for non-uring_cmd OPs and part of uring_cmd_flags for uring_cmd OPs.
Since it's taking the last flag, maybe a better idea to have the last
flag mean "more flags in (for example) __pad3" and put the new flag
there? Not sure what you mean in terms of "io_uring context flag", would it
be an enter flag? Ring required to be setup with a certain flag? Neither
of those seem super encouraging, imho.
I meant:
If "more flags in __pad3" is enabled in the future, we may advertise it to
userspace as a feature, such as IORING_FEAT_EXT_FLAG.
Will improve the above commit log.
And we can't take it in either case. The field is in a union, and
other opcodes use that part of the SQE. Enabling a generic feature
for a subset of requests only is not a good idea.
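To make the union point concrete, here is a simplified model of that 32-bit slot, written from my reading of the uapi header and not a copy of it. union sqe_slot and pad3_after_file_index() are hypothetical names, and pad3 stands in for __pad3.

```c
#include <stdint.h>

/*
 * Simplified model of the union in struct io_uring_sqe that holds __pad3.
 * The padding shares its storage with fields other opcodes already use.
 */
union sqe_slot {
    int32_t  splice_fd_in;
    uint32_t file_index;
    uint32_t optlen;
    struct {
        uint16_t addr_len;
        uint16_t pad3[1]; /* stands in for __pad3 */
    };
};

/* Store a file_index and report what the pad3 view of the slot reads. */
static uint16_t pad3_after_file_index(uint32_t file_index)
{
    union sqe_slot slot = { 0 };

    slot.file_index = file_index;
    return slot.pad3[0];
}
```

Any opcode that stores a full 32-bit value in file_index, splice_fd_in or optlen also writes the bytes behind the padding, so bits placed there cannot be interpreted uniformly for all requests.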
--
Pavel Begunkov