On Sun, Jun 16, 2024 at 07:14:37PM +0100, Pavel Begunkov wrote:
> On 6/11/24 14:32, Ming Lei wrote:
> > On Mon, Jun 10, 2024 at 02:55:22AM +0100, Pavel Begunkov wrote:
> > > On 5/21/24 03:58, Ming Lei wrote:
> > > > On Sat, May 11, 2024 at 08:12:08AM +0800, Ming Lei wrote:
> > > > > An SQE group is defined as one chain of SQEs starting with the first
> > > > > SQE that has IOSQE_SQE_GROUP set and ending with the first subsequent
> > > > > SQE that doesn't have it set, similar to a chain of linked SQEs.
> > > > >
> > > > > Unlike linked SQEs, where each SQE is issued only after the previous
> > > > > one completes, all SQEs in one group are submitted in parallel, so
> > > > > there isn't any dependency among SQEs in one group.
> > > > >
> > > > > The 1st SQE is the group leader, and the other SQEs are group members.
> > > > > The whole group shares the single IOSQE_IO_LINK and IOSQE_IO_DRAIN of
> > > > > the group leader, and the two flags are ignored for group members.
> > > > >
> > > > > When the group is in one link chain, the group isn't submitted until
> > > > > the previous SQE or group is completed, and the following SQE or group
> > > > > can't be started until this group is completed. Failure of any group
> > > > > member fails the group leader, so the link chain can be terminated.
> > > > >
> > > > > When IOSQE_IO_DRAIN is set for the group leader, all requests in this
> > > > > group and all previously submitted requests are drained. Given that
> > > > > IOSQE_IO_DRAIN can be set for the group leader only, we respect
> > > > > IO_DRAIN by always completing the group leader as the last one in the
> > > > > group.
> > > > >
> > > > > Working together with IOSQE_IO_LINK, SQE groups provide a flexible way
> > > > > to support N:M dependencies, such as:
> > > > >
> > > > > - group A is chained with group B together
> > > > > - group A has N SQEs
> > > > > - group B has M SQEs
> > > > >
> > > > > then the M SQEs in group B depend on the N SQEs in group A.
> > > > >
> > > > > N:M dependencies can support some interesting use cases efficiently:
> > > > >
> > > > > 1) read from multiple files, then write the read data into a single
> > > > > file
> > > > >
> > > > > 2) read from a single file, and write the read data into multiple
> > > > > files
> > > > >
> > > > > 3) write the same data into multiple files, then read the data back
> > > > > from those files and verify that the correct data was written
> > > > >
> > > > > Also, IOSQE_SQE_GROUP takes the last bit in sqe->flags, but we can
> > > > > still extend sqe->flags with one uring context flag, such as using
> > > > > __pad3 for non-uring_cmd OPs and part of uring_cmd_flags for the
> > > > > uring_cmd OP.
> > > > >
> > > > > Suggested-by: Kevin Wolf <kwolf@xxxxxxxxxx>
> > > > > Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> > > >
> > > > BTW, I wrote one link-grp-cp.c liburing example which is based on SQE
> > > > groups, keeping QD unchanged and just re-organizing the IOs in the
> > > > following way:
> > > >
> > > > - each group has 4 READ IOs, linked to one single WRITE IO that writes
> > > > the data read by the group to the destination file
> > >
> > > IIUC it's comparing 1 large write request with 4 small, and
> >
> > It is actually reasonable from the storage device viewpoint: concurrent
> > small READs are often faster than a single big READ, but concurrent
> > small writes are usually slower.
>
> It is, but that doesn't make the comparison apples to apples.
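
For reference, one batch in the link-grp-cp.c scheme mentioned above could be
queued roughly like this. This is only a sketch going by the semantics in the
commit message, not code from the actual example: IOSQE_SQE_GROUP (and the bit
value assumed below) comes from this patchset and is not in upstream liburing,
and queue_group_cp() is just an illustrative helper name:

#include <liburing.h>
#include <sys/uio.h>

/* Not upstream: proposed flag from this patchset, assumed here to take the
 * last bit in sqe->flags as described in the commit message.
 */
#ifndef IOSQE_SQE_GROUP
#define IOSQE_SQE_GROUP		(1U << 7)
#endif

/* Queue one batch: a group of 4 READs submitted in parallel, linked to a
 * single WRITE that runs only after the whole group has completed.
 * Error handling (full SQ ring, short reads) is omitted for brevity.
 */
static void queue_group_cp(struct io_uring *ring, int src_fd, int dst_fd,
			   struct iovec iov[4], off_t off, unsigned chunk)
{
	struct io_uring_sqe *sqe;
	int i;

	for (i = 0; i < 4; i++) {
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_readv(sqe, src_fd, &iov[i], 1, off + i * chunk);
		if (i == 0)
			/* group leader: carries the link for the whole group */
			sqe->flags |= IOSQE_SQE_GROUP | IOSQE_IO_LINK;
		else if (i < 3)
			sqe->flags |= IOSQE_SQE_GROUP;
		/* the first SQE without IOSQE_SQE_GROUP ends the group */
	}

	/* single WRITE of all the data read by the group above */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, dst_fd, iov, 4, off);
}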
> Even what I described, even though it's better (same number
> of syscalls but better parallelism as you don't block the next
> batch of reads by writes), you can argue it's not a completely
> fair comparison either since it needs a different number of
> buffers, etc.
>
> > > it's not exactly anything close to fair. And you can do the same
> > > in userspace (without links). And having control in userspace
> >
> > No, you can't do it with a single syscall.
>
> That's called you _can_ do it. And syscalls is not everything,

For ublk, syscalls do mean something: each ublk IO is handled by io_uring,
so if more syscalls are introduced for each ublk IO, performance definitely
degrades a lot, because IOPS can be at the million level. Nowadays the
syscall PTI overhead does make a difference, please see:

https://lwn.net/Articles/752587/

> context switching turned out to be a bigger problem, and to execute
> links it does exactly that.

If that were true, IO_LINK wouldn't have been needed, because you could
model the dependency with an extra io_uring syscall; unfortunately it
isn't true. IO_LINK not only simplifies application programming, it also
avoids an extra syscall.

If you compare io_uring-cp.c (282 LOC) with link-cp.c (193 LOC) in
liburing/examples, you can see that io_uring-cp.c is more complicated.
Adding one extra syscall (wait point) makes the application harder to
write, especially in a modern async/.await programming environment.

Thanks,
Ming
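
For reference, the difference in wait points looks roughly like below. This is
only a sketch with plain liburing, not code taken from link-cp.c or
io_uring-cp.c, and copy_chunk_linked() is just an illustrative name: with
IOSQE_IO_LINK the read and the dependent write reach the kernel in a single
io_uring_submit() call, while without the link the application has to submit
the read, wait for its CQE, and only then submit the write, which is the extra
wait point mentioned above.

#include <liburing.h>
#include <sys/uio.h>

/* Copy one chunk with a linked read -> write pair: both SQEs are pushed with
 * one io_uring_submit() call, and the write only starts after the read
 * completes. Like link-cp.c, short reads are not handled here.
 */
static int copy_chunk_linked(struct io_uring *ring, int src_fd, int dst_fd,
			     struct iovec *iov, off_t off)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret, i;

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_readv(sqe, src_fd, iov, 1, off);
	sqe->flags |= IOSQE_IO_LINK;	/* the write below waits for this read */

	sqe = io_uring_get_sqe(ring);
	io_uring_prep_writev(sqe, dst_fd, iov, 1, off);

	ret = io_uring_submit(ring);	/* one syscall covers both requests */
	if (ret < 0)
		return ret;

	/* reap both completions */
	for (i = 0; i < 2; i++) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}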