On Wed, Apr 19, 2023 at 03:42:40PM +0000, Bernd Schubert wrote:
> On 4/19/23 13:19, Ming Lei wrote:
> > On Wed, Apr 19, 2023 at 09:56:43AM +0000, Bernd Schubert wrote:
> >> On 4/19/23 03:51, Ming Lei wrote:
> >>> On Tue, Apr 18, 2023 at 07:38:03PM +0000, Bernd Schubert wrote:
> >>>> On 3/30/23 13:36, Ming Lei wrote:
> >>>> [...]
> >>>>> V6:
> >>>>> - re-design fused command, and make it more generic, moving sharing buffer
> >>>>>   as one plugin of fused command, so in future we can implement more plugins
> >>>>> - document potential other use cases of fused command
> >>>>> - drop support for builtin secondary sqe in SQE128, so all secondary
> >>>>>   requests have standalone SQE
> >>>>> - make fused command as one feature
> >>>>> - cleanup & improve naming
> >>>>
> >>>> Hi Ming, et al.,
> >>>>
> >>>> I started to wonder if fused SQE could be extended to combine multiple
> >>>> syscalls, for example open/read/close. Which would be another solution
> >>>> for the readfile syscall Miklos had proposed some time ago.
> >>>>
> >>>> https://lore.kernel.org/lkml/CAJfpegusi8BjWFzEi05926d4RsEQvPnRW-w7My=ibBHQ8NgCuw@xxxxxxxxxxxxxx/
> >>>>
> >>>> If fused SQEs could be extended, I think it would be quite helpful for
> >>>> many other patterns. Another similar example would be open/write/close,
> >>>> but ideal would be also to allow to have it more complex like
> >>>> "open/write/sync_file_range/close" - open/write/close might be the
> >>>> fastest and could possibly return before sync_file_range. Use case for
> >>>> the latter would be a file server that wants to give notifications to
> >>>> client when pages have been written out.
> >>>
> >>> The above pattern doesn't need fused command, and it can be done by a plain
> >>> SQE chain, used as follows:
> >>>
> >>> 1) suppose you get one command from /dev/fuse, then FUSE daemon
> >>> needs to handle the command as open/write/sync/close
> >>> 2) get sqe1, prepare it for open syscall, mark it as IOSQE_IO_LINK;
> >>> 3) get sqe2, prepare it for write syscall, mark it as IOSQE_IO_LINK;
> >>> 4) get sqe3, prepare it for sync file range syscall, mark it as IOSQE_IO_LINK;
> >>> 5) get sqe4, prepare it for close syscall
> >>> 6) io_uring_enter(); //for submit and get events
> >>
> >> Oh, I was not aware that IOSQE_IO_LINK could pass the result of open
> >> down to the others. Hmm, the example I find for open is
> >> io_uring_prep_openat_direct in test_open_fixed(). It probably gets off
> >> topic here, but one needs to have ring prepared with
> >> io_uring_register_files_sparse, then manually manages available indexes
> >> and can then link commands? Interesting!
> >
> > Yeah, see test/fixed-reuse.c of liburing
> >
> >>
> >>>
> >>> Then all the four OPs are done one by one by io_uring internal
> >>> machinery, and you can choose to get successful CQE for each OP.
> >>>
> >>> Is the above what you want to do?
> >>>
> >>> The fused command proposal is actually for zero copy (but not limited to zc).
> >>
> >> Yeah, I had just thought that IORING_OP_FUSED_CMD could be modified to
> >> support generic passing, as it kind of hands data (buffers) from one sqe
> >> to the other. I.e. instead of buffers it would have passed the fd, but
> >> if this is already possible - no need to make IORING_OP_FUSED_CMD more
> >> complex.
> >
> > The way of passing FD introduces other cost, read op running into async,
> > and adding it into global table, which introduces runtime cost.
>
> Hmm, question from my side is why it needs to be in the global table,
> when it could be just passed to the linked or fused sqe?

Any data that crosses OPs needs to be registered somewhere, such as in a
fixed buffer or fixed FD table. By "global" I meant context-wide, and that
is from the OP/SQE viewpoint.

A fused command is logically one whole command, even though it may include
multiple SQEs. Context-wide registration is therefore not needed (the buffer
is known to be shared only among several IOs, not across the whole context).
The dependency is avoided as well, so linking isn't needed either.

This helps performance a lot. For example, in a test on ublk/loop over tmpfs
with 4k random IO, IOPS drops to 1/2 with registration, while fused command
actually improves IOPS a bit. The baseline is the current in-tree ublk
driver/ublksrv.

>
> >
> > That is the reason why fused command is designed in the following way:
> >
> > - link can be avoided, so OPs needn't be run async
> > - no need to add buffer into global table
> >
> > Cause it is really in fast io path.
> >
> >>
> >>>
> >>> If the above write OP needs to write to file with in-kernel buffer
> >>> of /dev/fuse directly, you can get one sqe0 and prepare it for primary command
> >>> before 1), and set sqe2->addr to offset of the buffer in 3).
> >>>
> >>> However, fused command is usually used in the following way, such as FUSE daemon
> >>> gets one READ request from /dev/fuse, FUSE userspace can handle the READ request
> >>> as io_uring fused command:
> >>>
> >>> 1) get sqe0 and prepare it for primary command, in which you need to
> >>> provide info for retrieving kernel buffer/pages of this READ request
> >>>
> >>> 2) suppose this READ request needs to be handled by translating it to
> >>> READs to two files/devices, considering it as one mirror:
> >>>
> >>> - get sqe1, prepare it for read from file1, and set sqe->addr to offset
> >>> of the buffer in 1), set sqe->len as length for read; this READ OP
> >>> uses the kernel buffer in 1) directly
> >>>
> >>> - get sqe2, prepare it for read from file2, and set sqe->addr to offset
> >>> of buffer in 1), set sqe->len as length for read; this READ OP
> >>> uses the kernel buffer in 1) directly
> >>>
> >>> 3) submit the three sqe by io_uring_enter()
> >>>
> >>> sqe1 and sqe2 can be submitted concurrently or be issued one by one
> >>> in order, fused command supports both, and depends on user requirement.
> >>> But io_uring linked OPs are usually slower.
> >>>
> >>> Also file1/file2 needs to be opened beforehand in this example, and FD is
> >>> passed to sqe1/sqe2, another choice is to use fixed File; Also you can
> >>> add the open/close() OPs into above steps, which need these open/close/READ
> >>> to be linked in order, usually slower than non-linked OPs.
> >>
> >>
> >> Yes thanks, I'm going to prepare this in a branch, otherwise current
> >> fuse-uring would have a ZC regression (although my target ddn projects
> >> cannot make use of it, as we need access to the buffer for checksums, etc).
> >
> > storage has similar use case too, such as encrypt, nvme tcp data digest,
> > ..., if the checksum/encrypt approach is standard, maybe one new OP or
> > syscall can be added for doing that on kernel buffer directly.
>
> I very much see the use case for FUSED_CMD for overlay or simple network
> sockets. Now in the HPC world one typically uses IB RDMA and if that
> fails for some reasons (like connection down), tcp or other interfaces
> as fallback. And there is sending the right part of the buffer to the
> right server and erasure coding involved - it gets complex and I don't
> think there is a way for us without a buffer copy.

As I mentioned, this (checksum, encryption, ...) becomes a generic issue if
the zero-copy approach is accepted. The problem itself is well-defined, so I
am not worried that no solution can be found.

Meanwhile, large memory copies do consume a lot of both CPU and memory
bandwidth, and 64k/512k ublk IO has shown a big difference between copy and
zero copy.

Thanks,
Ming