Hello,

ublk zero copy is observed to improve big chunk (64KB+) sequential IO
performance a lot. For example, IOPS of ublk-loop over tmpfs is increased
by 1~2X[1], and Jens also observed that IOPS of ublk-qcow2 can be
increased by ~1X[2]. Meanwhile, it saves memory bandwidth.

So this is one important performance improvement.

So far there are three proposals:

1) splice based

- spliced page from ->splice_read() can't be written

ublk READ request can't be handled because the spliced page can't be
written to, and extending splice for ublk zero copy isn't one good
solution[3]

- it is very hard to meet the requirements wrt. request buffer lifetime

splice/pipe focuses on page reference lifetime, but ublk zero copy pays
more attention to ublk request buffer lifetime. It is very inefficient
to respect request buffer lifetime by using every pipe buffer's
->release(), which requires all pipe buffers and the pipe to be kept
while the ublk server handles IO. That means one single dedicated
``pipe_inode_info`` has to be allocated at runtime for each provided
buffer, and the pipe needs to be populated with the pages in the ublk
request buffer.

IMO, splice isn't one good way to go from both correctness and
performance viewpoints.

2) io_uring register buffer based

- the main idea is to register one runtime buffer in the fast IO path,
and unregister it after the buffer is used by the following OPs

- the main problem is the bad performance caused by the io_uring link
model

Registering the buffer has to be one OP, and so does unregistering it;
the following normal OPs (such as FS IO) have to depend on the
registering buffer OP, so io_uring link has to be used.

It is normal to see more than one normal OP which depends on the
registering buffer OP, so all these OPs (registering buffer, normal
(FS IO) OPs and unregistering buffer) have to be linked together. The
normal (FS IO) OPs then have to be submitted one by one, which is slow,
because there is often no dependency among these normal FS OPs.
Basically, the io_uring link model does not support this kind of 1:N
dependency.

No one has posted code showing this approach yet.

3) io_uring fused command[1]

- fused command extends current io_uring usage by allowing the following
FS OPs (called secondary OPs) to be submitted after the primary command
provides the buffer, and the primary command won't be completed until
all secondary OPs are done.

This solves the problem in 2), and meanwhile avoids the buffer register
cost in both the submission and completion IO fast code paths: because
the primary command won't be completed until all secondary OPs are done,
there is no need to write/read the buffer into a per-context global data
structure.

Meanwhile the buffer lifetime problem is addressed simply, so
correctness is guaranteed, and performance is pretty good: even IOPS of
4k IO gets a little improved in some workloads, or at least no perf
regression is observed for small size IO.

A fused command can be thought of as one single request logically, just
with more than one SQE (all sharing the same link flag); that is why it
is named fused command.

- the only concern is that fused command starts one new usage of
io_uring, but I have not yet seen comments wrt. what/why is bad with
this kind of new usage/interface.

I propose this topic and want to discuss how to move on with this
feature.

[1] https://lore.kernel.org/linux-block/20230330113630.1388860-1-ming.lei@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-block/b3fc9991-4c53-9218-a8cc-5b4dd3952108@xxxxxxxxx/
[3] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/

Thanks,
Ming
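
P.S. To make the 1:N dependency argument in 2) vs. 3) a bit more
concrete, here is a toy latency model. It is only a sketch, not kernel
or liburing code: the function names, the register/unregister costs and
the per-OP latencies are all made up for illustration, and it assumes
that secondary OPs of a fused command can be fully overlapped once the
primary command has provided the buffer.

```python
# Toy latency model, NOT kernel or liburing code.
# All names and numbers here are hypothetical.


def linked_chain_latency(register_cost, op_latencies, unregister_cost):
    """Approach 2): register -> OP1 -> OP2 -> ... -> unregister, all linked.

    Each linked OP is started only after the previous one completes,
    so the individual latencies simply add up.
    """
    return register_cost + sum(op_latencies) + unregister_cost


def fused_latency(op_latencies):
    """Approach 3): every secondary OP depends only on the primary command.

    Assuming the secondary OPs can all be in flight at once, the primary
    command completes when the slowest secondary OP completes. There is
    no separate register/unregister OP in this model.
    """
    return max(op_latencies, default=0)


if __name__ == "__main__":
    ops = [40, 55, 35, 50]  # made-up per-OP latencies, in microseconds
    print("linked chain:", linked_chain_latency(2, ops, 2))
    print("fused:", fused_latency(ops))
```

With these made-up numbers, the linked chain costs the sum of all
latencies (184 us), while the fused model costs only the slowest
secondary OP (55 us). Real numbers depend on the backing device and on
how much the OPs can actually overlap.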