On Mon, Mar 27, 2023 at 05:04:01PM +0100, Pavel Begunkov wrote:
> On 3/21/23 09:17, Ziyang Zhang wrote:
> > On 2023/3/19 00:23, Pavel Begunkov wrote:
> > > On 3/16/23 03:13, Xiaoguang Wang wrote:
> > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to
> > > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd
> > > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs
> > > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1,
> > > > > and its ->issue() can retrieve/import buffer from master request's
> > > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of
> > > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset
> > > > > submits slave OP just like normal OP issued from userspace, that said,
> > > > > SQE order is kept, and batching handling is done too.
> > > > Thanks for this great work, seems that we're now in the right direction
> > > > to support ublk zero copy, I believe this feature will improve io throughput
> > > > greatly and reduce ublk's cpu resource usage.
> > > >
> > > > I have gone through your 2nd patch, and have some little concerns here:
> > > > Say we have one ublk loop target device, but it has 4 backend files,
> > > > every file will carry 25% of device capacity and it's implemented in striped
> > > > way, then for every io request, current implementation will need to issue 4
> > > > fused_cmd, right? 4 slave sqes are necessary, but it would be better to
> > > > have just one master sqe, so I wonder whether we can have another
> > > > method. The key point is to let io_uring support register various kernel
> > > > memory objects, which come from kernel, such as ITER_BVEC or
> > > > ITER_KVEC. so how about below actions:
> > > > 1. add a new infrastructure in io_uring, which will support to register
> > > > various kernel memory objects in it, this new infrastructure could be
> > > > maintained in an xarray structure, every memory object in it will have
> > > > a unique id. This registration could be done in a ublk uring cmd, io_uring
> > > > offers registration interface.
> > > > 2. then any sqe can use these memory objects freely, so long as it
> > > > passes above unique id in sqe properly.
> > > > Above are just rough ideas, just for your reference.
> > >
> > > It precisely hints on what I proposed a bit earlier, that makes
> > > me not alone thinking that it's a good idea to have a design allowing
> > > 1) multiple ops using a buffer and 2) to limiting it to one single
> > > submission because the userspace might want to preprocess a part
> > > of the data, multiplex it or on the opposite divide. I was mostly
> > > coming from non ublk cases, and one example would be such zc recv,
> > > parsing the app level headers and redirecting the rest of the data
> > > somewhere.
> > >
> > > I haven't got a chance to work on it but will return to it in
> > > a week. The discussion was here:
> > >
> > > https://lore.kernel.org/all/ce96f7e7-1315-7154-f540-1a3ff0215674@xxxxxxxxx/
> > >
> >
> > Hi Pavel and all,
> >
> > I think it is a good idea to register some kernel objects(such as bvec)
> > in io_uring and return a cookie(such as buf_idx) for READ/WRITE/SEND/RECV sqes.
> > There are some ways to register user's buffer such as IORING_OP_PROVIDE_BUFFERS
> > and IORING_REGISTER_PBUF_RING but there is not a way to register kernel buffer(bvec).
> >
> > I do not think reusing splice is a good idea because splice should run in io-wq.
>
> The reason why I disabled inline splice execution is because do_splice()
> and below the stack doesn't support nowait well enough, which is not a
> problem when we hook directly under the ->splice_read() callback and
> operate only with one file at a time at the io_uring level.

I believe I have explained several times[1][2] that it isn't a good solution
for ublk zero copy. But if you insist on reusing splice for this feature,
please share your code and I'm happy to give a review.

[1] https://lore.kernel.org/linux-block/ZB8B8cr1%2FqIcPdRM@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#m1bfa358524b6af94731bcd5be28056f9f4408ecf
[2] https://github.com/ming1/linux/blob/my_v6.3-io_uring_fuse_cmd_v4/Documentation/block/ublk.rst#zero-copy

Thanks,
Ming