On Sat, Feb 11, 2023 at 09:52:58AM -0700, Jens Axboe wrote: > On 2/11/23 9:12?AM, Ming Lei wrote: > > On Sat, Feb 11, 2023 at 08:45:18AM -0700, Jens Axboe wrote: > >> On 2/10/23 8:32?AM, Ming Lei wrote: > >>> IORING_OP_READ_SPLICE_BUF: read to buffer which is built from > >>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len) > >>> for building buffer. > >>> > >>> IORING_OP_WRITE_SPLICE_BUF: write from buffer which is built from > >>> ->read_splice() of specified fd, so user needs to provide (splice_fd, offset, len) > >>> for building buffer. > >>> > >>> The typical use case is for supporting ublk/fuse io_uring zero copy, > >>> and READ/WRITE OP retrieves ublk/fuse request buffer via direct pipe > >>> from device->read_splice(), then READ/WRITE can be done to/from this > >>> buffer directly. > >> > >> Main question here - would this be better not plumbed up through the rw > >> path? Might be cleaner, even if it either requires a bit of helper > >> refactoring or accepting a bit of duplication. But would still be better > >> than polluting the rw fast path imho. > > > > The buffer is actually IO buffer, which has to be plumbed up in IO path, > > and it can't be done like the registered buffer. > > > > The only affect on fast path is : > > > > if (io_rw_splice_buf(req)) //which just check opcode > > return io_prep_rw_splice_buf(req, sqe); > > > > and the cleanup code which is only done for the two new OPs. > > > > Or maybe I misunderstand your point? Or any detailed suggestion? > > > > Actually the code should be factored into generic helper, since net.c > > need to use them too. Probably it needs to move to rsrc.c? > > Yep, just refactoring out those bits as a prep thing. rsrc could work, > or perhaps a new file for that. OK. > > >> Also seems like this should be separately testable. We can't add new > >> opcodes that don't have a feature test at least, and should also have > >> various corner case tests. A bit of commenting outside of this below. > > > > OK, I will write/add one very simple ublk userspace to liburing for > > test purpose. > > Thanks! Thinking of further, if we use ublk for liburing test purpose, root is often needed, even though we support un-privileged mode, which needs administrator to grant access, so is it still good to do so? It could be easier to add ->splice_read() on /dev/zero for test purpose, just allocate zeroed pages in ->splice_read(), and add them to pipe like ublk->splice_read(), and sink side can read from or write to these pages, but zero's read_iter_zero() won't be affected. And normal splice/tee won't connect to zero too because we only allow it from kernel use. > > >>> diff --git a/io_uring/opdef.c b/io_uring/opdef.c > >>> index 5238ecd7af6a..91e8d8f96134 100644 > >>> --- a/io_uring/opdef.c > >>> +++ b/io_uring/opdef.c > >>> @@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = { > >>> .prep = io_eopnotsupp_prep, > >>> #endif > >>> }, > >>> + [IORING_OP_READ_SPLICE_BUF] = { > >>> + .needs_file = 1, > >>> + .unbound_nonreg_file = 1, > >>> + .pollin = 1, > >>> + .plug = 1, > >>> + .audit_skip = 1, > >>> + .ioprio = 1, > >>> + .iopoll = 1, > >>> + .iopoll_queue = 1, > >>> + .prep = io_prep_rw, > >>> + .issue = io_read, > >>> + }, > >>> + [IORING_OP_WRITE_SPLICE_BUF] = { > >>> + .needs_file = 1, > >>> + .hash_reg_file = 1, > >>> + .unbound_nonreg_file = 1, > >>> + .pollout = 1, > >>> + .plug = 1, > >>> + .audit_skip = 1, > >>> + .ioprio = 1, > >>> + .iopoll = 1, > >>> + .iopoll_queue = 1, > >>> + .prep = io_prep_rw, > >>> + .issue = io_write, > >>> + }, > >> > >> Are these really safe with iopoll? > > > > Yeah, after the buffer is built, the handling is basically > > same with IORING_OP_WRITE_FIXED, so I think it is safe. > > Yeah, on a second look, as these are just using the normal read/write > path after that should be fine indeed. > > >> > >>> +static int io_prep_rw_splice_buf(struct io_kiocb *req, > >>> + const struct io_uring_sqe *sqe) > >>> +{ > >>> + struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw); > >>> + unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len); > >>> + loff_t splice_off = READ_ONCE(sqe->splice_off_in); > >>> + struct io_rw_splice_buf_data data; > >>> + struct io_mapped_ubuf *imu; > >>> + struct fd splice_fd; > >>> + int ret; > >>> + > >>> + splice_fd = fdget(READ_ONCE(sqe->splice_fd_in)); > >>> + if (!splice_fd.file) > >>> + return -EBADF; > >> > >> Seems like this should check for SPLICE_F_FD_IN_FIXED, and also use > >> io_file_get_normal() for the non-fixed case in case someone passed in an > >> io_uring fd. > > > > SPLICE_F_FD_IN_FIXED needs one extra word for holding splice flags, if > > we can use sqe->addr3, I think it is doable. > > I haven't checked the rest, but you can't just use ->splice_flags for > this? ->splice_flags shares memory with rwflags, so can't be used. I think it is fine to use ->addr3, given io_getxattr()/io_setxattr()/ io_msg_ring() has used that. > > In any case, the get path needs to look like io_tee() here, and: > > >>> +out_put_fd: > >>> + if (splice_fd.file) > >>> + fdput(splice_fd); > > this put needs to be gated on whether it's a fixed file or not. Yeah. > > >> If the operation is done, clear NEED_CLEANUP and do the cleanup here? > >> That'll be faster. > > > > The buffer has to be cleaned up after req is completed, since bvec > > table is needed for bio, and page reference need to be dropped after > > IO is done too. > > I mean when you clear that flag, call the cleanup bits you otherwise > would've called on later cleanup. Got it. Thanks, Ming