On Tue, Mar 07, 2023 at 05:17:04PM +0000, Pavel Begunkov wrote: > On 3/7/23 15:37, Pavel Begunkov wrote: > > On 3/7/23 14:15, Ming Lei wrote: > > > Hello, > > > > > > Add IORING_OP_FUSED_CMD, it is one special URING_CMD, which has to > > > be SQE128. The 1st SQE(master) is one 64byte URING_CMD, and the 2nd > > > 64byte SQE(slave) is another normal 64byte OP. For any OP which needs > > > to support slave OP, io_issue_defs[op].fused_slave needs to be set as 1, > > > and its ->issue() can retrieve/import buffer from master request's > > > fused_cmd_kbuf. The slave OP is actually submitted from kernel, part of > > > this idea is from Xiaoguang's ublk ebpf patchset, but this patchset > > > submits slave OP just like normal OP issued from userspace, that said, > > > SQE order is kept, and batching handling is done too. > > > > From a quick look through patches it all looks a bit complicated > > and intrusive, all over generic hot paths. I think instead we > > should be able to use registered buffer table as intermediary and > > reuse splicing. Let me try it out > > Here we go, isolated in a new opcode, and in the end should work > with any file supporting splice. It's a quick prototype, it's lacking > and there are many obvious fatal bugs. It also needs some optimisations, > improvements on how executed by io_uring and extra stuff like > memcpy ops and fixed buf recv/send. I'll clean it up. > > I used a test below, it essentially does zc recv. > > https://github.com/isilence/liburing/commit/81fe705739af7d9b77266f9aa901c1ada870739d > > > From 87ad9e8e3aed683aa040fb4b9ae499f8726ba393 Mon Sep 17 00:00:00 2001 > Message-Id: <87ad9e8e3aed683aa040fb4b9ae499f8726ba393.1678208911.git.asml.silence@xxxxxxxxx> > From: Pavel Begunkov <asml.silence@xxxxxxxxx> > Date: Tue, 7 Mar 2023 17:01:44 +0000 > Subject: [POC 1/1] io_uring: splicing into reg buf table > > EXTREMELY BUGGY! Not for inclusion. > > Add a new operation called IORING_OP_SPLICE_FROM, > which splices from a file into the registered buffer table. This is > done in a zerocopy fashion with a caveat that the user won't have > direct access to the data, however it can use it with any io_uring > request supporting registered buffers. > > Signed-off-by: Pavel Begunkov <asml.silence@xxxxxxxxx> > --- > include/uapi/linux/io_uring.h | 1 + > io_uring/io_uring.c | 4 +- > io_uring/opdef.c | 10 ++++ > io_uring/splice.c | 98 +++++++++++++++++++++++++++++++++++ > io_uring/splice.h | 3 ++ > 5 files changed, 114 insertions(+), 2 deletions(-) > > diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h > index 709de6d4feb2..a91ce1d2ebd7 100644 > --- a/include/uapi/linux/io_uring.h > +++ b/include/uapi/linux/io_uring.h > @@ -223,6 +223,7 @@ enum io_uring_op { > IORING_OP_URING_CMD, > IORING_OP_SEND_ZC, > IORING_OP_SENDMSG_ZC, > + IORING_OP_SPLICE_FROM, > /* this goes last, obviously */ > IORING_OP_LAST, > diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c > index 7625597b5227..b7389a6ea190 100644 > --- a/io_uring/io_uring.c > +++ b/io_uring/io_uring.c > @@ -2781,8 +2781,8 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx) > io_wait_rsrc_data(ctx->file_data); > mutex_lock(&ctx->uring_lock); > - if (ctx->buf_data) > - __io_sqe_buffers_unregister(ctx); > + // if (ctx->buf_data) > + // __io_sqe_buffers_unregister(ctx); > if (ctx->file_data) > __io_sqe_files_unregister(ctx); > io_cqring_overflow_kill(ctx); > diff --git a/io_uring/opdef.c b/io_uring/opdef.c > index cca7c5b55208..28d4fa42676b 100644 > --- a/io_uring/opdef.c > +++ b/io_uring/opdef.c > @@ -428,6 +428,13 @@ const struct io_issue_def io_issue_defs[] = { > .prep = io_eopnotsupp_prep, > #endif > }, > + [IORING_OP_SPLICE_FROM] = { > + .needs_file = 1, > + .unbound_nonreg_file = 1, > + // .pollin = 1, > + .prep = io_splice_from_prep, > + .issue = io_splice_from, > + } > }; > @@ -648,6 +655,9 @@ const struct io_cold_def io_cold_defs[] = { > .fail = io_sendrecv_fail, > #endif > }, > + [IORING_OP_SPLICE_FROM] = { > + .name = "SPLICE_FROM", > + } > }; > const char *io_uring_get_opcode(u8 opcode) > diff --git a/io_uring/splice.c b/io_uring/splice.c > index 2a4bbb719531..0467e9f46e99 100644 > --- a/io_uring/splice.c > +++ b/io_uring/splice.c > @@ -8,11 +8,13 @@ > #include <linux/namei.h> > #include <linux/io_uring.h> > #include <linux/splice.h> > +#include <linux/nospec.h> > #include <uapi/linux/io_uring.h> > #include "io_uring.h" > #include "splice.h" > +#include "rsrc.h" > struct io_splice { > struct file *file_out; > @@ -119,3 +121,99 @@ int io_splice(struct io_kiocb *req, unsigned int issue_flags) > io_req_set_res(req, ret, 0); > return IOU_OK; > } > + > +struct io_splice_from { > + struct file *file; > + loff_t off; > + u64 len; > +}; > + > + > +int io_splice_from_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) > +{ > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); > + > + if (unlikely(sqe->splice_flags || sqe->splice_fd_in || sqe->ioprio || > + sqe->addr || sqe->addr3)) > + return -EINVAL; > + > + req->buf_index = READ_ONCE(sqe->buf_index); > + > + sp->len = READ_ONCE(sqe->len); > + if (unlikely(!sp->len)) > + return -EINVAL; > + > + sp->off = READ_ONCE(sqe->off); > + return 0; > +} > + > +int io_splice_from(struct io_kiocb *req, unsigned int issue_flags) > +{ > + struct io_splice_from *sp = io_kiocb_to_cmd(req, struct io_splice_from); > + loff_t *ppos = (sp->off == -1) ? NULL : &sp->off; > + struct io_mapped_ubuf *imu; > + struct pipe_inode_info *pi; > + struct io_ring_ctx *ctx; > + unsigned int pipe_tail; > + int ret, i, nr_pages; > + u16 index; > + > + if (!sp->file->f_op->splice_read) > + return -ENOTSUPP; > + > + pi = alloc_pipe_info(); The above should be replaced with direct pipe, otherwise every time allocating one pipe inode really hurts performance. > + if (!pi) > + return -ENOMEM; > + pi->readers = 1; > + > + ret = sp->file->f_op->splice_read(sp->file, ppos, pi, sp->len, 0); > + if (ret < 0) > + goto done; > + > + nr_pages = pipe_occupancy(pi->head, pi->tail); > + imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL); > + if (!imu) > + goto done; > + > + ret = 0; > + pipe_tail = pi->tail; > + for (i = 0; !pipe_empty(pi->head, pipe_tail); i++) { > + unsigned int mask = pi->ring_size - 1; // kill mask > + struct pipe_buffer *buf = &pi->bufs[pipe_tail & mask]; > + > + bvec_set_page(&imu->bvec[i], buf->page, buf->len, buf->offset); > + ret += buf->len; > + pipe_tail++; > + } > + if (WARN_ON_ONCE(i != nr_pages)) > + return -EFAULT; > + > + ctx = req->ctx; > + io_ring_submit_lock(ctx, issue_flags); > + if (unlikely(req->buf_index >= ctx->nr_user_bufs)) { > + /* TODO: cleanup pages */ > + ret = -EFAULT; > + kvfree(imu); > + goto done_unlock; > + } > + index = array_index_nospec(req->buf_index, ctx->nr_user_bufs); > + if (ctx->user_bufs[index] != ctx->dummy_ubuf) { > + /* TODO: cleanup pages */ > + kvfree(imu); > + ret = -EFAULT; > + goto done_unlock; > + } > + > + imu->ubuf = 0; > + imu->ubuf_end = ret; > + imu->nr_bvecs = nr_pages; > + ctx->user_bufs[index] = imu; Your patch looks like transferring pages ownership to io_uring fixed buffer, but unfortunately it can't be done in this way. splice is supposed for moving data, not transfer buffer ownership. https://lore.kernel.org/linux-block/CAJfpeguQ3xn2-6svkkVXJ88tiVfcDd-eKi1evzzfvu305fMoyw@xxxxxxxxxxxxxx/ 1) pages are actually owned by device side(ublk, here: sp->file), but we want to loan them to io_uring normal OPs. 2) after these pages are used by io_uring normal OPs, these pages have been returned back to sp->file, and the notification has to be done explicitly, because page is owned by sp->file of splice_read(). 3) pages RW direction has to limited strictly, and in case of ublk/fuse, device pages can only be read or write which depends on user io request direction. Also IMO it isn't good to add one buffer to ctx->user_bufs[] oneshot and retrieve it oneshot, and it can be set via req->imu simply in one fused command. Thanks, Ming