On 5/11/24 01:12, Ming Lei wrote:
SQE group with REQ_F_SQE_GROUP_DEP introduces a new mechanism for sharing
a resource among a group of requests: all member requests can consume
the resource provided by the group lead efficiently and in parallel.
This patch uses the new SQE group feature REQ_F_SQE_GROUP_DEP to share a
kernel buffer within an SQE group:
- the group lead provides a kernel buffer to member requests
- member requests use the provided buffer for FS or network IO, or for
more operations in the future
- the kernel buffer is returned once the member requests are done with it
This looks somewhat similar to the kernel's pipe/splice, but there are some
important differences:
- splice transfers data between two FDs via a pipe, and fd_out can only
read data from the pipe; this feature lends the group lead's buffer to
members, so a member request can also write data into the buffer if
writing to the provided buffer is allowed.
- splice implements data transfer by moving pages between a subsystem and
the pipe, which means page ownership is transferred, and this is one of the
most complicated parts of splice; this patch supports scenarios in which
the buffer can't be transferred: the buffer is only lent to member
requests and is returned after they consume it, so buffer lifetime is
simplified a lot. In particular, the buffer is guaranteed to be returned.
- splice basically can't run asynchronously
This can help to implement generic zero copy between a device and related
operations, such as ublk, fuse, vdpa, even network receive or whatever.
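As a rough illustration of the borrow/return lifetime described above,
here is a minimal userspace C sketch. It is not the kernel code from this
patch: struct group_buf and every helper below are hypothetical names, and
a plain counter stands in for whatever reference tracking the real
implementation uses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of a buffer lent by a group lead (not a kernel struct). */
struct group_buf {
	char data[64];
	int refs;      /* the lead's reference plus one per borrowing member */
	int writable;  /* may members write into the buffer? */
	int returned;  /* set once the buffer has gone back to its provider */
};

static void group_buf_init(struct group_buf *b, int writable)
{
	memset(b, 0, sizeof(*b));
	b->refs = 1;   /* the lead's own reference */
	b->writable = writable;
}

/* A member borrows the buffer; ownership never changes hands. */
static void group_buf_get(struct group_buf *b)
{
	b->refs++;
}

/* Drop one reference; the last drop returns the buffer to the provider. */
static void group_buf_put(struct group_buf *b)
{
	assert(b->refs > 0);
	if (--b->refs == 0)
		b->returned = 1;
}

/* A member may write only if the lead allowed it. */
static int member_write(struct group_buf *b, const char *src)
{
	if (!b->writable)
		return -1;
	snprintf(b->data, sizeof(b->data), "%s", src);
	return 0;
}

/*
 * Drive one group: two members borrow, one writes, everybody drops.
 * Returns 0 if the buffer was returned only after the last user finished.
 */
static int run_group_demo(void)
{
	struct group_buf b;

	group_buf_init(&b, 1);
	group_buf_get(&b);           /* member A borrows */
	group_buf_get(&b);           /* member B borrows */

	if (member_write(&b, "payload") != 0)
		return -1;
	group_buf_put(&b);           /* member A done */
	group_buf_put(&b);           /* member B done */
	if (b.returned)
		return -2;           /* lead still holds its reference */

	group_buf_put(&b);           /* lead done: buffer goes back */
	return b.returned ? 0 : -3;
}
```

In this model run_group_demo() returns 0: the buffer is never handed over,
only referenced, and it goes back to the provider exactly when the last
reference is dropped.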
Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
---
include/linux/io_uring_types.h | 33 +++++++++++++++++++
io_uring/io_uring.c | 10 +++++-
io_uring/io_uring.h | 5 +++
io_uring/kbuf.c | 60 ++++++++++++++++++++++++++++++++++
io_uring/kbuf.h | 13 ++++++++
io_uring/net.c | 31 +++++++++++++++++-
io_uring/opdef.c | 5 +++
io_uring/opdef.h | 2 ++
io_uring/rw.c | 20 +++++++++++-
9 files changed, 176 insertions(+), 3 deletions(-)
...
diff --git a/io_uring/net.c b/io_uring/net.c
index 070dea9a4eda..83fd5879082e 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -79,6 +79,13 @@ struct io_sr_msg {
...
retry_bundle:
if (io_do_buffer_select(req)) {
struct buf_sel_arg arg = {
@@ -1132,6 +1148,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
if (unlikely(ret))
goto out_free;
sr->buf = NULL;
+ } else if (req->flags & REQ_F_GROUP_KBUF) {
+ ret = io_import_group_kbuf(req, user_ptr_to_u64(sr->buf),
+ sr->len, ITER_DEST, &kmsg->msg.msg_iter);
+ if (unlikely(ret))
+ goto out_free;
}
kmsg->msg.msg_inq = -1;
@@ -1334,6 +1355,14 @@ static int io_send_zc_import(struct io_kiocb *req, struct io_async_msghdr *kmsg)
if (unlikely(ret))
return ret;
kmsg->msg.sg_from_iter = io_sg_from_iter;
+ } else if (req->flags & REQ_F_GROUP_KBUF) {
+ struct io_sr_msg *sr = io_kiocb_to_cmd(req, struct io_sr_msg);
+
+ ret = io_import_group_kbuf(req, user_ptr_to_u64(sr->buf),
+ sr->len, ITER_SOURCE, &kmsg->msg.msg_iter);
+ if (unlikely(ret))
+ return ret;
+ kmsg->msg.sg_from_iter = io_sg_from_iter;
I haven't looked too deeply here, but I'm pretty sure it's buggy.
With zerocopy send, the buffer can only be reused once the
notification CQE completes, and nothing here accounts for that.
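To make the lifetime concern concrete, here is a small userspace C model.
It is only a sketch under stated assumptions: lent_buf, zc_send and all
helpers are hypothetical names, not io_uring symbols, and a simple counter
stands in for real reference tracking. The point it shows is that the
reference taken for a zerocopy send has to be dropped from the
notification path, not when the send request itself completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the lead's buffer; not an io_uring structure. */
struct lent_buf {
	int refs;       /* references currently held on the buffer */
	bool returned;  /* true once the buffer went back to the lead */
};

/* Hypothetical model of one zerocopy send that borrowed the buffer. */
struct zc_send {
	struct lent_buf *buf;
	bool notif_pending;  /* the network stack may still read the pages */
};

static void lent_buf_put(struct lent_buf *b)
{
	assert(b->refs > 0);
	if (--b->refs == 0)
		b->returned = true;
}

/* Issue: take a reference on behalf of the in-flight pages. */
static void zc_send_issue(struct zc_send *s, struct lent_buf *b)
{
	s->buf = b;
	b->refs++;
	s->notif_pending = true;
}

/* Request completion (first CQE): the reference must NOT be dropped yet. */
static void zc_send_complete(struct zc_send *s)
{
	(void)s;
}

/* Notification CQE: the stack is done with the pages, so drop the ref. */
static void zc_send_notify(struct zc_send *s)
{
	s->notif_pending = false;
	lent_buf_put(s->buf);
}

/* Returns true if the buffer stayed valid until the notification fired. */
static bool zc_lifetime_ok(void)
{
	struct lent_buf b = { .refs = 1, .returned = false };
	struct zc_send s;
	bool valid_before_notif;

	zc_send_issue(&s, &b);
	lent_buf_put(&b);        /* the lead's own reference goes away */
	zc_send_complete(&s);    /* send request completes */
	valid_before_notif = !b.returned;
	zc_send_notify(&s);      /* only now may the buffer be returned */
	return valid_before_notif && b.returned;
}
```

If the put were moved from zc_send_notify() into zc_send_complete(), the
buffer in this model would be returned while notif_pending is still true,
which is the situation described above.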
} else {
ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len, &kmsg->msg.msg_iter);
if (unlikely(ret))
--
Pavel Begunkov