Re: [PATCH V6 6/8] io_uring: support providing sqe group buffer

Pavel Begunkov <asml.silence@xxxxxxxxx> · Wed, 9 Oct 2024 15:25:25 +0100

On 10/6/24 09:20, Ming Lei wrote:
On Fri, Oct 04, 2024 at 04:32:04PM +0100, Pavel Begunkov wrote:
On 9/12/24 11:49, Ming Lei wrote:
...
It can help to implement generic zero copy between device and related
operations, such as ublk, fuse, vdpa,
even network receive or whatever.

That's exactly the thing it can't sanely work with because
of this design.

The provide buffer design is absolutely generic, and basically

- group leader provides buffer for member OPs, and member OPs can borrow
the buffer if leader allows by calling io_provide_group_kbuf()

- after member OPs consumes the buffer, the buffer is returned back by
the callback implemented in group leader subsystem, so group leader can
release related sources;

- and it is guaranteed that the buffer can be released always

The ublk implementation is pretty simple, it can be reused in device driver
to share buffer with other kernel subsystems.

I don't see anything insane with the design.

There is nothing insane with the design, but the problem is cross
request error handling, same thing that makes links a pain to use.
It's good that with storage reads are reasonably idempotent and you
can be retried if needed. With sockets and streams, however, you
can't sanely borrow a buffer without consuming it, so if a member
request processing the buffer fails for any reason, the user data
will be dropped on the floor. I mentioned quite a while before,
if for example you stash the buffer somewhere you can access
across syscalls like the io_uring's registered buffer table, the
user at least would be able to find an error and then memcpy the
unprocessed data as a fallback.

Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
---
   include/linux/io_uring_types.h | 33 +++++++++++++++++++
   io_uring/io_uring.c            | 10 +++++-
   io_uring/io_uring.h            | 10 ++++++
   io_uring/kbuf.c                | 60 ++++++++++++++++++++++++++++++++++
   io_uring/kbuf.h                | 13 ++++++++
   io_uring/net.c                 | 23 ++++++++++++-
   io_uring/opdef.c               |  4 +++
   io_uring/opdef.h               |  2 ++
   io_uring/rw.c                  | 20 +++++++++++-
   9 files changed, 172 insertions(+), 3 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 793d5a26d9b8..445e5507565a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
...

And I don't think you need opdef::accept_group_kbuf since the
request handler should already know that and, importantly, you don't
imbue any semantics based on it.

Yeah, and it just follows logic of buffer_select. I guess def->buffer_select
may be removed too?

->buffer_select helps to fail IOSQE_BUFFER_SELECT for requests
that don't support it, so we don't need to add the check every
time we add a new request opcode.

In your case requests just ignore ->accept_group_kbuf /
REQ_F_GROUP_KBUF if they don't expect to use the buffer, so
it's different in several aspects.

fwiw, I don't mind ->accept_group_kbuf, I just don't see
what purpose it serves. Would be nice to have a sturdier uAPI,
where the user sets a flag to each member that want to use
these provided buffers and then the kernel checks leader vs
that flag and fails misconfigurations, but likely we don't
have flags / sqe space for it.


FWIW, would be nice if during init figure we can verify that the leader
provides a buffer IFF there is someone consuming it, but I don't think

It isn't doable, same reason with IORING_OP_PROVIDE_BUFFERS, since buffer can
only be provided in ->issue().

In theory we could, in practise it'd be too much of a pain, I agree.

IORING_OP_PROVIDE_BUFFERS is different as you just stash the buffer
in the io_uring instance, and it's used at an unspecified time later
by some request. In this sense the API is explicit, requests that don't
support it but marked with IOSQE_BUFFER_SELECT will be failed by the
kernel.

the semantics is flexible enough to do it sanely. i.e. there are many
members in a group, some might want to use the buffer and some might not.

...
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index df2be7353414..8e111d24c02d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -349,6 +349,16 @@ static inline bool req_is_group_leader(struct io_kiocb *req)
   	return req->flags & REQ_F_SQE_GROUP_LEADER;
   }
...
+int io_import_group_kbuf(struct io_kiocb *req, unsigned long buf_off,
+		unsigned int len, int dir, struct iov_iter *iter)
+{
+	struct io_kiocb *lead = req->grp_link;
+	const struct io_uring_kernel_buf *kbuf;
+	unsigned long offset;
+
+	WARN_ON_ONCE(!(req->flags & REQ_F_GROUP_KBUF));
+
+	if (!req_is_group_member(req))
+		return -EINVAL;
+
+	if (!lead || !req_support_group_dep(lead) || !lead->grp_kbuf)
+		return -EINVAL;
+
+	/* req->fused_cmd_kbuf is immutable */
+	kbuf = lead->grp_kbuf;
+	offset = kbuf->offset;
+
+	if (!kbuf->bvec)
+		return -EINVAL;

How can this happen?

OK, we can run the check in uring_cmd API.

Not sure I follow, if a request providing a buffer can't set
a bvec it should just fail, without exposing half made
io_uring_kernel_buf to other requests.

Is it rather a WARN_ON_ONCE check?


diff --git a/io_uring/net.c b/io_uring/net.c
index f10f5a22d66a..ad24dd5924d2 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -89,6 +89,13 @@ struct io_sr_msg {
    */
   #define MULTISHOT_MAX_RETRY	32
+#define user_ptr_to_u64(x) (		\
+{					\
+	typecheck(void __user *, (x));	\
+	(u64)(unsigned long)(x);	\
+}					\
+)
+
   int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
   {
   	struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown);
@@ -375,7 +382,7 @@ static int io_send_setup(struct io_kiocb *req)
   		kmsg->msg.msg_name = &kmsg->addr;
   		kmsg->msg.msg_namelen = sr->addr_len;
   	}
-	if (!io_do_buffer_select(req)) {
+	if (!io_do_buffer_select(req) && !(req->flags & REQ_F_GROUP_KBUF)) {
   		ret = import_ubuf(ITER_SOURCE, sr->buf, sr->len,
   				  &kmsg->msg.msg_iter);
   		if (unlikely(ret < 0))
@@ -593,6 +600,15 @@ int io_send(struct io_kiocb *req, unsigned int issue_flags)
   	if (issue_flags & IO_URING_F_NONBLOCK)
   		flags |= MSG_DONTWAIT;
+	if (req->flags & REQ_F_GROUP_KBUF) {

Does anything prevent the request to be marked by both
GROUP_KBUF and BUFFER_SELECT? In which case we'd set up
a group kbuf and then go to the io_do_buffer_select()
overriding all of that

It could be used in this way, and we can fail the member in
io_queue_group_members().

That's where the opdef flag could actually be useful,

if (opdef[member]->accept_group_kbuf &&
    member->flags & SELECT_BUF)
	fail;


+		ret = io_import_group_kbuf(req,
+					user_ptr_to_u64(sr->buf),
+					sr->len, ITER_SOURCE,
+					&kmsg->msg.msg_iter);
+		if (unlikely(ret))
+			return ret;
+	}
+
   retry_bundle:
   	if (io_do_buffer_select(req)) {
   		struct buf_sel_arg arg = {
@@ -1154,6 +1170,11 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags)
   			goto out_free;
   		}
   		sr->buf = NULL;
+	} else if (req->flags & REQ_F_GROUP_KBUF) {

What happens if we get a short read/recv?

For short read/recv, any progress is stored in iterator, nothing to do
with the provide buffer, which is immutable.

One problem for read is reissue, but it can be handled by saving iter
state after the group buffer is imported, I will fix it in next version.
For net recv, offset/len of buffer is updated in case of short recv, so
it works as expected.

That was one of my worries.

Or any other issue with short read/recv? Can you explain in detail?

To sum up design wise, when members that are using the buffer as a
source, e.g. write/send, fail, the user is expected to usually reissue
both the write and the ublk cmd.

Let's say you have a ublk leader command providing a 4K buffer, and
you group it with a 4K send using the buffer. Let's assume the send
is short and does't only 2K of data. Then the user would normally
reissue:

ublk(4K, GROUP), send(off=2K)

That's fine assuming short IO is rare.

I worry more about the backward flow, ublk provides an "empty" buffer
to receive/read into. ublk wants to do something with the buffer in
the callback. What happens when read/recv is short (and cannot be
retried by io_uring)?

1. ublk(provide empty 4K buffer)
2. recv, ret=2K
3. ->grp_kbuf_ack: ublk should commit back only 2K
   of data and not assume it's 4K

Another option is to fail ->grp_kbuf_ack if any member fails, but
the API might be a bit too awkward and inconvenient .

--
Pavel Begunkov