Re: [PATCH v6 06/10] io_uring/rw: add support to send metadata along with read/write

Pavel Begunkov <asml.silence@xxxxxxxxx> · Thu, 7 Nov 2024 17:23:29 +0000

On 11/1/24 17:54, Kanchan Joshi wrote:
On 10/31/2024 8:09 PM, Pavel Begunkov wrote:
On 10/30/24 21:09, Keith Busch wrote:
On Wed, Oct 30, 2024 at 11:31:08PM +0530, Kanchan Joshi wrote:

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/
io_uring.h
index 024745283783..48dcca125db3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -105,6 +105,22 @@ struct io_uring_sqe {
            */
           __u8    cmd[0];
       };
+    /*
+     * If the ring is initialized with IORING_SETUP_SQE128, then
+     * this field is starting offset for 64 bytes of data. For meta io
+     * this contains 'struct io_uring_meta_pi'
+     */
+    __u8    big_sqe[0];
+};

I don't think zero sized arrays are good as a uapi regardless of
cmd[0] above, let's just do

sqe = get_sqe();
big_sqe = (void *)(sqe + 1)

with an appropriate helper.

In one of the internal version I did just that (i.e., sqe + 1), and
that's fine for kernel.
But afterwards added big_sqe so that userspace can directly access
access second-half of SQE_128. We have the similar big_cqe[] within
io_uring_cqe too.

Is this still an eyesore?

Yes, let's kill it as well please, and I don't think the feature
really cares about it, so should be easy to do if not already in
later revisions.

+
+/* this is placed in SQE128 */
+struct io_uring_meta_pi {
+    __u16        pi_flags;
+    __u16        app_tag;
+    __u32        len;
+    __u64        addr;
+    __u64        seed;
+    __u64        rsvd[2];
   };

On the previous version, I was more questioning if it aligns with what

I missed that discussion, let me know if I need to look it up

Yes, please take a look at previous iteration (v5):
https://lore.kernel.org/io-uring/e7aae741-c139-48d1-bb22-dbcd69aa2f73@xxxxxxxxxxx/

"But in general, this is about seeing metadata as a generic term to
encode extra information into io_uring SQE."

Yep, that's the idea, and it also sounds to me that stream hints
is one potential user as well. To summarise, the end goal is to be
able to add more meta types/attributes in the future, which can be
file specific, e.g. pipes don't care about integrity data, and to
be able to pass an arbitrary number of such attributes to a single
request.

We don't need to implement it here, but the uapi needs to be flexible
enough to be able to accommodate that, or we should have an
understanding how it can be extended without dirty hacks.

Also the corresponding code, since my other answers will use that.

Pavel was trying to do here. I didn't quite get it, so I was more
confused than saying it should be this way now.

The point is, SQEs don't have nearly enough space to accommodate all
such optional features, especially when it's taking so much space and
not applicable to all reads but rather some specific  use cases and
files. Consider that there might be more similar extensions and we might
even want to use them together.

1. SQE128 makes it big for all requests, intermixing with requests that
don't need additional space wastes space. SQE128 is fine to use but at
the same time we should be mindful about it and try to avoid enabling it
if feasible.

Right. And initial versions of this series did not use SQE128. But as we
moved towards passing more comprehensive PI information, first SQE was
not enough. And we thought to make use of SQE128 rather than taking
copy_from_user cost.

Do we have any data how expensive it is? I don't think I've ever
tried to profile it. And where the overhead comes from? speculation
prevention?

If it's indeed costly, we can add sth to io_uring like pre-mapping
memory to optimise it, which would be useful in other places as
well.
 
  > 2. This API hard codes io_uring_meta_pi into the extended part of the
SQE. If we want to add another feature it'd need to go after the meta
struct. SQE256?

Not necessarily. It depends on how much extra space it needs for another
feature. To keep free space in first SQE, I chose to place PI in the
second one. Anyone requiring 20b (in v6) or 18b (in v5) space, does not
even have to ask for SQE128.
For more, they can use leftover space in second SQE (about half of
second sqe will still be free). In v5, they have entire second SQE if
they don't want to use PI.
If contiguity is a concern, we can move all PI bytes (about 32b) to the
end of second SQE.


  > And what if the user doesn't need PI but only the second
feature?

Not this version, but v5 exposed meta_type as bit flags.

There has to be a type, I assume it's being added back.

And with that, user will not pass the PI flag and that enables to use
all the PI bytes for something else. We will have union of PI with some
other info that is known not to co-exist.

Let's say we have 3 different attributes META_TYPE{1,2,3}.

How are they placed in an SQE?

meta1 = (void *)get_big_sqe(sqe);
meta2 = meta1 + sizeof(?); // sizeof(struct meta1_struct)
meta3 = meta2 + sizeof(struct meta2_struct);

Structures are likely not fixed size (?). At least the PI looks large
enough to force everyone to be just aliased to it.

And can the user pass first meta2 in the sqe and then meta1?

meta2 = (void *)get_big_sqe(sqe);
meta1 = meta2 + sizeof(?); // sizeof(struct meta2_struct)

If yes, how parsing should look like? Does the kernel need to read each
chunk's type and look up its size to iterate to the next one?

If no, what happens if we want to pass meta2 and meta3, do they start
from the big_sqe?

How do we pass how many of such attributes is there for the request?

It should support arbitrary number of attributes in the long run, which
we can't pass in an SQE, bumping the SQE size is not scalable in
general, so it'd need to support user pointers or sth similar at some
point. Placing them in an SQE can serve as an optimisation, and a first
step, though it might be easier to start with user pointer instead.

Also, when we eventually come to user pointers, we want it to be
performant as well and e.g. get by just one copy_from_user, and the
api/struct layouts would need to be able to support it. And once it's
copied we'll want it to be handled uniformly with the SQE variant, that
requires a common format. For different formats there will be a question
of perfomance, maintainability, duplicating kernel and userspace code.

All that doesn't need to be implemented, but we need a clear direction
for the API. Maybe we can get a simplified user space pseudo code
showing how the end API is supposed to look like?

--
Pavel Begunkov