On 11/5/2024 9:53 PM, Keith Busch wrote: > On Tue, Nov 05, 2024 at 05:00:51PM +0100, Christoph Hellwig wrote: >> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote: >>> Can add the documentation (if this version is palatable for Jens/Pavel), >>> but this was discussed in previous iteration: >>> >>> 1. Each meta type may have different space requirement in SQE. >>> >>> Only for PI, we need so much space that we can't fit that in first SQE. >>> The SQE128 requirement is only for PI type. >>> Another different meta type may just fit into the first SQE. For that we >>> don't have to mandate SQE128. >> >> Ok, I'm really confused now. The way I understood Anuj was that this >> is NOT about block level metadata, but about other uses of the big SQE. >> >> Which version is right? Or did I just completely misunderstand Anuj? > > Let's not call this "meta_type". Can we use something that has a less > overloaded meaning, like "sqe_extended_capabilities", or "ecap", or > something like that. > Right, something like that. We need to change it. Seems a useful thing is not being seen that way because of its name. >>> 2. If two meta types are known not to co-exist, they can be kept in the >>> same place within SQE. Since each meta-type is a flag, we can check what >>> combinations are valid within io_uring and throw the error in case of >>> incompatibility. >> >> And this sounds like what you refer to is not actually block metadata >> as in this patchset or nvme, (or weirdly enough integrity in the block >> layer code). >> >>> 3. Previous version was relying on SQE128 flag. If user set the ring >>> that way, it is assumed that PI information was sent. >>> This is more explicitly conveyed now - if user passed META_TYPE_PI flag, >>> it has sent the PI. This comment in the code: >>> >>> + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */ >>> + union { >>> >>> If this flag is not passed, parsing of second SQE is skipped, which is >>> the current behavior as now also one can send regular (non pi) >>> read/write on SQE128 ring. >> >> And while I don't understand how this threads in with the previous >> statements, this makes sense. If you only want to send a pointer (+len) >> to metadata you can use the normal 64-byte SQE. If you want to send >> a PI tuple you need SEQ128. Is that what the various above statements >> try to express? If so the right API to me would be to have two flags: >> >> - a flag that a pointer to metadata is passed. This can work with >> a 64-bit SQE. >> - another flag that a PI tuple is passed. This requires a 128-byte >> and also the previous flag. > > I don't think anything done so far aligns with what Pavel had in mind. > Let me try to lay out what I think he's going for. Just bare with me, > this is just a hypothetical example. I have the same example in mind. > This patch adds a PI extension. > Later, let's say write streams needs another extenion. > Then key per-IO wants another extention. > Then someone else adds wizbang-awesome-feature extention. > > Let's say you have device that can do all 4, or any combination of them. > Pavel wants a solution that is future proof to such a scenario. So not > just a single new "meta_type" with its structure, but a list of types in > no particular order, and their structures. > > That list can exist either in the extended SQE, or in some other user > address that the kernel will need copy. That list is the meta_type bit-flags this series creates. For some future meta_type there can be "META_TYPE_XYZ_INDIRECT" flag and that will mean extra-information needs to fetched via copy_from_user.