On 11/5/2024 9:30 PM, Christoph Hellwig wrote:
> On Tue, Nov 05, 2024 at 09:21:27PM +0530, Kanchan Joshi wrote:
>> Can add the documentation (if this version is palatable for Jens/Pavel),
>> but this was discussed in previous iteration:
>>
>> 1. Each meta type may have different space requirement in SQE.
>>
>> Only for PI, we need so much space that we can't fit that in first SQE.
>> The SQE128 requirement is only for PI type.
>> Another different meta type may just fit into the first SQE. For that we
>> don't have to mandate SQE128.
>
> Ok, I'm really confused now. The way I understood Anuj was that this
> is NOT about block level metadata, but about other uses of the big SQE.
>
> Which version is right? Or did I just completely misunderstand Anuj?

We both mean the same. Currently read/write don't [need to] use the big
SQE, as all the information fits in the first SQE.
Down the line there may be users fighting for space in the SQE. The flag
(meta_type) may help a bit when that happens.

>> 2. If two meta types are known not to co-exist, they can be kept in the
>> same place within SQE. Since each meta-type is a flag, we can check what
>> combinations are valid within io_uring and throw the error in case of
>> incompatibility.
>
> And this sounds like what you refer to is not actually block metadata
> as in this patchset or nvme, (or weirdly enough integrity in the block
> layer code).

Right, not about block metadata/PI. But some extra information
(different in size/semantics etc.) that the user wants to pass in the
SQE along with read/write.

>> 3. Previous version was relying on SQE128 flag. If user set the ring
>> that way, it is assumed that PI information was sent.
>> This is more explicitly conveyed now - if user passed META_TYPE_PI flag,
>> it has sent the PI. This comment in the code:
>>
>> + /* if sqe->meta_type is META_TYPE_PI, last 32 bytes are for PI */
>> + union {
>>
>> If this flag is not passed, parsing of second SQE is skipped, which is
>> the current behavior as now also one can send regular (non pi)
>> read/write on SQE128 ring.
>
> And while I don't understand how this threads in with the previous
> statements, this makes sense. If you only want to send a pointer (+len)
> to metadata you can use the normal 64-byte SQE. If you want to send
> a PI tuple you need SEQ128. Is that what the various above statements
> try to express?

Not exactly. You are talking about pi-type 0 (which only requires meta
buffer/len) versus !0 pi-type. We thought about it, but decided to keep
asking for SQE128 regardless of that (pi 0 or non-zero). In both cases
the user will set meta-buffer/len, and other type-specific flags are
taken care of by the low-level code. This keeps things simple, and at
the io_uring level we don't have to distinguish that case.

What I rather meant in this statement was - one can set up a ring with
SQE128 today and send IORING_OP_READ/IORING_OP_WRITE. That goes through
without any processing/error, as the second half of the SQE128 is
skipped completely.
So relying only on the SQE128 flag to detect the presence of PI is a bit
fragile.
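
To make the last point concrete, here is a rough userspace sketch of the
check I have in mind. It is not the code from the patch. META_TYPE_PI is
the flag discussed above, but its value here is assumed; META_TYPE_FOO,
META_TYPE_BAR, struct demo_sqe and demo_check_meta() are hypothetical
names made up for illustration only. The sketch shows PI being parsed
only when the flag is set (never inferred from the ring being SQE128),
and two meta types that share the same SQE space being rejected when
passed together.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag name is from this thread; the bit value is an assumption. */
#define META_TYPE_PI   (1U << 0)
/* Hypothetical future meta types that share one location in the SQE. */
#define META_TYPE_FOO  (1U << 1)
#define META_TYPE_BAR  (1U << 2)
#define META_TYPE_ALL  (META_TYPE_PI | META_TYPE_FOO | META_TYPE_BAR)

/* Hypothetical, heavily simplified stand-in for an SQE. */
struct demo_sqe {
	uint16_t meta_type;	/* bitmask of META_TYPE_* flags */
};

/*
 * Decide whether the second half of a big SQE should be parsed as PI.
 * Returns 0 on success and -EINVAL for an invalid combination.
 */
static int demo_check_meta(const struct demo_sqe *sqe, bool ring_is_sqe128,
			   bool *parse_pi)
{
	assert(sqe != NULL && parse_pi != NULL);
	*parse_pi = false;

	/* Unknown flags are rejected up front. */
	if (sqe->meta_type & ~META_TYPE_ALL)
		return -EINVAL;

	/* FOO and BAR live in the same SQE bytes, so they cannot co-exist. */
	if ((sqe->meta_type & META_TYPE_FOO) && (sqe->meta_type & META_TYPE_BAR))
		return -EINVAL;

	if (sqe->meta_type & META_TYPE_PI) {
		/* PI does not fit in the first SQE, so it needs SQE128. */
		if (!ring_is_sqe128)
			return -EINVAL;
		*parse_pi = true;
	}

	/* No PI flag: the second SQE is skipped, even on an SQE128 ring. */
	return 0;
}
```

With this shape, a plain read/write on an SQE128 ring leaves parse_pi
false, which matches the current behavior, while META_TYPE_PI on a
64-byte ring fails early instead of being silently ignored.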