On 6/19/20 3:41 AM, javier.gonz@xxxxxxxxxxx wrote:
> Jens,
>
> Would you have time to answer a question below in this thread?
>
> On 18.06.2020 11:11, javier.gonz@xxxxxxxxxxx wrote:
>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>> On 2020/06/18 17:35, javier.gonz@xxxxxxxxxxx wrote:
>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>> From: Selvakumar S <selvakuma.s1@xxxxxxxxxxx>
>>>>>>
>>>>>> Introduce three new opcodes for zone-append -
>>>>>>
>>>>>> IORING_OP_ZONE_APPEND : non-vectord, similiar to IORING_OP_WRITE
>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>
>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>
>>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@xxxxxxxxxxx>
>>>>>> Signed-off-by: Kanchan Joshi <joshi.k@xxxxxxxxxxx>
>>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@xxxxxxxxxxx>
>>>>>> Signed-off-by: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
>>>>>> ---
>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>> index 155f3d8..c14c873 100644
>>>>>> --- a/fs/io_uring.c
>>>>>> +++ b/fs/io_uring.c
>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>  unsigned long fsize;
>>>>>>  u64 user_data;
>>>>>>  u32 result;
>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>> + /* zone-relative offset for append, in bytes */
>>>>>> + u32 append_offset;
>>>>>
>>>>> this can overflow. u64 is needed.
>>>>
>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>> only has space for u32 when we reuse the flags.
>>>>
>>>> We can of course create a new cqe structure, but that will come with
>>>> larger changes to io_uring for supporting append.
>>>>
>>>> Do you believe this is a better approach?
>>>
>>> The problem is that zone size are 32 bits in the kernel, as a number
>>> of sectors. So any device that has a zone size smaller or equal to
>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>> bytes for returning zone append result is OK-ish, but to match the
>>> kernel supported range of possible zone size, you need 31+9 bits...
>>> 32 does not cut it.
>>
>> Agree. Our initial assumption was that u32 would cover current zone size
>> requirements, but if this is a no-go, we will take the longer path.
>
> Converting to u64 will require a new version of io_uring_cqe, where we
> extend at least 32 bits. I believe this will need a whole new allocation
> and probably ioctl().
>
> Is this an acceptable change for you? We will of course add support for
> liburing when we agree on the right way to do this.

If you need 64-bit of return value, then it's not going to work. Even
with the existing patches, reusing cqe->flags isn't going to fly, as
it would conflict with eg doing zone append writes with automatic
buffer selection.

We're not changing the io_uring_cqe. It's important to keep it lean, and
any other request type is generally fine with 64-bit tag + 32-bit result
(and 32-bit flags on the side) for completions.

Only viable alternative I see would be to provide an area to store this
information, and pass in a pointer to this at submission time through
the sqe. One issue I do see with that is if we only have this
information available at completion time, then we'd need some async punt
to copy it to user space... Generally not ideal.

A hackier approach would be to play some tricks with cqe->res and
cqe->flags, setting aside a flag to denote an extension of cqe->res.
That would mean excluding zone append (etc) from using buffer selection,
which probably isn't a huge deal. It'd be more problematic for any other
future flags. But if you just need 40 bits, then it could certainly
work.
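
As a quick check of the "31+9 bits" arithmetic quoted above, here is a
minimal sketch in plain userspace C. It is not taken from the patch; the
helper names (zone_max_byte_offset, bits_needed) are made up for
illustration. It computes the largest zone-relative byte offset for a
zone given in 512B sectors and counts how many bits that offset needs.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9 /* 512-byte sectors, as in the thread */

/*
 * Hypothetical helper: largest byte offset inside a zone whose size is
 * given as a number of 512B sectors.
 */
static uint64_t zone_max_byte_offset(uint32_t zone_sectors)
{
	assert(zone_sectors > 0);
	/* Widen before shifting so the product cannot wrap in 32 bits. */
	return ((uint64_t)zone_sectors << SECTOR_SHIFT) - 1;
}

/* Hypothetical helper: number of bits needed to represent v (0 for v == 0). */
static unsigned int bits_needed(uint64_t v)
{
	unsigned int n = 0;

	while (v) {
		n++;
		v >>= 1;
	}
	return n;
}
```

With the largest zone size mentioned in the thread (2^31 sectors), the
zone spans 2^40 bytes, so the largest in-zone offset needs 40 bits and
cannot be returned in a u32. A 256 MiB zone (2^19 sectors) would still
fit comfortably.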
Right now, if cqe->flags & 1 is set, then (cqe->flags >> 16) is the buffer
ID. You could define IORING_CQE_F_ZONE_FOO to be bit 1, so that:

	uint64_t val = cqe->res; // assuming non-error here

	if (cqe->flags & IORING_CQE_F_ZONE_FOO)
		val |= (uint64_t) (cqe->flags >> 16) << 32;

and hence use the upper 16 bits of cqe->flags for the upper bits of your
(then) 48-bit total value.

-- 
Jens Axboe
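
For anyone who wants to try the cqe->res / cqe->flags split described
above, here is a standalone round-trip sketch. It runs in plain
userspace and does not touch io_uring. struct cqe_pair, pack_offset()
and unpack_offset() are hypothetical names used only for illustration,
and IORING_CQE_F_ZONE_FOO is the placeholder flag from the mail, not an
existing io_uring definition. One assumption beyond the mail: the low 32
bits are read back through a uint32_t cast, so an offset with bit 31 set
is not sign-extended.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder flag from the mail (bit 1); not a real io_uring flag. */
#define IORING_CQE_F_ZONE_FOO (1U << 1)

/* Stand-in for the res/flags fields of struct io_uring_cqe. */
struct cqe_pair {
	int32_t res;
	uint32_t flags;
};

/* Split a value of up to 48 bits across res (low 32) and flags (upper 16). */
static struct cqe_pair pack_offset(uint64_t off)
{
	struct cqe_pair c;

	assert(off < (1ULL << 48));
	/*
	 * The conversion to int32_t wraps for values with bit 31 set; that
	 * is implementation-defined in C, and GCC and Clang both wrap.
	 */
	c.res = (int32_t)(uint32_t)(off & 0xffffffffULL);
	c.flags = IORING_CQE_F_ZONE_FOO | ((uint32_t)(off >> 32) << 16);
	return c;
}

/* Rebuild the value; error handling (negative res) is out of scope here. */
static uint64_t unpack_offset(struct cqe_pair c)
{
	uint64_t val = (uint32_t)c.res;

	if (c.flags & IORING_CQE_F_ZONE_FOO)
		val |= (uint64_t)(c.flags >> 16) << 32; /* widen before shifting */
	return val;
}
```

Packing the 40-bit maximum offset, (1ULL << 40) - 1, leaves 0xff in the
upper 16 bits of flags, and unpacking returns the original value. Note
that a real completion would still need some way to tell a negative
error code in res apart from an offset whose bit 31 is set; the sketch
sidesteps that by treating res as raw bits.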