On 19/06/2020 17:15, Jens Axboe wrote:
> On 6/19/20 3:41 AM, javier.gonz@xxxxxxxxxxx wrote:
>> Jens,
>>
>> Would you have time to answer a question below in this thread?
>>
>> On 18.06.2020 11:11, javier.gonz@xxxxxxxxxxx wrote:
>>> On 18.06.2020 08:47, Damien Le Moal wrote:
>>>> On 2020/06/18 17:35, javier.gonz@xxxxxxxxxxx wrote:
>>>>> On 18.06.2020 07:39, Damien Le Moal wrote:
>>>>>> On 2020/06/18 2:27, Kanchan Joshi wrote:
>>>>>>> From: Selvakumar S <selvakuma.s1@xxxxxxxxxxx>
>>>>>>>
>>>>>>> Introduce three new opcodes for zone-append -
>>>>>>>
>>>>>>> IORING_OP_ZONE_APPEND : non-vectored, similar to IORING_OP_WRITE
>>>>>>> IORING_OP_ZONE_APPENDV : vectored, similar to IORING_OP_WRITEV
>>>>>>> IORING_OP_ZONE_APPEND_FIXED : append using fixed-buffers
>>>>>>>
>>>>>>> Repurpose cqe->flags to return zone-relative offset.
>>>>>>>
>>>>>>> Signed-off-by: SelvaKumar S <selvakuma.s1@xxxxxxxxxxx>
>>>>>>> Signed-off-by: Kanchan Joshi <joshi.k@xxxxxxxxxxx>
>>>>>>> Signed-off-by: Nitesh Shetty <nj.shetty@xxxxxxxxxxx>
>>>>>>> Signed-off-by: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
>>>>>>> ---
>>>>>>> fs/io_uring.c | 72 +++++++++++++++++++++++++++++++++++++++++--
>>>>>>> include/uapi/linux/io_uring.h | 8 ++++-
>>>>>>> 2 files changed, 77 insertions(+), 3 deletions(-)
>>>>>>>
>>>>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>>>>> index 155f3d8..c14c873 100644
>>>>>>> --- a/fs/io_uring.c
>>>>>>> +++ b/fs/io_uring.c
>>>>>>> @@ -649,6 +649,10 @@ struct io_kiocb {
>>>>>>>      unsigned long fsize;
>>>>>>>      u64 user_data;
>>>>>>>      u32 result;
>>>>>>> +#ifdef CONFIG_BLK_DEV_ZONED
>>>>>>> +    /* zone-relative offset for append, in bytes */
>>>>>>> +    u32 append_offset;
>>>>>>
>>>>>> this can overflow. u64 is needed.
>>>>>
>>>>> We chose to do it this way to start with because struct io_uring_cqe
>>>>> only has space for u32 when we reuse the flags.
>>>>>
>>>>> We can of course create a new cqe structure, but that will come with
>>>>> larger changes to io_uring for supporting append.
>>>>>
>>>>> Do you believe this is a better approach?
>>>>
>>>> The problem is that zone sizes are 32 bits in the kernel, as a number
>>>> of sectors. So any device that has a zone size smaller than or equal to
>>>> 2^31 512B sectors can be accepted. Using a zone relative offset in
>>>> bytes for returning zone append result is OK-ish, but to match the
>>>> kernel supported range of possible zone size, you need 31+9 bits...
>>>> 32 does not cut it.
>>>
>>> Agree. Our initial assumption was that u32 would cover current zone size
>>> requirements, but if this is a no-go, we will take the longer path.
>>
>> Converting to u64 will require a new version of io_uring_cqe, where we
>> extend at least 32 bits. I believe this will need a whole new allocation
>> and probably ioctl().
>>
>> Is this an acceptable change for you? We will of course add support for
>> liburing when we agree on the right way to do this.
>
> If you need 64-bit of return value, then it's not going to work. Even
> with the existing patches, reusing cqe->flags isn't going to fly, as
> it would conflict with eg doing zone append writes with automatic
> buffer selection.

Buffer selection is for reads/recv kind of requests, but appends
are writes. In theory they can co-exist using cqe->flags.

>
> We're not changing the io_uring_cqe. It's important to keep it lean, and
> any other request type is generally fine with 64-bit tag + 32-bit result
> (and 32-bit flags on the side) for completions.
>
> Only viable alternative I see would be to provide an area to store this
> information, and pass in a pointer to this at submission time through
> the sqe. One issue I do see with that is if we only have this
> information available at completion time, then we'd need some async punt
> to copy it to user space... Generally not ideal.
>
> A hackier approach would be to play some tricks with cqe->res and
> cqe->flags, setting aside a flag to denote an extension of cqe->res.
> That would mean excluding zone append (etc) from using buffer selection,
> which probably isn't a huge deal. It'd be more problematic for any other
> future flags. But if you just need 40 bits, then it could certainly
> work. Right now, if cqe->flags & 1 is set, then (cqe->flags >> 16) is
> the buffer ID. You could define IORING_CQE_F_ZONE_FOO to be bit 1, so
> that:
>
> uint64_t val = cqe->res; // assuming non-error here
>
> if (cqe->flags & IORING_CQE_F_ZONE_FOO)
>         val |= (uint64_t)(cqe->flags >> 16) << 32;
>
> and hence use the upper 16 bits of cqe->flags for the upper bits of your
> (then) 48-bit total value.

How about returning the offset in terms of 512-byte chunks? NVMe is 512B
atomic/aligned. We'll lose the ability to do non-512 aligned appends, but it
won't hit media as such anyway (will be padded or cached), so personally I
don't see much benefit in having it.

-- 
Pavel Begunkov
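
To make the two encodings discussed above concrete, here is a rough userspace sketch of both. It is illustrative only, not code from the patch: `struct demo_cqe`, the `DEMO_*` macros and the `demo_*` helpers are hypothetical names, and the struct only mirrors the 64-bit tag + 32-bit result + 32-bit flags layout mentioned in the thread. The first pair of helpers splits a byte offset across res and the upper 16 bits of flags (48 bits in total). The last helper assumes the offset is reported in 512-byte units instead, in which case a zone of at most 2^31 sectors never needs more than 31 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a completion entry: 64-bit tag,
 * 32-bit result, 32-bit flags, as described in the thread. */
struct demo_cqe {
	uint64_t user_data;
	int32_t  res;
	uint32_t flags;
};

#define DEMO_CQE_F_BUFFER   (1U << 0) /* bit 0: upper 16 flag bits hold a buffer ID */
#define DEMO_CQE_F_ZONE_FOO (1U << 1) /* bit 1: placeholder flag name from the thread */
#define DEMO_CQE_HI_SHIFT   16
#define DEMO_SECTOR_SHIFT   9         /* 512-byte sectors */

/* Split a zone-relative byte offset (below 2^48) into res (low 32 bits)
 * and the upper 16 bits of flags (high 16 bits). */
static void demo_pack_offset(struct demo_cqe *cqe, uint64_t off)
{
	assert(off < (1ULL << 48));
	/* Going through uint32_t keeps the low half bit-for-bit; the final
	 * conversion to int32_t wraps on the usual two's-complement targets. */
	cqe->res = (int32_t)(uint32_t)off;
	cqe->flags = DEMO_CQE_F_ZONE_FOO |
		     ((uint32_t)(off >> 32) << DEMO_CQE_HI_SHIFT);
}

/* Reassemble the offset; assumes the completion is not an error. */
static uint64_t demo_unpack_offset(const struct demo_cqe *cqe)
{
	/* Cast to uint32_t first so a low half with bit 31 set is not
	 * sign-extended into the upper bits. */
	uint64_t val = (uint32_t)cqe->res;

	if (cqe->flags & DEMO_CQE_F_ZONE_FOO) {
		/* Widen to 64 bits before shifting: shifting a 32-bit
		 * value by 32 is undefined in C. */
		val |= (uint64_t)(cqe->flags >> DEMO_CQE_HI_SHIFT) << 32;
	}
	return val;
}

/* Alternative: the offset is reported as a count of 512-byte sectors,
 * so a plain 32-bit field is enough and flags stays untouched. */
static uint64_t demo_sectors_to_bytes(uint32_t sectors)
{
	return (uint64_t)sectors << DEMO_SECTOR_SHIFT;
}
```

With the largest zone size quoted above (2^31 sectors, i.e. 2^40 bytes), the last valid 512-byte-aligned offset is 2^40 - 512 bytes. The packed form needs 8 of the 16 spare flag bits for that value, while the sector form expresses it as 2^31 - 1, which still fits in a non-negative 32-bit result.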