Re: [PATCH 3/3] io_uring: add support for zone-append

On 19.06.2020 09:44, Jens Axboe wrote:
On 6/19/20 9:40 AM, Matias Bjørling wrote:
On 19/06/2020 17.20, Jens Axboe wrote:
On 6/19/20 9:14 AM, Matias Bjørling wrote:
On 19/06/2020 16.18, Jens Axboe wrote:
On 6/19/20 5:15 AM, Matias Bjørling wrote:
On 19/06/2020 11.41, javier.gonz@xxxxxxxxxxx wrote:
Jens,

Would you have time to answer a question below in this thread?

On 18.06.2020 11:11, javier.gonz@xxxxxxxxxxx wrote:
On 18.06.2020 08:47, Damien Le Moal wrote:
On 2020/06/18 17:35, javier.gonz@xxxxxxxxxxx wrote:
On 18.06.2020 07:39, Damien Le Moal wrote:
On 2020/06/18 2:27, Kanchan Joshi wrote:
From: Selvakumar S <selvakuma.s1@xxxxxxxxxxx>

Introduce three new opcodes for zone-append -

    IORING_OP_ZONE_APPEND       : non-vectored, similar to IORING_OP_WRITE
    IORING_OP_ZONE_APPENDV      : vectored, similar to IORING_OP_WRITEV
    IORING_OP_ZONE_APPEND_FIXED : append using fixed buffers

Repurpose cqe->flags to return zone-relative offset.

Signed-off-by: SelvaKumar S <selvakuma.s1@xxxxxxxxxxx>
Signed-off-by: Kanchan Joshi <joshi.k@xxxxxxxxxxx>
Signed-off-by: Nitesh Shetty <nj.shetty@xxxxxxxxxxx>
Signed-off-by: Javier Gonzalez <javier.gonz@xxxxxxxxxxx>
---
fs/io_uring.c                 | 72 +++++++++++++++++++++++++++++++++++++++++--
include/uapi/linux/io_uring.h |  8 ++++-
2 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 155f3d8..c14c873 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -649,6 +649,10 @@ struct io_kiocb {
      unsigned long        fsize;
      u64            user_data;
      u32            result;
+#ifdef CONFIG_BLK_DEV_ZONED
+    /* zone-relative offset for append, in bytes */
+    u32            append_offset;
This can overflow; u64 is needed.
We chose to do it this way to start with because struct io_uring_cqe
only has space for u32 when we reuse the flags.

We can of course create a new cqe structure, but that will come with
larger changes to io_uring for supporting append.

Do you believe this is a better approach?
The problem is that the zone size is 32 bits in the kernel, as a number of
sectors, so any device with a zone size smaller than or equal to 2^31 512B
sectors can be accepted. Using a zone-relative offset in bytes for returning
the zone append result is OK-ish, but to match the kernel-supported range of
possible zone sizes, you need 31+9 bits... 32 does not cut it.
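
To make the arithmetic concrete (an illustrative sketch only, not code from
the patch): with zone sizes of up to 2^31 sectors of 512 bytes, zone-relative
byte offsets range up to 2^40 - 1, i.e. 40 bits, which a u32 cannot hold:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* Largest zone size the kernel can represent: 2^31 sectors of 512B. */
        uint64_t max_zone_sectors = 1ULL << 31;
        uint64_t max_zone_bytes   = max_zone_sectors << 9;  /* 2^40 bytes */
        uint64_t max_append_off   = max_zone_bytes - 1;     /* needs 40 bits */

        printf("max zone-relative offset: %llu (0x%llx)\n",
               (unsigned long long)max_append_off,
               (unsigned long long)max_append_off);
        /* (uint32_t)max_append_off would silently drop the top 8 bits. */
        return 0;
}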
Agree. Our initial assumption was that u32 would cover current zone size
requirements, but if this is a no-go, we will take the longer path.
Converting to u64 will require a new version of io_uring_cqe, where we
extend at least 32 bits. I believe this will need a whole new allocation
and probably ioctl().

Is this an acceptable change for you? We will of course add support for
liburing when we agree on the right way to do this.
I took a quick look at the code. No expert, but why not use the existing
user_data variable? Use the lowest 40 bits for the zone starting LBA, and the
highest 24 bits as an index into the completion data structure?

If you want to pass the memory address (same as what fio does) for the
data structure used for completion, one may also play some tricks by
using a relative memory address to the data structure. For example, the
x86_64 architecture uses 48 address bits for its memory addresses. With
24 bits, one can allocate the completion entries in a 32MB memory range,
and then use base_address + index to get back to the completion data
structure specified in the sqe.
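
For illustration only, a rough sketch of the packing described above, using
the bit widths from this mail (none of these helpers exist in the patch or in
io_uring; the names are made up):

#include <stdint.h>

#define ZONE_SLBA_BITS  40      /* low bits: zone starting LBA */
#define COMPL_IDX_BITS  24      /* high bits: index into completion structs */

static inline uint64_t pack_user_data(uint64_t zone_slba, uint32_t compl_idx)
{
        return (zone_slba & ((1ULL << ZONE_SLBA_BITS) - 1)) |
               ((uint64_t)compl_idx << ZONE_SLBA_BITS);
}

static inline uint64_t unpack_zone_slba(uint64_t user_data)
{
        return user_data & ((1ULL << ZONE_SLBA_BITS) - 1);
}

static inline uint32_t unpack_compl_idx(uint64_t user_data)
{
        return (uint32_t)(user_data >> ZONE_SLBA_BITS);
}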
For any current request, sqe->user_data is just provided back as
cqe->user_data. This would make these requests behave differently
from everything else in that sense, which seems very confusing to me
if I were an application writer.

But generally I do agree with you, there are lots of ways to make
< 64-bit work as a tag without losing anything or having to jump
through hoops to do so. The lack of consistency introduced by having
zone append work differently is ugly, though.

Yep, agree, and extending to three cachelines is a big no-go. We could add
a flag that says the kernel has changed the user_data variable. That'll
make it very explicit.
Don't like that either, as it doesn't really change the fact that you're
now doing something very different with the user_data field, which is
just supposed to be passed in/out directly. Adding a random flag to
signal this behavior isn't very explicit either, imho. It's still some
out-of-band (ish) notification of behavior that is different from any
other command. This is very different from having a flag that says
"there's extra information in this other field", which is much cleaner.

Ok. Then it's pulling in the bits from cqe->res and cqe->flags that you
mention in the other mail. Sounds good.

I think that's the best approach, if we need > 32-bits. Maybe we can get
by just using ->res, if we switch to multiples of 512b instead for the
result like Pavel suggested. That'd provide enough room in ->res, and
would be preferable imho. But if we do need > 32-bits, then we can use
this approach.
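
As a sketch of that idea (illustrative only; the helper names are made up and
not part of io_uring): reporting the zone-relative offset in 512B units lets a
32-bit ->res cover the full 40-bit byte range, since 2^32 units correspond to
2^41 bytes. The returned offset is a multiple of the 512B sector size, so the
low 9 bits carry no information:

#include <stdint.h>

/* Kernel side: encode a 512B-aligned zone-relative byte offset in 32 bits. */
static inline uint32_t append_off_to_res(uint64_t offset_bytes)
{
        return (uint32_t)(offset_bytes >> 9);
}

/* Application side: recover the byte offset from the 512B-unit result. */
static inline uint64_t res_to_append_off(uint32_t res)
{
        return (uint64_t)res << 9;
}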

Sounds good.

Thanks Matias too for chipping in with more ideas. We have enough for a
v2.

Javier


