On 26.06.2020 06:56, Damien Le Moal wrote:
On 2020/06/26 15:37, javier.gonz@xxxxxxxxxxx wrote:
On 26.06.2020 03:11, Damien Le Moal wrote:
On 2020/06/26 2:18, Kanchan Joshi wrote:
[Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
This patchset enables zone-append using io-uring/linux-aio on the block IO path.
The purpose is to provide zone-append capability to applications that use a
zoned block device directly.
The application may specify the RWF_ZONE_APPEND flag with a write when it wants to
send a zone append. RWF_* flags work with a certain subset of APIs, e.g. uring,
aio, and pwritev2. An error is reported if zone append is requested using
pwritev2. It is not in the scope of this patchset to support pwritev2 or any
other sync write API, for reasons described later.
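For illustration, submission through liburing could look roughly like the sketch
below. liburing and the helper name are assumptions for the example, not something
the patchset requires, and the flag value should be taken from the applied patches.

/*
 * Minimal sketch: queue a zone append through io_uring. fd is an open zoned
 * block device, zone_start is the byte offset of the target zone.
 */
#include <liburing.h>

#ifndef RWF_ZONE_APPEND
#define RWF_ZONE_APPEND 0x00000020      /* placeholder; use the value from the patches */
#endif

static int submit_zone_append(struct io_uring *ring, int fd, const void *buf,
                              unsigned int len, __u64 zone_start)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -1;

        /* The offset selects the zone; the device decides the write location */
        io_uring_prep_write(sqe, fd, buf, len, zone_start);
        sqe->rw_flags = RWF_ZONE_APPEND;

        return io_uring_submit(ring);
}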
Zone-append completion result --->
With zone append, the location where the write took place can only be known after
completion. So, apart from the usual return value of the write, an additional means
is needed to obtain the actual written location.
In aio, this is returned to the application using the res2 field of io_event:
struct io_event {
        __u64   data;   /* the data field from the iocb */
        __u64   obj;    /* what iocb this event came from */
        __s64   res;    /* result code for this event */
        __s64   res2;   /* secondary result */
};
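For illustration, reaping that value with libaio could be as small as the sketch
below; only the completion side is shown, and the helper name is made up for the
example.

/*
 * Minimal sketch: wait for one aio completion and read both results.
 * res carries the usual byte count or error, res2 the append location.
 */
#include <libaio.h>
#include <stdio.h>

static void reap_aio_append(io_context_t ctx)
{
        struct io_event ev;

        if (io_getevents(ctx, 1, 1, &ev, NULL) == 1) {
                printf("bytes written: %lld\n", (long long)ev.res);
                /* zone-relative offset, in 512B sectors (see below) */
                printf("append offset within zone: %lld\n", (long long)ev.res2);
        }
}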
In io-uring, cqe->flags is repurposed for the zone-append result:
struct io_uring_cqe {
        __u64   user_data;      /* sqe->data submission passed back */
        __s32   res;            /* result code for this event */
        __u32   flags;
};
Since the 32-bit flags field is not sufficient, we chose to return the zone-relative
offset in sector (512B) units. This can cover any zone size representable by
chunk_sectors. Applications will have the burden of combining this with the zone
start to obtain the disk-relative offset. We could obtain more bits by pulling from
the res field, but that too would compel applications to interpret res differently,
and it seems more painstaking than the former option.
To keep uniformity, the zone-relative offset is returned with aio as well.
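As an illustration of the above with liburing (an assumption for the example):
zone_start_sector would typically come from a prior zone report and is not part of
the interface itself.

/*
 * Minimal sketch: reap one cqe and turn the zone-relative sector count in
 * cqe->flags into a disk-relative byte offset.
 */
#include <liburing.h>
#include <stdint.h>
#include <stdio.h>

static int reap_uring_append(struct io_uring *ring, uint64_t zone_start_sector)
{
        struct io_uring_cqe *cqe;
        int ret = io_uring_wait_cqe(ring, &cqe);

        if (ret < 0)
                return ret;

        if (cqe->res >= 0) {
                /* flags holds the zone-relative offset, in 512B sectors */
                uint64_t sect = zone_start_sector + cqe->flags;

                printf("appended %d bytes at byte offset %llu\n",
                       cqe->res, (unsigned long long)(sect << 9));
        }

        io_uring_cqe_seen(ring, cqe);
        return 0;
}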
I am really not a fan of this, to say the least. The input is a byte offset, the
output is a 512B relative sector count... Arg... Can we really not do better than
that?
At the very least, a byte relative offset? The main reason is that this is
_somewhat_ acceptable for raw block device accesses, since the "sector"
abstraction has a clear meaning there, but once we add iomap/zonefs async zone append
support, we will really want a byte unit, as the interface is for regular
files, not a block device file. We could argue that the 512B sector unit is still
around even for files (e.g. block counts in file stat). But a different unit
for the input and output of one operation is really ugly. This is not nice for the user.
You can refer to the discussion with Jens, Pavel and Alex on the uring
interface. With the bits we have and considering the maximum zone size
supported, there is no space for a byte relative offset. We can take
some bits from cqe->res, but we were afraid this is not very
future-proof. Do you have a better idea?
If you can take 8 bits, that gives you 40 bits, enough to support byte relative
offsets for any zone size defined as a number of 512B sectors using an unsigned
int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am
already too tired and my math is failing me...
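To spell the arithmetic out: 2^31 sectors * 512 B = 2^40 B, so the 32-bit flags
field plus 8 extra bits would cover it. Purely as an illustration (no such
interface exists in the patches), recombining the two parts on the application
side would be:

#include <stdint.h>

/* Hypothetical: rebuild a 40-bit byte offset from 8 high bits carried
 * somewhere in res and the 32 low bits carried in flags. */
static inline uint64_t append_byte_offset(uint8_t high8, uint32_t flags)
{
        return ((uint64_t)high8 << 32) | flags;
}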
Yes, the math is correct. I was thinking more of the bits being needed
for other use cases that could collide with append. We considered this
and discarded it for being messy - when Pavel brought up the 512B
alignment we saw it as a good alternative.
Note too that we would be able to translate to a byte offset in
iouring.h, so the user would not need to think about this.
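For illustration only, such a helper could be as small as the sketch below; the
name is made up and nothing like it exists yet.

#include <liburing.h>
#include <stdint.h>

/* Hypothetical helper: convert the 512B-sector value in cqe->flags to bytes. */
static inline uint64_t io_uring_cqe_append_offset(const struct io_uring_cqe *cqe)
{
        return (uint64_t)cqe->flags << 9;
}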
I do not feel strongly about this, so whichever option better fits the current
and near-future needs of uring is the one we will send in V3. We will give
it until next week for others to comment too.
The zone size is defined by chunk_sectors, which is used for raid and software raids
too. This has been an unsigned int forever. I do not see the need to change
this to 64 bits anytime soon, if ever. A raid with a stripe size larger than 1TB
does not really make any sense. Same for the zone size...
Yes. I think max zone sizes are already pretty huge. But yes, this might
change, so we will deal with it when it happens.
[...]
Javier