On Thu, Jun 25, 2020 at 10:45:47PM +0530, Kanchan Joshi wrote:
> Zone-append completion result --->
> With zone-append, where write took place can only be known after completion.
> So apart from usual return value of write, additional mean is needed to obtain
> the actual written location.
>
> In aio, this is returned to application using res2 field of io_event -
>
> struct io_event {
>     __u64 data; /* the data field from the iocb */
>     __u64 obj; /* what iocb this event came from */
>     __s64 res; /* result code for this event */
>     __s64 res2; /* secondary result */
> };

Ah, now I understand. I think you're being a little too specific by
calling this zone-append. This is really a "write-anywhere" operation,
and the specified address is only a hint.

> In io-uring, cqe->flags is repurposed for zone-append result.
>
> struct io_uring_cqe {
>     __u64 user_data; /* sqe->data submission passed back */
>     __s32 res; /* result code for this event */
>     __u32 flags;
> };
>
> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
> in sector/512b units. This can cover zone-size represented by chunk_sectors.
> Applications will have the trouble to combine this with zone start to know
> disk-relative offset. But if more bits are obtained by pulling from res field
> that too would compel application to interpret res field differently, and it
> seems more painstaking than the former option.
> To keep uniformity, even with aio, zone-relative offset is returned.

Urgh, no, that's dreadful. I'm not familiar with the io_uring code.
Maybe the first 8 bytes of the user_data could be required to be the
result offset for this submission type?

> Block IO vs File IO --->
> For now, the user zone-append interface is supported only for zoned-block-device.
> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
> will not need this anyway, because zone peculiarities are abstracted within FS.
> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
> allowing-only-block-device should be changed.

But we also have O_APPEND files. And maybe we'll have other kinds of file
in future for which this would make sense.

> Semantics --->
> Zone-append, by its nature, may perform write on a different location than what
> was specified. It does not fit into POSIX, and trying to fit may just undermine

... I disagree that it doesn't fit into POSIX. As I said above, O_APPEND
is a POSIX concept, so POSIX already understands that writes may not end
up at the current write pointer.

> its benefit. It may be better to keep semantics as close to zone-append as
> possible i.e. specify zone-start location, and obtain the actual-write location
> post completion. Towards that goal, existing async APIs seem to fit fine.
> Async APIs (uring, linux aio) do not work on implicit write-pointer and demand
> explicit write offset (which is what we need for append). Neither write-pointer
> is taken as input, nor it is updated on completion. And there is a clear way to
> get zone-append result. Zone-aware applications while using these async APIs
> can be fine with, for the lack of better word, zone-append semantics itself.
>
> Sync APIs work with implicit write-pointer (at least few of those), and there is
> no way to obtain zone-append result, making it hard for user-space zone-append.
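
To make the O_APPEND point above concrete, here is a minimal C sketch,
assuming a regular file and a single writer. The helper names
open_append_scratch() and append_and_locate() are made up for this
illustration and are not part of any existing or proposed interface.
With O_APPEND, write() ignores the current file offset and places the
data at end of file, so the caller only learns where it landed
afterwards, here by reading back the offset that write() leaves behind.
That read-back is racy as soon as another writer shares the file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or truncate) a scratch file with O_APPEND set. */
int open_append_scratch(const char *path)
{
	return open(path, O_WRONLY | O_CREAT | O_TRUNC | O_APPEND, 0600);
}

/*
 * Hypothetical helper: write len bytes to an O_APPEND descriptor and report
 * the offset at which they landed.
 *
 * Assumption: nobody else writes to the file between write() and lseek().
 * POSIX moves the offset to end of file before an O_APPEND write and
 * advances it by the number of bytes written, so the start of our data is
 * the new offset minus the byte count.
 */
ssize_t append_and_locate(int fd, const void *buf, size_t len, off_t *where)
{
	assert(where != NULL);

	ssize_t n = write(fd, buf, len);
	if (n < 0)
		return -1;

	off_t end = lseek(fd, 0, SEEK_CUR);
	if (end < 0)
		return -1;

	*where = end - n;
	return n;
}
```

For example, if the file already holds 4 bytes and the caller first does
lseek(fd, 0, SEEK_SET), a 2-byte append_and_locate() still reports
offset 4, not 0: the address the caller positioned at was not where the
write ended up.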