On 2020/06/26 15:37, javier.gonz@xxxxxxxxxxx wrote:
> On 26.06.2020 03:11, Damien Le Moal wrote:
>> On 2020/06/26 2:18, Kanchan Joshi wrote:
>>> [Revised as per feedback from Damien, Pavel, Jens, Christoph, Matias, Wilcox]
>>>
>>> This patchset enables zone-append using io-uring/linux-aio, on block IO path.
>>> Purpose is to provide zone-append consumption ability to applications which are
>>> using zoned-block-device directly.
>>>
>>> The application may specify RWF_ZONE_APPEND flag with write when it wants to
>>> send zone-append. RWF_* flags work with a certain subset of APIs e.g. uring,
>>> aio, and pwritev2. An error is reported if zone-append is requested using
>>> pwritev2. It is not in the scope of this patchset to support pwritev2 or any
>>> other sync write API for reasons described later.
>>>
>>> Zone-append completion result --->
>>> With zone-append, where write took place can only be known after completion.
>>> So apart from usual return value of write, additional mean is needed to obtain
>>> the actual written location.
>>>
>>> In aio, this is returned to application using res2 field of io_event -
>>>
>>> struct io_event {
>>>         __u64           data;           /* the data field from the iocb */
>>>         __u64           obj;            /* what iocb this event came from */
>>>         __s64           res;            /* result code for this event */
>>>         __s64           res2;           /* secondary result */
>>> };
>>>
>>> In io-uring, cqe->flags is repurposed for zone-append result.
>>>
>>> struct io_uring_cqe {
>>>         __u64   user_data;      /* sqe->data submission passed back */
>>>         __s32   res;            /* result code for this event */
>>>         __u32   flags;
>>> };
>>>
>>> Since 32 bit flags is not sufficient, we choose to return zone-relative offset
>>> in sector/512b units. This can cover zone-size represented by chunk_sectors.
>>> Applications will have the trouble to combine this with zone start to know
>>> disk-relative offset.
>>> But if more bits are obtained by pulling from res field
>>> that too would compel application to interpret res field differently, and it
>>> seems more painstaking than the former option.
>>> To keep uniformity, even with aio, zone-relative offset is returned.
>>
>> I am really not a fan of this, to say the least. The input is byte offset, the
>> output is 512B relative sector count... Arg... We really cannot do better than
>> that ?
>>
>> At the very least, byte relative offset ? The main reason is that this is
>> _somewhat_ acceptable for raw block device accesses since the "sector"
>> abstraction has a clear meaning, but once we add iomap/zonefs async zone append
>> support, we really will want to have byte unit as the interface is regular
>> files, not block device file. We could argue that 512B sector unit is still
>> around even for files (e.g. block counts in file stat). But the different unit
>> for input and output of one operation is really ugly. This is not nice for the user.
>>
>
> You can refer to the discussion with Jens, Pavel and Alex on the uring
> interface. With the bits we have and considering the maximum zone size
> supported, there is no space for a byte relative offset. We can take
> some bits from cqe->res, but we were afraid this is not very
> future-proof. Do you have a better idea?

If you can take 8 bits, that gives you 40 bits, enough to support byte relative
offsets for any zone size defined as a number of 512B sectors using an unsigned
int. Max zone size is 2^31 sectors in that case, so 2^40 bytes. Unless I am
already too tired and my math is failing me...

Zone size is defined by chunk_sectors, which is used for raid and software raids
too. This has been an unsigned int forever. I do not see the need for changing
this to a 64bit anytime soon, if ever. A raid with a stripe size larger than 1TB
does not really make any sense. Same for zone size...
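
To make the arithmetic above concrete, here is a minimal sketch of the kind of
userspace helper being discussed. It is only an illustration: the function
names, the ZONE_MAX_* macros, and the idea of passing the 8 extra bits as a
separate hi8 argument are all assumptions made for this example. Nothing in
this thread fixes which bits of cqe->res would be borrowed or how they would
be extracted, so that step is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* 512B sectors: shifting a sector count left by 9 gives bytes. */
#define SECTOR_SHIFT        9
/* Largest zone size assumed in the mail above: 2^31 sectors. */
#define ZONE_MAX_SECTORS    (UINT64_C(1) << 31)
#define ZONE_MAX_BYTES      (ZONE_MAX_SECTORS << SECTOR_SHIFT)

/* 2^31 sectors * 512 B = 2^40 bytes, so any offset inside a zone fits in 40 bits. */
static_assert(ZONE_MAX_BYTES == (UINT64_C(1) << 40),
              "max zone size should be 2^40 bytes");

/*
 * Hypothetical helper: rebuild a 40-bit zone-relative byte offset from
 * the 32 bits carried in cqe->flags (lo32) and 8 additional bits (hi8).
 * Where hi8 would come from is an assumption left to the caller.
 */
static inline uint64_t zone_append_rel_bytes(uint32_t lo32, uint8_t hi8)
{
        return ((uint64_t)hi8 << 32) | lo32;
}

/* Hypothetical helper: device-relative byte offset of the appended data. */
static inline uint64_t zone_append_abs_bytes(uint64_t zone_start_bytes,
                                             uint32_t lo32, uint8_t hi8)
{
        return zone_start_bytes + zone_append_rel_bytes(lo32, hi8);
}
```

With helpers along these lines, an application would add the zone start (in
bytes) to the reconstructed value and never have to deal with sector units.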
>
>>>
>>> Append using io_uring fixed-buffer --->
>>> This is flagged as not-supported at the moment. Reason being, for fixed-buffer
>>> io-uring sends iov_iter of bvec type. But current append-infra in block-layer
>>> does not support such iov_iter.
>>>
>>> Block IO vs File IO --->
>>> For now, the user zone-append interface is supported only for zoned-block-device.
>>> Regular files/block-devices are not supported. Regular file-system (e.g. F2FS)
>>> will not need this anyway, because zone peculiarities are abstracted within FS.
>>> At this point, ZoneFS also likes to use append implicitly rather than explicitly.
>>> But if/when ZoneFS starts supporting explicit/on-demand zone-append, the check
>>> allowing-only-block-device should be changed.
>>
>> Sure, but I think the interface is still a problem. I am not super happy about
>> the 512B sector unit. Zonefs will be the only file system that will be impacted
>> since other normal POSIX file system will not have zone append interface for
>> users. So this is a limited problem. Still, even for raw block device files
>> accesses, POSIX system calls use Byte unit everywhere. Let's try to use that.
>>
>> For aio, it is easy since res2 is unsigned long long. For io_uring, as discussed
>> already, we can still take 8 bits from the cqe res. All you need is to add a small
>> helper function in userspace iouring.h to simplify the work of the application
>> to get that result.
>
> Ok. See above. We can do this.
>
> Jens: Do you see this as a problem in the future?
>
> [...]
>
> Javier
>

--
Damien Le Moal
Western Digital Research