On 2020/07/31 18:14, hch@xxxxxxxxxxxxx wrote:
> On Fri, Jul 31, 2020 at 08:14:22AM +0000, Damien Le Moal wrote:
>>
>>> This was one of the reason why we chose to isolate the operation by a
>>> different IOCB flag and not by IOCB_APPEND alone.
>>
>> For zonefs, the plan is:
>> * For the sync write case, zone append is always used.
>> * For the async write case, if we see IOCB_APPEND, then zone append BIOs are
>> used. If not, regular write BIOs are used.
>>
>> Simple enough I think. No need for a new flag.
>
> Simple, but wrong. Sync vs async really doesn't matter, even sync
> writes will have problems if there are other writers. We need a flag
> for "report the actual offset for appending writes", and based on that
> flag we need to not allow short writes (or split extents for real
> file systems). We also need a fcntl or ioctl to report this max atomic
> write size so that applications can rely on it.
>

Sync writes are done under the inode lock, so there cannot be other writers at
the same time. And for the sync case, since the actual written offset is
necessarily equal to the file size before the write, there is no need to report
it (there is no system call that can report that anyway). For this sync case,
the only change that the use of zone append introduces compared to regular
writes is the potential for more short writes.

Adding a flag for "report the actual offset for appending writes" is fine with
me, but do you also mean to use this flag for driving zone append write vs
regular writes in zonefs?

The fcntl or ioctl for getting the max atomic write size would be fine too.
Given that zonefs is very close to the underlying zoned drive, I was assuming
that the application can simply consult the device sysfs zone_append_max_bytes
queue attribute.

For regular file systems, this value would be used internally only. I do not
really see how it can be useful to applications. Furthermore, the file system
may have a hard time giving that information to the application depending on its
underlying storage configuration (e.g. erasure coding/declustered RAID).

-- 
Damien Le Moal
Western Digital Research
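
For illustration only, here is a minimal userspace sketch of the sysfs lookup
mentioned above for the zonefs case. It assumes the attribute is exposed as
/sys/block/<disk>/queue/zone_append_max_bytes and holds a single decimal byte
count. The helper names zone_append_attr_path() and read_queue_attr() are
hypothetical, not part of any existing library.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper: build the sysfs path of the zone_append_max_bytes
 * queue attribute for a disk name such as "sda" (no "/dev/" prefix).
 * Assumes the usual /sys/block/<disk>/queue/ layout.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int zone_append_attr_path(const char *disk, char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "/sys/block/%s/queue/zone_append_max_bytes", disk);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read one non-negative decimal value from a
 * sysfs-style attribute file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static long long read_queue_attr(const char *path)
{
    FILE *f = fopen(path, "r");
    char line[64];
    char *end;
    long long val;

    if (!f)
        return -1;
    if (!fgets(line, sizeof(line), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    errno = 0;
    val = strtoll(line, &end, 10);
    if (errno || end == line || val < 0)
        return -1;
    return val;
}
```

An application would build the path with zone_append_attr_path(), read it with
read_queue_attr(), and treat the result as an upper bound on the size of a
single append write. Mapping a mounted zonefs instance back to its disk name is
not shown here. As far as I know, the attribute reads as 0 on devices that are
not zoned, so a caller would likely want to handle that case as well.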