On Tue, Nov 29, 2022 at 11:20:05AM -0800, Shawn wrote:
> Hello all,
> I implemented a write workload by sequentially appending to the file
> end using libaio aio_write in O_DIRECT mode (with proper offset and
> buffer address alignment). When I reach a 1MB boundary I call
> fallocate() to extend the file.

Ah, yet another fallocate anti-pattern.

Firstly, friends don't let friends use fallocate() with AIO+DIO.

fallocate() serialises all IO to that file - it waits for existing IO
to complete, and prevents new IO from being issued until the
fallocate() operation completes. It is a completely synchronous
operation and it does not play well with non-blocking IO paths (AIO
or io_uring).

Put simply: fallocate() is an IO performance and scalability killer.

If you need to *allocate* in aligned 1MB chunks, then use extent size
hints to tell the filesystem to allocate 1MB aligned chunks when it
does IO. This does not serialise all IO to the file like fallocate()
does, it achieves exactly the same result as using fallocate() to
extend the file, yet the application doesn't need to know anything
about controlling file layout. (See the first sketch below.)

Further, using DIO write() calls to extend the file rather than
fallocate() or ftruncate() also means that there will always be data
right up to the end of the file. That's because XFS will not update
the file size on extension until the IO has completed, and making the
file size extension persistent (i.e. journalling it) doesn't happen
until the data has been made persistent via device cache flushes.
IOWs, if the file has been extended by a write IO, then XFS has
*guaranteed* that the data written to that extended region has been
persisted to disk before the size extension is persisted.

> I need to protect the write from various failures such as disk unplug
> / power failure. The bottom line is, once I ack a write-complete,
> the user must be able to read it back later after a disk/power
> failure and recovery.

fallocate() does not provide data integrity guarantees. The
application needs to use O_DSYNC/RWF_DSYNC IO controls to tell the
filesystem to provide data integrity guarantees. (See the second
sketch below.)

> In my understanding, fallocate() will preallocate disk space for the
> file, and I can call fsync to make sure the file metadata about this
> new space is persisted when fallocate returns.

Yes, but as it just contains zeros, if it is missing after a crash,
what does it matter? It just looks like the file wasn't extended, and
the application has to be able to recover from that situation
already, yes?

> Once aio_write returns
> the data is in the disk. So it seems I don't need fsync after
> aio-write completion, because (1) the data is in disk, and (2) the
> file metadata to address the disk blocks is in disk.

Wrong. Direct IO does not guarantee persistence when the
write()/aio_write() completes. Even with direct IO, the data can be
held in volatile caches in the storage stack, and the data is not
guaranteed to be persistent until directed by the application to be
made persistent.

> On the other hand, it seems XFS always does a delayed allocation
> which might break my assumption that file=>disk space mapping is
> persisted by fallocate.

Wrong on many levels.

The first is the same as above - fallocate() does not provide any
data persistence guarantees.

Secondly, DIO writes do not use delayed allocation because they
can't - we have to issue the IO immediately, so there's nothing that
can be delayed. IOWs, delayed allocation is only done for buffered
IO. This is true for delayed allocation on both ext4 and btrfs as
well.
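As a rough sketch of what the extent size hint looks like in practice
(a made-up "datafile" name, error handling cut down to perror(), and
note that XFS generally wants the hint set while the file is still
empty):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR */

int main(void)
{
	/* "datafile" is a made-up name for illustration only. */
	int fd = open("datafile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fsxattr fsx;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	/* Ask the filesystem for 1MB aligned allocation chunks as IO
	 * is issued - no fallocate() calls needed. */
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = 1024 * 1024;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}

	/* ... now issue appending O_DIRECT writes as usual ... */
	close(fd);
	return 0;
}

Directories can carry the same hint via FS_XFLAG_EXTSZINHERIT so that
newly created files pick it up automatically.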
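And a similarly rough sketch of the O_DSYNC side with libaio - again a
made-up filename, a single 4096 byte write, minimal error handling;
link with -laio:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	4096

int main(void)
{
	/* O_DSYNC on the fd means every completed write carries a
	 * data integrity guarantee - no separate fsync()/fdatasync()
	 * needed before acking the write. */
	int fd = open("datafile",
		      O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT needs an aligned buffer. */
	void *buf;
	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;
	memset(buf, 0xab, BLKSZ);

	io_context_t ctx = 0;
	if (io_setup(8, &ctx) < 0)
		return 1;

	struct iocb cb, *cbs[1] = { &cb };
	io_prep_pwrite(&cb, fd, buf, BLKSZ, 0 /* file offset */);

	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	struct io_event ev;
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return 1;

	if (ev.res != BLKSZ)
		fprintf(stderr, "short or failed write: %ld\n",
			(long)ev.res);

	io_destroy(ctx);
	close(fd);
	return 0;
}

If the fd is not opened with O_DSYNC, an fdatasync(fd) after
io_getevents() returns - and before acknowledging the write to the
user - provides the same guarantee for everything written so far.
RWF_DSYNC does the same thing on a per-IO basis on kernels that
support it.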
Further, on XFS, buffered writes into preallocated space from
fallocate() do not use delayed allocation either - the space is
already allocated, so there's nothing to allocate and hence nothing
to delay!

To drive the point home even further: if you use extent size hints
with buffered writes, then this also turns off delayed allocation and
instead uses immediate allocation, just like DIO writes, to
preallocate the aligned extent around the range being written.

Lastly, if you write an fallocate() based algorithm that works "well"
on XFS, there's every chance it's going to absolutely suck on a
different filesystem (e.g. btrfs) because different filesystems have
very different allocation policies and interact with preallocation
very differently.

IOWs, there's a major step between knowing what concepts like delayed
allocation and preallocation do versus understanding the complex
policies that filesystems weave around these concepts to make general
purpose workloads perform optimally in most situations....

> I can improve the data-in-disk format to carry proper header/footer
> to detect a broken write when scanning the file after a disk/power
> failure.
>
> Given all those above, do I still need a fsync() after aio_write
> completion in XFS to protect data persistence?

Regardless of the filesystem, applications *always* need to use
fsync/fdatasync/O_SYNC/O_DSYNC/RWF_DSYNC to guarantee data
persistence. The filesystem doesn't provide any persistence
guarantees in the absence of these application directives -
guaranteeing user data integrity is the responsibility of the
application manipulating the user data, not the filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx