Hello Dave,

Thank you for your detailed reply. That explanation of fallocate()
makes a lot of sense. I want to figure out the default extent size in
my environment, but "xfs_info" doesn't seem to output it (see the
output below).

Also, I want to use this command to set the default extent size hint -
is this correct?

$ sudo mkfs.xfs -d extszinherit=256   <== the data block size is 4KB, so 256 blocks is 1MB.

$ sudo xfs_info /dev/nvme3n1
meta-data=/dev/nvme3n1           isize=512    agcount=4, agsize=117210902 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=468843606, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=228927, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

regards,
Shawn

On Tue, Nov 29, 2022 at 1:34 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 29, 2022 at 11:20:05AM -0800, Shawn wrote:
> > Hello all,
> > I implemented a write workload by sequentially appending to the file
> > end using libaio aio_write in O_DIRECT mode (with proper offset and
> > buffer address alignment). When I reach a 1MB boundary I call
> > fallocate() to extend the file.
>
> Ah, yet another fallocate anti-pattern.
>
> Firstly, friends don't let friends use fallocate() with AIO+DIO.
>
> fallocate() serialises all IO to that file - it waits for existing
> IO to complete, and prevents new IO from being issued until the
> fallocate() operation completes. It is a completely synchronous
> operation and it does not play well with non-blocking IO paths (AIO
> or io_uring). Put simply: fallocate() is an IO performance and
> scalability killer.
>
> If you need to *allocate* in aligned 1MB chunks, then use extent
> size hints to tell the filesystem to allocate 1MB aligned chunks
> when it does IO. This does not serialise all IO to the file like
> fallocate() does, it achieves exactly the same result as using
> fallocate() to extend the file, yet the application doesn't need to
> know anything about controlling file layout.
>
> Further, using DIO write() calls to extend the file rather than
> fallocate() or ftruncate() also means that there will always be data
> right up to the end of the file. That's because XFS will not update
> the file size on extension until the IO has completed, and making
> the file size extension persistent (i.e. journalling it) doesn't
> happen until the data has been made persistent via device cache
> flushes.
>
> IOWs, if the file has been extended by a write IO, then XFS has
> *guaranteed* that the data written to that extended region has been
> persisted to disk before the size extension is persisted.
>
> > I need to protect the write from various failures such as disk unplug
> > / power failure. The bottom line is, once I ack a write-complete,
> > the user must be able to read it back later after a disk/power failure
> > and recovery.
>
> fallocate() does not provide data integrity guarantees. The
> application needs to use O_DSYNC/RWF_DSYNC IO controls to tell the
> filesystem to provide data integrity guarantees.
>
> > In my understanding, fallocate() will preallocate disk space for the
> > file, and I can call fsync to make sure the file metadata about this
> > new space is persisted when fallocate returns.
>
> Yes, but it just contains zeros, so if it is missing after a
> crash, what does it matter? It just looks like the file wasn't
> extended, and the application has to be able to recover from that
> situation already, yes?
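
To make the extent size hint mechanics concrete: besides setting an
inheritable default at mkfs time with -d extszinherit=N (in filesystem
blocks), the hint can be queried and set per file through the generic
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls - this is the interface
xfs_io's "extsize" command drives. A minimal sketch in C, assuming the
definitions from linux/fs.h; the 1MB value and the file argument are
illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct fsxattr fsx;
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSGETXATTR");
                return 1;
        }
        printf("current extent size hint: %u bytes\n", fsx.fsx_extsize);

        /* Ask the filesystem to allocate this file in aligned 1MB
         * chunks when IO forces allocation. */
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = 1024 * 1024;
        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSSETXATTR");
                return 1;
        }

        close(fd);
        return 0;
}

Note that XFS rejects the per-file hint once the file already has
extents allocated, so it needs to be set before the first write.
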
>
> > Once aio_write returns the data is on disk. So it seems I don't
> > need fsync after aio_write completion, because (1) the data is on
> > disk, and (2) the file metadata to address the disk blocks is on
> > disk.
>
> Wrong. Direct IO does not guarantee persistence when the
> write()/aio_write() completes. Even with direct IO, the data can be
> held in volatile caches in the storage stack, and the data is not
> guaranteed to be persistent until directed by the application to be
> made persistent.
>
> > On the other hand, it seems XFS always does a delayed allocation
> > which might break my assumption that the file=>disk space mapping
> > is persisted by fallocate.
>
> Wrong on many levels. The first is the same as above - fallocate()
> does not provide any data persistence guarantees.
>
> Secondly, DIO writes do not use delayed allocation because they
> can't - we have to issue the IO immediately, so there's nothing that
> can be delayed. IOWs, delayed allocation is only done for buffered
> IO. This is true for delayed allocation on both ext4 and btrfs as
> well.
>
> Further, on XFS buffered writes into preallocated space from
> fallocate() do not use delayed allocation either - the space is
> already allocated, so there's nothing to allocate and hence nothing
> to delay!
>
> To drive the point home even further: if you use extent size
> hints with buffered writes, then this also turns off delayed
> allocation and instead uses immediate allocation, just like DIO
> writes, to preallocate the aligned extent around the range being
> written.
>
> Lastly, if you write an fallocate() based algorithm that works
> "well" on XFS, there's every chance it's going to absolutely suck on
> a different filesystem (e.g. btrfs) because different filesystems
> have very different allocation policies and interact with
> preallocation very differently.
>
> IOWs, there's a major step between knowing what concepts like
> delayed allocation and preallocation do versus understanding the
> complex policies that filesystems weave around these concepts to
> make general purpose workloads perform optimally in most
> situations....
>
> > I can improve the on-disk data format to carry a proper
> > header/footer to detect a broken write when scanning the file
> > after a disk/power failure.
> >
> > Given all of the above, do I still need an fsync() after aio_write
> > completion in XFS to protect data persistence?
>
> Regardless of the filesystem, applications *always* need to use
> fsync/fdatasync/O_SYNC/O_DSYNC/RWF_DSYNC to guarantee data
> persistence. The filesystem doesn't provide any persistence
> guarantees in the absence of these application directives -
> guaranteeing user data integrity is the responsibility of the
> application manipulating the user data, not the filesystem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
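
To make the advice above concrete, here is a minimal sketch of the
append pattern Dave describes: no fallocate() at all (an extent size
hint takes care of aligned 1MB allocation), and O_DSYNC on the file
descriptor so each completed AIO write already carries its data
integrity guarantee when the completion is reaped. It assumes Linux
with libaio (link with -laio); the file name, sizes, and single-IO
structure are illustrative:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK  (1024 * 1024)    /* matches a 1MB extent size hint */

int main(void)
{
        /* O_DSYNC gives every write data integrity semantics: the
         * completion is not delivered until the data and the file
         * size extension are on stable storage. No trailing fsync(). */
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* O_DIRECT requires an aligned buffer, offset and length. */
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK)) {
                perror("posix_memalign");
                return 1;
        }
        memset(buf, 'x', CHUNK);

        io_context_t ctx = 0;
        int ret = io_setup(8, &ctx);
        if (ret < 0) {
                fprintf(stderr, "io_setup: %d\n", ret);
                return 1;
        }

        /* Append one chunk at the current EOF (offset 0 for a new
         * file; advance by CHUNK for each subsequent append). */
        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, CHUNK, 0);

        ret = io_submit(ctx, 1, cbs);
        if (ret != 1) {
                fprintf(stderr, "io_submit: %d\n", ret);
                return 1;
        }

        struct io_event ev;
        ret = io_getevents(ctx, 1, 1, &ev, NULL);
        if (ret != 1 || ev.res != CHUNK) {
                fprintf(stderr, "write failed: ret=%d res=%ld\n",
                        ret, (long)ev.res);
                return 1;
        }
        /* The data written is now persistent: safe to ack the write. */

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}

With this pattern, the moment io_getevents() reaps a successful
completion, both the data and the file size extension are persistent,
so it is safe to acknowledge the write to the user at that point.
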