Hello Dave,

Thank you for your detailed reply. That explanation of fallocate()
makes a lot of sense. I want to figure out the default extent size in
my environment, but "xfs_info" doesn't seem to output it (see the
output below).

Also, I want to use this command to set the default extent size hint -
is this correct?

$ sudo mkfs.xfs -d extszinherit=256   <== the data block size is 4KB, so 256 blocks is 1MB.

$ sudo xfs_info /dev/nvme3n1
meta-data=/dev/nvme3n1           isize=512    agcount=4, agsize=117210902 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=468843606, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=228927, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

regards,
Shawn

On Tue, Nov 29, 2022 at 1:34 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Nov 29, 2022 at 11:20:05AM -0800, Shawn wrote:
> > Hello all,
> > I implemented a write workload by sequentially appending to the file
> > end using libaio aio_write in O_DIRECT mode (with proper offset and
> > buffer address alignment). When I reach a 1MB boundary I call
> > fallocate() to extend the file.
>
> Ah, yet another fallocate anti-pattern.
>
> Firstly, friends don't let friends use fallocate() with AIO+DIO.
>
> fallocate() serialises all IO to that file - it waits for existing
> IO to complete, and prevents new IO from being issued until the
> fallocate() operation completes. It is a completely synchronous
> operation and it does not play well with non-blocking IO paths (AIO
> or io_uring). Put simply: fallocate() is an IO performance and
> scalability killer.
>
> If you need to *allocate* in aligned 1MB chunks, then use extent
> size hints to tell the filesystem to allocate 1MB aligned chunks
> when it does IO. This does not serialise all IO to the file like
> fallocate() does, it achieves exactly the same result as using
> fallocate() to extend the file, yet the application doesn't need to
> know anything about controlling file layout.
>
> Further, using DIO write() calls to extend the file rather than
> fallocate() or ftruncate() also means that there will always be data
> right up to the end of the file. That's because XFS will not update
> the file size on extension until the IO has completed, and making
> the file size extension persistent (i.e. journalling it) doesn't
> happen until the data has been made persistent via device cache
> flushes.
>
> IOWs, if the file has been extended by a write IO, then XFS has
> *guaranteed* that the data written to that extended region has been
> persisted to disk before the size extension is persisted.
>
> > I need to protect the write from various failures such as disk unplug
> > / power failure. The bottom line is, once I ack a write-complete,
> > the user must be able to read it back later after a disk/power failure
> > and recovery.
>
> fallocate() does not provide data integrity guarantees. The
> application needs to use O_DSYNC/RWF_DSYNC IO controls to tell the
> filesystem to provide data integrity guarantees.
>
> > In my understanding, fallocate() will preallocate disk space for the
> > file, and I can call fsync to make sure the file metadata about this
> > new space is persisted when fallocate returns.
>
> Yes, but it just contains zeros, so if it is missing after a
> crash, what does it matter? It just looks like the file wasn't
> extended, and the application has to be able to recover from that
> situation already, yes?
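
To make the extent size hint mechanics concrete: besides setting an
inheritable default at mkfs time with -d extszinherit=N (in filesystem
blocks), the hint can be queried and set per file through the generic
FS_IOC_FSGETXATTR/FS_IOC_FSSETXATTR ioctls - this is the interface
xfs_io's "extsize" command drives. A minimal sketch in C, assuming the
definitions from linux/fs.h; the 1MB value and the file argument are
illustrative:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        if (argc != 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
        }

        int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        struct fsxattr fsx;
        if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSGETXATTR");
                return 1;
        }
        printf("current extent size hint: %u bytes\n", fsx.fsx_extsize);

        /* Ask the filesystem to allocate this file in aligned 1MB
         * chunks when IO forces allocation. */
        fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
        fsx.fsx_extsize = 1024 * 1024;
        if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
                perror("FS_IOC_FSSETXATTR");
                return 1;
        }

        close(fd);
        return 0;
}

Note that XFS rejects the per-file hint once the file already has
extents allocated, so it needs to be set before the first write.
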
>
> > Once aio_write returns the data is on disk. So it seems I don't
> > need fsync after aio_write completion, because (1) the data is on
> > disk, and (2) the file metadata to address the disk blocks is on
> > disk.
>
> Wrong. Direct IO does not guarantee persistence when the
> write()/aio_write() completes. Even with direct IO, the data can be
> held in volatile caches in the storage stack, and the data is not
> guaranteed to be persistent until directed by the application to be
> made persistent.
>
> > On the other hand, it seems XFS always does a delayed allocation
> > which might break my assumption that the file=>disk space mapping
> > is persisted by fallocate.
>
> Wrong on many levels. The first is the same as above - fallocate()
> does not provide any data persistence guarantees.
>
> Secondly, DIO writes do not use delayed allocation because they
> can't - we have to issue the IO immediately, so there's nothing that
> can be delayed. IOWs, delayed allocation is only done for buffered
> IO. This is true for delayed allocation on both ext4 and btrfs as
> well.
>
> Further, on XFS buffered writes into preallocated space from
> fallocate() do not use delayed allocation either - the space is
> already allocated, so there's nothing to allocate and hence nothing
> to delay!
>
> To drive the point home even further: if you use extent size
> hints with buffered writes, then this also turns off delayed
> allocation and instead uses immediate allocation, just like DIO
> writes, to preallocate the aligned extent around the range being
> written.
>
> Lastly, if you write an fallocate() based algorithm that works
> "well" on XFS, there's every chance it's going to absolutely suck on
> a different filesystem (e.g. btrfs) because different filesystems
> have very different allocation policies and interact with
> preallocation very differently.
>
> IOWs, there's a major step between knowing what concepts like
> delayed allocation and preallocation do versus understanding the
> complex policies that filesystems weave around these concepts to
> make general purpose workloads perform optimally in most
> situations....
>
> > I can improve the on-disk data format to carry a proper
> > header/footer to detect a broken write when scanning the file
> > after a disk/power failure.
> >
> > Given all of the above, do I still need an fsync() after aio_write
> > completion in XFS to protect data persistence?
>
> Regardless of the filesystem, applications *always* need to use
> fsync/fdatasync/O_SYNC/O_DSYNC/RWF_DSYNC to guarantee data
> persistence. The filesystem doesn't provide any persistence
> guarantees in the absence of these application directives -
> guaranteeing user data integrity is the responsibility of the
> application manipulating the user data, not the filesystem.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx
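
To make the advice above concrete, here is a minimal sketch of the
append pattern Dave describes: no fallocate() at all (an extent size
hint takes care of aligned 1MB allocation), and O_DSYNC on the file
descriptor so each completed AIO write already carries its data
integrity guarantee when the completion is reaped. It assumes Linux
with libaio (link with -laio); the file name, sizes, and single-IO
structure are illustrative:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK  (1024 * 1024)    /* matches a 1MB extent size hint */

int main(void)
{
        /* O_DSYNC gives every write data integrity semantics: the
         * completion is not delivered until the data and the file
         * size extension are on stable storage. No trailing fsync(). */
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        /* O_DIRECT requires an aligned buffer, offset and length. */
        void *buf;
        if (posix_memalign(&buf, 4096, CHUNK)) {
                perror("posix_memalign");
                return 1;
        }
        memset(buf, 'x', CHUNK);

        io_context_t ctx = 0;
        int ret = io_setup(8, &ctx);
        if (ret < 0) {
                fprintf(stderr, "io_setup: %d\n", ret);
                return 1;
        }

        /* Append one chunk at the current EOF (offset 0 for a new
         * file; advance by CHUNK for each subsequent append). */
        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pwrite(&cb, fd, buf, CHUNK, 0);

        ret = io_submit(ctx, 1, cbs);
        if (ret != 1) {
                fprintf(stderr, "io_submit: %d\n", ret);
                return 1;
        }

        struct io_event ev;
        ret = io_getevents(ctx, 1, 1, &ev, NULL);
        if (ret != 1 || ev.res != CHUNK) {
                fprintf(stderr, "write failed: ret=%d res=%ld\n",
                        ret, (long)ev.res);
                return 1;
        }
        /* The data written is now persistent: safe to ack the write. */

        io_destroy(ctx);
        close(fd);
        free(buf);
        return 0;
}

With this pattern, the moment io_getevents() reaps a successful
completion, both the data and the file size extension are persistent,
so it is safe to acknowledge the write to the user at that point.
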