On Tue, Nov 29, 2022 at 11:20:05AM -0800, Shawn wrote:
> Hello all,
> I implemented a write workload by sequentially appending to the file
> end using libaio aio_write in O_DIRECT mode (with proper offset and
> buffer address alignment). When I reach a 1MB boundary I call
> fallocate() to extend the file.

Ah, yet another fallocate anti-pattern.

Firstly, friends don't let friends use fallocate() with AIO+DIO.

fallocate() serialises all IO to that file - it waits for existing IO
to complete, and prevents new IO from being issued until the
fallocate() operation completes. It is a completely synchronous
operation and it does not play well with non-blocking IO paths (AIO
or io_uring).

Put simply: fallocate() is an IO performance and scalability killer.

If you need to *allocate* in aligned 1MB chunks, then use extent size
hints to tell the filesystem to allocate 1MB aligned chunks when it
does IO. This does not serialise all IO to the file like fallocate()
does, it achieves exactly the same result as using fallocate() to
extend the file, yet the application doesn't need to know anything
about controlling file layout. (See the first sketch below.)

Further, using DIO write() calls to extend the file rather than
fallocate() or ftruncate() also means that there will always be data
right up to the end of the file. That's because XFS will not update
the file size on extension until the IO has completed, and making the
file size extension persistent (i.e. journalling it) doesn't happen
until the data has been made persistent via device cache flushes.
IOWs, if the file has been extended by a write IO, then XFS has
*guaranteed* that the data written to that extended region has been
persisted to disk before the size extension is persisted.

> I need to protect the write from various failures such as disk unplug
> / power failure. The bottom line is, once I ack a write-complete,
> the user must be able to read it back later after a disk/power
> failure and recovery.

fallocate() does not provide data integrity guarantees. The
application needs to use O_DSYNC/RWF_DSYNC IO controls to tell the
filesystem to provide data integrity guarantees. (See the second
sketch below.)

> In my understanding, fallocate() will preallocate disk space for the
> file, and I can call fsync to make sure the file metadata about this
> new space is persisted when fallocate returns.

Yes, but as it just contains zeros, if it is missing after a crash,
what does it matter? It just looks like the file wasn't extended, and
the application has to be able to recover from that situation
already, yes?

> Once aio_write returns
> the data is in the disk. So it seems I don't need fsync after
> aio-write completion, because (1) the data is in disk, and (2) the
> file metadata to address the disk blocks is in disk.

Wrong. Direct IO does not guarantee persistence when the
write()/aio_write() completes. Even with direct IO, the data can be
held in volatile caches in the storage stack, and the data is not
guaranteed to be persistent until directed by the application to be
made persistent.

> On the other hand, it seems XFS always does a delayed allocation
> which might break my assumption that file=>disk space mapping is
> persisted by fallocate.

Wrong on many levels.

The first is the same as above - fallocate() does not provide any
data persistence guarantees.

Secondly, DIO writes do not use delayed allocation because they
can't - we have to issue the IO immediately, so there's nothing that
can be delayed. IOWs, delayed allocation is only done for buffered
IO. This is true for delayed allocation on both ext4 and btrfs as
well.
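As a rough sketch of what the extent size hint looks like in practice
(a made-up "datafile" name, error handling cut down to perror(), and
note that XFS generally wants the hint set while the file is still
empty):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FS[GS]ETXATTR */

int main(void)
{
	/* "datafile" is a made-up name for illustration only. */
	int fd = open("datafile", O_RDWR | O_CREAT | O_DIRECT, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	struct fsxattr fsx;
	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSGETXATTR");
		return 1;
	}

	/* Ask the filesystem for 1MB aligned allocation chunks as IO
	 * is issued - no fallocate() calls needed. */
	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = 1024 * 1024;

	if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx) < 0) {
		perror("FS_IOC_FSSETXATTR");
		return 1;
	}

	/* ... now issue appending O_DIRECT writes as usual ... */
	close(fd);
	return 0;
}

Directories can carry the same hint via FS_XFLAG_EXTSZINHERIT so that
newly created files pick it up automatically.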
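And a similarly rough sketch of the O_DSYNC side with libaio - again a
made-up filename, a single 4096 byte write, minimal error handling;
link with -laio:

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ	4096

int main(void)
{
	/* O_DSYNC on the fd means every completed write carries a
	 * data integrity guarantee - no separate fsync()/fdatasync()
	 * needed before acking the write. */
	int fd = open("datafile",
		      O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* O_DIRECT needs an aligned buffer. */
	void *buf;
	if (posix_memalign(&buf, BLKSZ, BLKSZ))
		return 1;
	memset(buf, 0xab, BLKSZ);

	io_context_t ctx = 0;
	if (io_setup(8, &ctx) < 0)
		return 1;

	struct iocb cb, *cbs[1] = { &cb };
	io_prep_pwrite(&cb, fd, buf, BLKSZ, 0 /* file offset */);

	if (io_submit(ctx, 1, cbs) != 1)
		return 1;

	struct io_event ev;
	if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)
		return 1;

	if (ev.res != BLKSZ)
		fprintf(stderr, "short or failed write: %ld\n",
			(long)ev.res);

	io_destroy(ctx);
	close(fd);
	return 0;
}

If the fd is not opened with O_DSYNC, an fdatasync(fd) after
io_getevents() returns - and before acknowledging the write to the
user - provides the same guarantee for everything written so far.
RWF_DSYNC does the same thing on a per-IO basis on kernels that
support it.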
Further, on XFS, buffered writes into preallocated space from
fallocate() do not use delayed allocation either - the space is
already allocated, so there's nothing to allocate and hence nothing
to delay!

To drive the point home even further: if you use extent size hints
with buffered writes, then this also turns off delayed allocation and
instead uses immediate allocation, just like DIO writes, to
preallocate the aligned extent around the range being written.

Lastly, if you write an fallocate() based algorithm that works "well"
on XFS, there's every chance it's going to absolutely suck on a
different filesystem (e.g. btrfs) because different filesystems have
very different allocation policies and interact with preallocation
very differently.

IOWs, there's a major step between knowing what concepts like delayed
allocation and preallocation do versus understanding the complex
policies that filesystems weave around these concepts to make general
purpose workloads perform optimally in most situations....

> I can improve the data-in-disk format to carry proper header/footer
> to detect a broken write when scanning the file after a disk/power
> failure.
>
> Given all those above, do I still need a fsync() after aio_write
> completion in XFS to protect data persistence?

Regardless of the filesystem, applications *always* need to use
fsync/fdatasync/O_SYNC/O_DSYNC/RWF_DSYNC to guarantee data
persistence. The filesystem doesn't provide any persistence
guarantees in the absence of these application directives -
guaranteeing user data integrity is the responsibility of the
application manipulating the user data, not the filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx