[please wrap your emails at 72 columns]

On Mon, Dec 02, 2024 at 10:41:02AM +0000, Mitta Sai Chaithanya wrote:
> Hi Team,
>          We are using XFS reflink feature to create snapshot of an
> origin file (a thick file, created through fallocate) and exposing
> origin file as a block device to the users.

So it's basically a loop block device?

All the questions you are asking are answered by studying how
drivers/block/loop.c translates block device integrity requests to
VFS operations on the backing file.

The block device API has a mechanism for triggering integrity
operations: REQ_PREFLUSH to flush volatile caches, and REQ_FUA to ask
for a specific IO to be persisted to stable storage. The loop device
translates REQ_PREFLUSH to vfs_fsync() on the backing file, and
REQ_FUA is emulated by write/vfs_fsync_range().

I have a patch to natively support REQ_FUA by converting it to a
RWF_DSYNC write call, which allows the underlying filesystem to
convert that data integrity write to a REQ_FUA write w/ O_DIRECT...

> XFS file was opened with O_DIRECT option to avoid buffers at lower
> layer while performing writes, even though a thick file is created,
> when user performs writes then there are metadata changes associated
> to writes (mostly xfs marks extents to know whether the data is
> written to physical blocks or not).

This is not specific to O_DIRECT - even buffered writes need to do
post data-IO unwritten extent conversion on the first write to any
file data.

> To avoid metadata changes during user writes we are explicitly
> zeroing entire file range post creation of file, so that there won't
> be any metadata changes in future for writes that happen on zeroed
> blocks.

Which makes fallocate() redundant. Simply writing zeroes to an empty
file will allocate and mark the extents as written. Run fsync at the
end of the writes, and then there are no metadata updates other than
timestamps for future overwrites.

Until, of course ....

> Now, if reflink copy of origin file is created then there will be
> metadata changes which need to be persisted if data is overwritten
> on the reflinked blocks of original file.

.... you share the data extents between multiple inodes. Then every
data write that needs to break extent sharing will trigger a COW that
allocates new extents, hence requiring metadata modification both
before the data IO is submitted and again after it is completed.

> Even though the file is opened in O_DIRECT mode changes to metadata
> do not persist before write is acknowledged back to user,

O_DIRECT by itself does not imply -any- data integrity nor any
specific data/metadata ordering. Filesystems and block devices are
free to treat O_DIRECT writes in any way they want w.r.t. caching,
(lack of) crash resilience, etc.

> if system crashes when changes are in buffer then post recovery
> writes which were acknowledged are not available to read.

Well, yes. You need to combine O_DIRECT with O_DSYNC/O_SYNC/f[data]sync
for it to have any meaning for the persistence of the data and
metadata needed to retrieve the data being written.

> Two options that we are aware of to avoid the consistency issue are:
>
> 1. Adding O_SYNC flag while opening file which ensures each write
> gets persisted in persistent media, but this leads to poor
> performance.

"Poor performance" == "exact capability of the storage hardware to
persist data".

i.e. the performance of O_DIRECT writes with a data integrity
requirement is directly determined by the speed with which the
storage device can persist the data.
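To make that concrete, here is a minimal userspace sketch of a per-IO
data integrity write. The file name, write offset and 4096 byte
alignment are assumptions for illustration only, not taken from your
setup:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;
        int fd;

        /*
         * O_DSYNC is what turns an O_DIRECT write into a data
         * integrity write; O_DIRECT alone gives no persistence or
         * ordering guarantees. "backing-file" and the 4096 byte
         * alignment are assumptions for illustration.
         */
        fd = open("backing-file", O_WRONLY | O_DIRECT | O_DSYNC);
        if (fd < 0)
                err(1, "open");

        /* O_DIRECT needs the buffer, length and offset aligned to
         * the logical block size (4096 assumed here). */
        if (posix_memalign(&buf, 4096, 4096))
                err(1, "posix_memalign");
        memset(buf, 0, 4096);

        /*
         * When this returns, the data and the metadata needed to
         * retrieve it are on stable storage - completion latency is
         * whatever the device needs to persist the IO(s).
         */
        if (pwrite(fd, buf, 4096, 0) != 4096)
                err(1, "pwrite");

        free(buf);
        close(fd);
        return 0;
}

Every such write pays the device's full persistence latency before it
returns, which is exactly the "poor performance" you are measuring.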
A filesystem like XFS will require two IOs to persist data written to
a newly allocated extent, and they are dependent writes. We have no
mechanism for telling block devices that they must order writes to
persistent storage in a specific manner, so our only tool for
ensuring that the block device orders the data and metadata IOs
correctly is to issue a REQ_PREFLUSH between the two IOs. We do this
with the journal IO, as it is the IO that requires all data writes to
be persistent before we persist the metadata in the journal....

If you can use AIO/io_uring, then the latency of these dependent
writes can be hidden as the process does not block waiting for two
IOs to complete. It can process more IO submissions whilst the data
integrity write is in flight. Then performance is not limited by
synchronous data integrity IO latency....

> 2. Performing sync operation along with writes/post writes will
> guarantee that metadata changes will be persisted.

Yes, but that will only result in faster IO if the fdatasync calls
are batched for multiple data IOs.

i.e. for O_DIRECT, fdatasync() is effectively a REQ_PREFLUSH|REQ_FUA
journal write. If you are issuing one fdatasync per write, there is
no benefit over O_DSYNC. And if you can batch O_DIRECT writes per
fdatasync() call, then the block device you expose effectively has a
volatile cache, and the REQ_PREFLUSH and REQ_FUA operations issued by
the upper block device need to be obeyed.

IOWs, if you can amortise fdatasync across multiple IOs, then you may
as well just advertise the device as having volatile caches and
simply rely on the upper filesystem (i.e. whatever is on the block
device) to issue data integrity flushes as appropriate....

i.e. this is the model the block loop device implements.

> Are there any other options available to avoid the above consistency
> issue (without much degradation in performance)?

There is little anyone can do to reduce the latency of individual IO
completion to stable storage - single threaded, synchronous data
integrity IO is always going to have a significant completion latency
penalty. To mitigate this the storage stack and/or applications need
to be architected to work in a way that isn't directly IO latency
sensitive.

As I said at the start, study the block loop device architecture and
pay attention to how it implements REQ_PREFLUSH and the AIO+DIO
backing file IO submission and completion signalling.

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
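P.S. a minimal sketch of what "amortise fdatasync across multiple
IOs" means at the syscall level - the file name, 4096 byte block size
and batch of 16 writes are assumptions for illustration only:

#define _GNU_SOURCE             /* for O_DIRECT */
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLKSZ   4096            /* assumed logical block size */
#define NWRITES 16              /* assumed batch size */

int main(void)
{
        void *buf;
        off_t off;
        int fd, i;

        /* No O_DSYNC: the individual writes are not data integrity
         * writes, so they complete at device write latency. */
        fd = open("backing-file", O_WRONLY | O_DIRECT);
        if (fd < 0)
                err(1, "open");

        if (posix_memalign(&buf, BLKSZ, BLKSZ))
                err(1, "posix_memalign");
        memset(buf, 0, BLKSZ);

        for (i = 0; i < NWRITES; i++) {
                off = (off_t)i * BLKSZ;
                if (pwrite(fd, buf, BLKSZ, off) != BLKSZ)
                        err(1, "pwrite");
        }

        /*
         * One fdatasync() covers the whole batch - effectively a
         * single REQ_PREFLUSH|REQ_FUA journal write for all the data
         * IOs above, rather than one per write as O_DSYNC would do.
         */
        if (fdatasync(fd) < 0)
                err(1, "fdatasync");

        free(buf);
        close(fd);
        return 0;
}

Nothing in the batch can be considered stable until that fdatasync()
returns, which is why the exported device then has to behave as one
with a volatile write cache and honour the flush requests it is sent.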