On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
>
> That's a good question. A bit hard to explain but I will try.
>
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
>
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem, prior to the modification, and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

Uh, ok. I've read the docco and I think I understand what the
prototype you've pointed me at is doing.

It is using a separate chunk of the filesystem as a database to
persist change records for data in the filesystem. It is doing this
by creating an empty(?) file per change record in a per time period
(T) directory instance. i.e.

write() -> pre-modify -> fanotify -> userspace HSM ->
	create file in dir T named "<filehandle-other-stuff>"

And then you're relying on the filesystem to make that directory
entry T/<filehandle-other-stuff> stable before the data the
pre-modify record was generated for ever gets written.

IOWs, you're specifically relying on *all unrelated metadata changes
in the filesystem* having strict global ordering *and* being
persisted before any data written after the metadata was created is
persisted.
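For illustration only, that record-creation step might look something
like the userspace sketch below. The function name, the hex-handle
naming and the directory layout are my guesses at the shape of the
mechanism, not the prototype's actual code; the point is simply that
each change record is nothing but a new directory entry.

```python
import os

def record_change(period_dir: str, file_handle: bytes) -> str:
    """Persist one change record as an empty file T/<handle-hex>.

    Hypothetical sketch: the record carries no data; its directory
    entry alone marks "this file was modified during period T".
    """
    os.makedirs(period_dir, exist_ok=True)
    record = os.path.join(period_dir, file_handle.hex())
    # Create (or reuse) the empty marker file; the directory entry
    # itself is the change record the HSM relies on.
    fd = os.open(record, os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)
    return record
```

Nothing here orders the record against the data write; that ordering
is exactly what the scheme is delegating to the filesystem's metadata
journalling behaviour.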
Sure, this might work right now on XFS because the journalling
implementation -currently- provides global metadata ordering and
data/metadata ordering based on IO completion to submission
ordering. However, we do not guarantee that XFS will -always- have
this behaviour. This is an *implementation detail*, not a guaranteed
behaviour we will preserve for all time. i.e. we reserve the right
to change how we do unrelated metadata and data/metadata ordering
internally.

This reminds me of how applications observed that ext3 ordered mode
didn't require fsync to guarantee the data got written before the
metadata, so they elided the fsync() because it was really expensive
on ext3. i.e. they started relying on a specific filesystem
implementation detail for "correct crash consistency behaviour",
without understanding that it -only worked on ext3- and broke crash
consistency behaviour on all other filesystems.

That was *bad*, and it took a long time to get the message across
that applications *must* use fsync() for correct crash consistency
behaviour...

What you are describing for your prototype HSM to provide crash
consistent change tracking really seems to me like it is reliant on
the side effects of specific filesystem implementation choices, not
a behaviour that all filesystems guarantee. i.e. not all filesystems
provide strict global metadata ordering semantics, and some fs
maintainers are on record explicitly stating that they will not
provide or guarantee them. e.g. ext4, especially with fast commits
enabled, will not provide global strictly ordered metadata
semantics. btrfs doesn't provide such a guarantee, either.

> I would love to discuss the merits and pitfalls of this method, but
> the main thing I wanted to get feedback on is whether anyone finds
> the described vfs API useful for anything other than the change
> tracking system that I described.
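For contrast, the only ordering that every filesystem guarantees is
the one the application constructs itself with fsync(). A hedged
sketch of what "record durable before data" has to look like
portably (hypothetical helper, not the prototype's code; note it
fsyncs both the record file and its containing directory):

```python
import os

def record_then_modify(record_path: str, data_path: str,
                       data: bytes) -> None:
    """Make the change record durable *before* the tracked data write,
    using only ordering that all filesystems guarantee."""
    fd = os.open(record_path, os.O_CREAT | os.O_WRONLY, 0o600)
    os.fsync(fd)   # record contents (here, nothing) durable first...
    os.close(fd)
    dfd = os.open(os.path.dirname(record_path) or ".", os.O_RDONLY)
    os.fsync(dfd)  # ...and the directory entry pointing at it
    os.close(dfd)
    with open(data_path, "wb") as f:
        f.write(data)  # only now perform the tracked modification
```

This is what "applications *must* use fsync()" means in practice, and
it is exactly the per-change cost the prototype is trying to avoid by
leaning on implementation-specific journal ordering instead.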
If my understanding is correct, then this HSM prototype change
tracking mechanism seems like a fragile, unsupportable architecture.
I don't think we should be trying to add new VFS infrastructure to
make it work, because I think the underlying behaviours it requires
from filesystems are simply not guaranteed to exist.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx