Re: [LSF/MM/BPF TOPIC] vfs write barriers

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 12 Feb 2025 08:12:03 +1100

On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> >
> > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > This proposed write barrier does not seem capable of providing any
> > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > consistent change tracking" when there is no relationship between
> > > > the data/metadata in memory and data/metadata on disk...
> > >
> > > That's a good question. A bit hard to explain but I will try.
> > >
> > > The short answer is that the vfs write barrier does *not* by itself
> > > provide the guarantee for "crash consistent change tracking".
> > >
> > > In the prototype, the "crash consistent change tracking" guarantee
> > > > is provided by the fact that the change records are recorded as
> > > > metadata in the same filesystem, prior to the modification and
> > > those metadata records are strictly ordered by the filesystem before
> > > the actual change.
> >
> > Uh, ok.
> >
> > I've read the docco and I think I understand what the prototype
> > you've pointed me at is doing.
> >
> > It is using a separate chunk of the filesystem as a database to
> > persist change records for data in the filesystem. It is doing this
> > by creating an empty(?) file per change record in a per time
> > period (T) directory instance.
> >
> > i.e.
> >
> > write()
> >  -> pre-modify
> >   -> fanotify
> >    -> userspace HSM
> >     -> create file in dir T named "<filehandle-other-stuff>"
> >
> > And then you're relying on the filesystem to make that directory
> > entry T/<filehandle-other-stuff> stable before the data the
> > pre-modify record was generated for ever gets written.
> >
> 
> Yes.
> 
> > IOWs, you're specifically relying on *all unrelated metadata changes
> > in the filesystem* having strict global ordering *and* being
> > persisted before any data written after the metadata was created
> > is persisted.
> >
> > Sure, this might work right now on XFS because the journalling
> > implementation -currently- provides global metadata ordering and
> > data/metadata ordering based on IO completion to submission
> > ordering.
> >
> 
> Yes.

[....]

> > > I would love to discuss the merits and pitfalls of this method, but the
> > > main thing I wanted to get feedback on is whether anyone finds the
> > > described vfs API useful for anything other than the change tracking
> > > system that I described.
> >
> > If my understanding is correct, then this HSM prototype change
> > tracking mechanism seems like a fragile, unsupportable architecture.
> > I don't think we should be trying to add new VFS infrastructure to
> > make it work, because I think the underlying behaviours it requires
> > from filesystems are simply not guaranteed to exist.
> >
> 
> That's a valid opinion.
> 
> Do you have an idea for a better design for fs agnostic change tracking?

Store your HSM metadata in a database on a different storage device
and only signal the pre-modification notification as complete once
the database has completed its update transaction.
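
As a rough sketch of that ordering, assuming SQLite as the database and
Python as the HSM daemon's language (neither is in the prototype;
open_change_db(), record_pre_modify() and the "changes" table are made-up
names for illustration), the handler only reports success after the
commit has returned:

```python
import sqlite3


def open_change_db(path):
    """Open (or create) the change-record database.

    ASSUMPTION: `path` lives on a different storage device from the
    filesystem being tracked. Nothing in this sketch checks that.
    """
    db = sqlite3.connect(path)
    # Ask SQLite to sync to stable storage on every commit.
    db.execute("PRAGMA synchronous = FULL")
    db.execute(
        "CREATE TABLE IF NOT EXISTS changes ("
        " handle TEXT NOT NULL,"
        " period INTEGER NOT NULL,"
        " PRIMARY KEY (handle, period))"
    )
    db.commit()
    return db


def record_pre_modify(db, handle, period):
    """Persist a change record, then report that the write may proceed."""
    with db:  # commits on success, rolls back on exception
        db.execute(
            "INSERT OR IGNORE INTO changes (handle, period) VALUES (?, ?)",
            (handle, period),
        )
    # Only reached after the transaction has committed.
    return True
```

How that return value gets turned into a response to the pre-modification
event is left out here. The point is only that the response is sent after
the commit, and that the database sits on storage whose persistence does
not depend on the filesystem being tracked.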

> I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> anyone would like that.

DMAPI pre-modification notifications didn't rely on side effects of
filesystem behaviour for correctness. The HSM had to guarantee that
its recording of events was stable before it allowed the
modification to be done. Lots of DMAPI modification notifications
used pre- and post- event notifications so the HSM could keep track
of modifications that were in flight at any given point in time.

That way the HSM recovery process knew after a crash which files it
needed to go look at to determine if the operation in progress had
completed or not once the system came back up....
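
To illustrate the idea (this is not DMAPI code; the EventLog class, the
log format and in_flight_after_crash() are hypothetical), an HSM could
append fsync'd PRE/POST records to its own log and, on recovery, treat
any handle with an unmatched PRE record as needing inspection:

```python
import os


class EventLog:
    """Append-only log of pre/post modification events (hypothetical format)."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)

    def _append(self, kind, handle):
        os.write(self.fd, f"{kind} {handle}\n".encode())
        # The record must be stable before the caller acts on it.
        os.fsync(self.fd)

    def pre_modify(self, handle):
        self._append("PRE", handle)

    def post_modify(self, handle):
        self._append("POST", handle)

    def close(self):
        os.close(self.fd)


def in_flight_after_crash(path):
    """Return handles whose modification started but was never completed."""
    pending = {}
    with open(path) as f:
        for line in f:
            if not line.endswith("\n"):
                continue  # torn final record: it was never fsync'd
            kind, _, handle = line.rstrip("\n").partition(" ")
            if kind == "PRE":
                pending[handle] = pending.get(handle, 0) + 1
            elif kind == "POST" and pending.get(handle, 0) > 0:
                pending[handle] -= 1
    return sorted(h for h, n in pending.items() if n > 0)
```

A real implementation would also need to make the log file's own
directory entry stable when it is first created; that is omitted here.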

> IMO The metadata ordering contract is a technical matter that could be fixed.
> 
> I still hold the opinion that the in-core changes order w.r.t readers is a
> problem regardless of persistence to disk, but I may need to come up with
> more compelling use cases to demonstrate this problem.

IIRC, the XFS DMAPI implementation solved that problem by blocking
read notifications whilst there was a pending modification
notification outstanding. The problem with the Linux DMAPI
implementation of this (one of the show stoppers that prevented
merge) was that it held a rwsem across syscall contexts to provide
this functionality.....
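
For illustration only, here is a userspace analogue of that gating in
Python. It is not how the XFS DMAPI code was written, and
ModificationGate and its method names are invented. It counts outstanding
modification notifications and makes read notifications wait until the
count drops to zero, without holding a lock between the begin and end
calls:

```python
import threading


class ModificationGate:
    """Block read notifications while modifications are outstanding."""

    def __init__(self):
        self._cond = threading.Condition()
        self._pending = 0  # modification notifications not yet completed

    def modification_begin(self):
        with self._cond:
            self._pending += 1

    def modification_end(self):
        with self._cond:
            self._pending -= 1
            if self._pending == 0:
                self._cond.notify_all()  # wake any blocked readers

    def wait_for_read(self, timeout=None):
        """Return True once no modification is pending, False on timeout."""
        with self._cond:
            return self._cond.wait_for(lambda: self._pending == 0, timeout)
```

The internal lock is only held for the duration of each call, so nothing
is held across the begin/end pair; the outstanding state is just a
counter.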

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx