Re: [LSF/MM/BPF TOPIC] vfs write barriers

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 12 Feb 2025 09:29:22 +0100

On Tue, Feb 11, 2025 at 10:12 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> > On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > This proposed write barrier does not seem capable of providing any
> > > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > > consistent change tracking" when there is no relationship between
> > > > > the data/metadata in memory and data/metadata on disk...
> > > >
> > > > That's a good question. A bit hard to explain but I will try.
> > > >
> > > > The short answer is that the vfs write barrier does *not* by itself
> > > > provide the guarantee for "crash consistent change tracking".
> > > >
> > > > In the prototype, the "crash consistent change tracking" guarantee
> > > > is provided by the fact that the change records are recorded as
> > > > metadata in the same filesystem, prior to the modification and
> > > > those metadata records are strictly ordered by the filesystem before
> > > > the actual change.
> > >
> > > Uh, ok.
> > >
> > > I've read the docco and I think I understand what the prototype
> > > you've pointed me at is doing.
> > >
> > > It is using a separate chunk of the filesystem as a database to
> > > persist change records for data in the filesystem. It is doing this
> > > by creating an empty(?) file per change record in a per time
> > > period (T) directory instance.
> > >
> > > i.e.
> > >
> > > write()
> > >  -> pre-modify
> > >   -> fanotify
> > >    -> userspace HSM
> > >     -> create file in dir T named "<filehandle-other-stuff>"
> > >
> > > And then you're relying on the filesystem to make that directory
> > > entry T/<filehandle-other-stuff> stable before the data the
> > > pre-modify record was generated for ever gets written.
> > >
> >
> > Yes.
> >
> > > IOWs, you're specifically relying on *all unrelated metadata changes
> > > in the filesystem* having strict global ordering *and* being
> > > persisted before any data written after the metadata was created
> > > is persisted.
> > >
> > > Sure, this might work right now on XFS because the journalling
> > > implementation -currently- provides global metadata ordering and
> > > data/metadata ordering based on IO completion to submission
> > > ordering.
> > >
> >
> > Yes.
>
> [....]
>
> > > > I would love to discuss the merits and pitfalls of this method, but the
> > > > main thing I wanted to get feedback on is whether anyone finds the
> > > > described vfs API useful for anything other than the change tracking
> > > > system that I described.
> > >
> > > If my understanding is correct, then this HSM prototype change
> > > tracking mechanism seems like a fragile, unsupportable architecture.
> > > I don't think we should be trying to add new VFS infrastructure to
> > > make it work, because I think the underlying behaviours it requires
> > > from filesystems are simply not guaranteed to exist.
> > >
> >
> > That's a valid opinion.
> >
> > Do you have an idea for a better design for fs agnostic change tracking?
>
> Store your HSM metadata in a database on a different storage device
> and only signal the pre-modification notification as complete once
> the database has completed its update transaction.
>

Yes, naturally.
This was exactly my point in saying that on-disk persistence
is completely orthogonal to the purpose for which the
sb_write_barrier() API is being proposed.

> > I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> > anyone would like that.
>
> DMAPI pre-modification notifications didn't rely on side effects of
> filesystem behaviour for correctness.

Neither does fanotify.
My HSM prototype is relying on some XFS side effects.
A production HSM using the same fanotify API could store
changes in a db on another fs or on persistent memory.
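
To make the "record first, ack second" rule concrete, here is a minimal
sketch, assuming an SQLite file that lives on a different filesystem or
device than the one being tracked. The table layout and the names
open_changes_db(), record_intent() and pending() are hypothetical, and the
fanotify event handling itself is left out; the only point shown is that
the pre-modify event is acked after the db transaction has committed.

```python
import sqlite3


def open_changes_db(path):
    """Open (or create) the changes db; path is assumed to be on another fs."""
    db = sqlite3.connect(path)
    # Ask SQLite to sync to stable storage on every commit.
    db.execute("PRAGMA synchronous=FULL")
    db.execute(
        "CREATE TABLE IF NOT EXISTS intents ("
        " period INTEGER NOT NULL,"
        " handle TEXT NOT NULL,"
        " PRIMARY KEY (period, handle))"
    )
    db.commit()
    return db


def record_intent(db, period, handle):
    """Record an intent to change; return True only once it is committed.

    The caller would ack the pre-modify event after this returns.
    """
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            "INSERT OR IGNORE INTO intents (period, handle) VALUES (?, ?)",
            (period, handle),
        )
    return True


def pending(db, period):
    """List the file handles with a recorded intent in the given period."""
    rows = db.execute(
        "SELECT handle FROM intents WHERE period = ? ORDER BY handle",
        (period,),
    )
    return [row[0] for row in rows]
```

Recording the same handle twice in one period is a no-op here, which
mirrors the one-record-per-file-per-period idea of the prototype.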

> The HSM had to guarantee that
> its recording of events was stable before it allowed the
> modification to be done.

No change in methodology here.

> Lots of dmapi modification notifications
> used pre- and post- event notifications so the HSM could keep track
> of modifications that were in flight at any given point in time.
>

OK, now we are talking about the relevant point.
Persistently "recording" an intent to change on pre- is fine.
"Notifying" the application that the change has been done in pre- is racy,
because the application may wrongly believe that it has already
consumed the notified/recorded change.

Complementing every single pre- event with a matching post-
event is one possible solution, and I think Jan and I discussed it as well.
sb_write_barrier() is a much easier API for HSM, because HSM
rarely needs to consume a single change. It is much more likely
to consume a large batch of changes, so the sb_write_barrier() API
is a much more efficient way of getting the same guarantee that
"All the changes recorded with pre- events are observable".

> That way the HSM recovery process knew after a crash which files it
> needed to go look at to determine if the operation in progress had
> completed or not once the system came back up....
>

Yes, exactly what we need and what sb_write_barrier() helps to achieve.

> > IMO The metadata ordering contract is a technical matter that could be fixed.
> >
> > I still hold the opinion that the in-core changes order w.r.t readers
> > is a problem regardless of persistence to disk, but I may need to come
> > up with more compelling use cases to demonstrate this problem.
>
> IIRC, the XFS DMAPI implementation solved that problem by blocking
> read notifications whilst there was a pending modification
> notification outstanding. The problem with the Linux DMAPI
> implementation of this (one of the show stoppers that prevented
> merge) was that it held a rwsem across syscall contexts to provide
> this functionality.....
>

sb_write_barrier() allows HSM to achieve the same end result without
holding an rwsem across syscall contexts.
It's literally SRCU instead of the DMAPI rwsem. No more, no less:

sb_start_write_srcu() --> notify change intent --> HSM record to changes db
               <-- ack change intent recorded <--
...
make in-core changes
...
               <-- wait for changes in-flight <-- sb_write_barrier()
sb_end_write_srcu() --> ack changes in-flight -->
                 <-- persist recorded changes <-- syncfs()
persist in-core changes
                      --> ack persist changes --> HSM notify change consumers
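
For anyone who wants to play with the ordering in the flow above, here is
a toy userspace model of the in-flight tracking. It is only a sketch with
two epoch counters, in the spirit of SRCU. It is not the kernel
implementation, and the class and method names are made up for
illustration. barrier() waits only for writers that started before it was
called; writers that start later join the other epoch and do not delay it.

```python
import threading


class WriteBarrier:
    """Toy model of SRCU-style in-flight write tracking (not kernel code)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._barrier_lock = threading.Lock()  # serializes barrier() callers
        self._epoch = 0           # epoch index that new writers join
        self._inflight = [0, 0]   # number of in-flight writers per epoch

    def start_write(self):
        """Models sb_start_write_srcu(): join the current epoch."""
        with self._cond:
            idx = self._epoch
            self._inflight[idx] += 1
            return idx

    def end_write(self, idx):
        """Models sb_end_write_srcu(): leave the epoch joined earlier."""
        with self._cond:
            self._inflight[idx] -= 1
            if self._inflight[idx] == 0:
                self._cond.notify_all()

    def barrier(self):
        """Models sb_write_barrier(): wait for writers already in flight."""
        with self._barrier_lock:
            with self._cond:
                old = self._epoch
                self._epoch ^= 1  # writers starting from now use the other epoch
                # The other epoch was drained by the previous barrier() call,
                # so waiting for the old epoch covers every earlier writer.
                self._cond.wait_for(lambda: self._inflight[old] == 0)
```

In this model the HSM would call barrier(), then syncfs(), and only then
notify consumers about the batch of recorded changes. No lock is held
between start_write() and end_write(); the writer only carries the epoch
index it joined.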

Thanks,
Amir.