Re: [LSF/MM/BPF TOPIC] vfs write barriers

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 23 Jan 2025 11:34:29 +1100

On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> For the HSM prototype, we track changes to a filesystem during
> a given time period by handling pre-modify vfs events and recording
> the file handles of changed objects.
> 
> sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> for in-flight syscalls that can be still modifying user visible in-core
> data/metadata, without blocking new syscalls.

Yes, I get this part. What I don't understand is how it is in any
way useful....
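
For reference, the semantics quoted above ("wait for in-flight
syscalls ... without blocking new syscalls") can be modelled in
userspace roughly as below. This is only a sketch under my own
assumptions, not the prototype's implementation: struct write_barrier
and the wb_*() helpers are hypothetical names, and it uses a plain
mutex plus two epoch counters where a kernel implementation would more
likely use something SRCU-like.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct write_barrier {
	pthread_mutex_t lock;		/* protects epoch and active[] */
	pthread_mutex_t barrier_lock;	/* serialises concurrent barriers */
	pthread_cond_t drained;		/* signalled when an epoch empties */
	int epoch;			/* 0 or 1: epoch new writers join */
	long active[2];			/* in-flight writers per epoch */
};

static void wb_init(struct write_barrier *wb)
{
	pthread_mutex_init(&wb->lock, NULL);
	pthread_mutex_init(&wb->barrier_lock, NULL);
	pthread_cond_init(&wb->drained, NULL);
	wb->epoch = 0;
	wb->active[0] = wb->active[1] = 0;
}

/* Entry to a modifying operation; returns the epoch that was joined. */
static int wb_write_enter(struct write_barrier *wb)
{
	int e;

	pthread_mutex_lock(&wb->lock);
	e = wb->epoch;
	wb->active[e]++;
	pthread_mutex_unlock(&wb->lock);
	return e;
}

/* Exit from a modifying operation; e is the value wb_write_enter() returned. */
static void wb_write_exit(struct write_barrier *wb, int e)
{
	pthread_mutex_lock(&wb->lock);
	assert(wb->active[e] > 0);
	if (--wb->active[e] == 0)
		pthread_cond_broadcast(&wb->drained);
	pthread_mutex_unlock(&wb->lock);
}

/*
 * Wait for every writer that entered before this call. Writers that
 * arrive while we wait join the other epoch, so they are neither
 * blocked nor waited for.
 */
static void wb_barrier(struct write_barrier *wb)
{
	int old;

	pthread_mutex_lock(&wb->barrier_lock);
	pthread_mutex_lock(&wb->lock);
	old = wb->epoch;
	wb->epoch = !old;		/* new writers now join the other epoch */
	while (wb->active[old] > 0)
		pthread_cond_wait(&wb->drained, &wb->lock);
	pthread_mutex_unlock(&wb->lock);
	pthread_mutex_unlock(&wb->barrier_lock);
}
```

Note that nothing in this model touches stable storage: it orders
in-memory operations only.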

> The method described in the HSM prototype [3] uses this API
> to persist the state that all the changes until time T were "observed".
> 
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
> 
> That's a good question. A bit hard to explain but I will try.
> 
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
> 
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem, prior to the modification and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

This doesn't make any sense to me - you seem to be making
assumptions that I know an awful lot about how your HSM prototype
works.

What's in a change record, when does it get written, what are its
persistence semantics, what filesystem metadata is it being written
to? How does this relate to the actual dirty data that is
resident in the page cache that hasn't been written to stable
storage yet? Is there another change record to say the data the
first change record tracks has been written to persistent storage?

> The vfs write barrier makes it possible to partition the change tracking records
> into overlapping time periods in a way that allows the *consumer* of
> the changes to consume the changes in a "crash consistent manner",
> because:

> 1. All the in-core changes recorded before the barrier are fully
>     observable after the barrier
> 2. All the in-core changes that started after the barrier, will be recorded
>     for the future change query
> 
> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

This seems like a very specialised niche use case right now, but I
still have no clear idea how the application using this proposed
write barrier actually works to achieve the functionality it is
claimed to provide...

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx