On Thu, Jan 23, 2025 at 1:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > For the HSM prototype, we track changes to a filesystem during
> > a given time period by handling pre-modify vfs events and recording
> > the file handles of changed objects.
> >
> > sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> > for in-flight syscalls that can be still modifying user visible in-core
> > data/metadata, without blocking new syscalls.
>
> Yes, I get this part. What I don't understand is how it is in any
> way useful....
>
> > The method described in the HSM prototype [3] uses this API
> > to persist the state that all the changes until time T were "observed".
> >
> > > This proposed write barrier does not seem capable of providing any
> > > sort of physical data or metadata/data write ordering guarantees, so
> > > I'm a bit lost in how it can be used to provide reliable "crash
> > > consistent change tracking" when there is no relationship between
> > > the data/metadata in memory and data/metadata on disk...
> >
> > That's a good question. A bit hard to explain but I will try.
> >
> > The short answer is that the vfs write barrier does *not* by itself
> > provide the guarantee for "crash consistent change tracking".
> >
> > In the prototype, the "crash consistent change tracking" guarantee
> > is provided by the fact that the change records are recorded as
> > metadata in the same filesystem, prior to the modification, and
> > those metadata records are strictly ordered by the filesystem before
> > the actual change.
>
> This doesn't make any sense to me - you seem to be making
> assumptions that I know an awful lot about how your HSM prototype
> works.
> What's in a change record

The prototype creates a directory entry of this name:

  changed_dirs/$T/<directory file handle hex>

which gets created, if it does not already exist, before a change in a
directory or before a change to a file's data/metadata [*].

[*] For a non-dir, the change record is for ANY parent of the file.
    If the file is unlinked, we have no need to track changes.
    If the file is disconnected, it is up to the HSM to decide whether
    to block the change or to not record it.

> when does it get written,

From the handling of fanotify pre-modify events (not upstream yet),
*before* the change to in-core data/metadata.
The hooks are inside the {file,mnt}_want_write() wrappers, *before*
{file,sb}_start_write().

> what is its persistence semantics

The consumer (HSM service) is responsible for persisting change records
(e.g. by fsync of changed_dirs/$T/).

The only guarantee it expects from the filesystem is that the change
records (directory entries) are strictly ordered to storage before
data/metadata changes that are executed after writing the change record.

> what filesystem metadata is it being written to?

For the prototype it is a directory index, but that is an implementation
detail of this prototype.

> how does this relate to the actual dirty data that is
> resident in the page cache that hasn't been written to stable
> storage yet?

The relation is as follows:
- HSM starts recording change records under both changed_dirs/$T/
  and changed_dirs/$((T+1))/
- HSM calls sb_write_barrier() and syncfs()
- Then HSM stops recording changes in changed_dirs/$T/

So by the time changed_dirs/$T/ is "sealed", all the dirty data will be
either persistent in stable storage OR also recorded in
changed_dirs/$((T+1))/.

> Is there another change record to say the data the
> first change record tracks has been written to persistent storage?
>

Yes, I use a symlink to denote the "current" live change tracking
session, something like:

$ ln -sf $((T)) changed_dirs/current
...
$ ln -sf $((T+1)) changed_dirs/next
...
(write barrier etc)
$ sync -f changed_dirs   # seal current
$ mv changed_dirs/next changed_dirs/current

As you can see, I was trying to avoid tying the persistence semantics
to the kernel implementation of HSM.

As far as I can tell, the only thing I am missing from the kernel is
the vfs write barrier in order to take care of the rest in userspace.

Yes, there is this baby elephant in the room that "strictly ordered
metadata" is not in any contract, but I am willing to live with that
for now, for the benefit of a filesystem agnostic HSM implementation.

> > The vfs write barrier allows partitioning the change tracking records
> > into overlapping time periods in a way that allows the *consumer* of
> > the changes to consume the changes in a "crash consistent manner",
> > because:
> >
> > 1. All the in-core changes recorded before the barrier are fully
> >    observable after the barrier
> > 2. All the in-core changes that started after the barrier will be
> >    recorded for the future change query
> >
> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other than the change tracking
> > system that I described.
>
> This seems like a very specialised niche use case right now, but I
> still have no clear idea how the application using this proposed
> write barrier actually works to achieve the stated functionality
> this feature provides it with...
>

The problem that the vfs write barrier is trying to solve is the
problem of ordering between changing and observing in-core
data/metadata.
It seems like a problem that is more generic than my specialized niche,
but maybe it isn't.
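To make the sealing cycle above concrete, here is a toy shell sketch of
one userspace rollover, run from a scratch directory. It is only an
illustration under stated assumptions: the vfs write barrier has no
userspace API yet, so it appears only as a comment; T=1 is an arbitrary
example period; and "deadbeef" stands in for a real file handle in hex.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"       # scratch area for the sketch

T=1
mkdir -p "changed_dirs/$T" "changed_dirs/$((T+1))"
ln -sfn "$T" changed_dirs/current
ln -sfn "$((T+1))" changed_dirs/next

# Pre-modify events now record handles under BOTH periods;
# "deadbeef" is a hypothetical handle hex, not a real one.
: > "changed_dirs/$T/deadbeef"
: > "changed_dirs/$((T+1))/deadbeef"

# sb_write_barrier(sb) would go here (no userspace API yet):
# wait for in-flight syscalls that may still modify in-core state.
sync -f changed_dirs    # then persist: seals period $T

# -T replaces the "current" symlink itself instead of moving
# "next" into the directory it points to.
mv -T changed_dirs/next changed_dirs/current

# Period $T is now sealed and safe to consume.
ls "changed_dirs/$T"
```

Note the `mv -T`: without it, GNU mv would follow the existing
"current" symlink and move "next" *into* the old period directory.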
The consumer of change tracking will start observing (reading) the
data/metadata only after sealing the period $T records, so it avoids
the risk of observing old data/metadata in a directory recorded in
period $T without having another record in period $((T+1)).

The point of all this story is that the vfs write barrier is needed
even if there is no syncfs() at all and the application does not care
about persistence at all.

For example, for an application that syncs files to a replica storage,
without the write barrier, the change query for period $T can result in
reading non-updated data/metadata and reaching the incorrect conclusion
that *everything is in sync*.

Thanks,
Amir.