Re: [LSF/MM/BPF TOPIC] vfs write barriers

On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > 
> > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > Hi all,
> > > 
> > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > 
> > > The historical records state that I had mentioned the idea briefly at the end of
> > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > its wider implications at the time.
> > > 
> > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > 
> > > This could be used by a user - in the case of the prototype, an HSM service -
> > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > the way the stricter fsfreeze() does.
> > > 
> > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > implement a crash consistent change tracking method [3] without the
> > > need to use the heavy fsfreeze() hammer.
> > 
> > How does this provide any guarantee at all? It doesn't order or
> > wait for physical IOs in any way, so writeback can be active on a
> > file and writing data from both sides of a syscall write "barrier".
> > i.e. there is no coherency between what is on disk, the cmtime of
> > the inode and the write barrier itself.
> > 
> > Freeze is an actual physical write barrier. A very heavy-handed
> > physical write barrier, yes, but it has very well-defined and
> > bounded physical data persistence semantics.
> 
> Yes. Freeze is a "write barrier to persistent storage".
> This is not what "vfs write barrier" is about.
> I will try to explain better.
> 
> Some syscalls modify the data/metadata of filesystem objects in memory
> (a.k.a. "in-core") and some syscalls query the in-core data/metadata
> of filesystem objects.
> 
> It is often the case that in-core data/metadata readers are not fully
> synchronized with in-core data/metadata writers, and in-core data and
> metadata are often not modified atomically w.r.t. the in-core
> data/metadata readers.
> Even related metadata attributes are often not modified atomically
> w.r.t. their readers (e.g. statx()).
> 
> When it comes to "observing changes", multigrain ctime/mtime has
> improved things a lot for observing a change in ctime/mtime since it
> was last sampled and for observing the order of ctime/mtime changes
> on different inodes, but it hasn't changed the fact that ctime/mtime
> changes can be observed *before* the respective metadata/data
> changes can be observed.
> 
> An example problem is that a naive backup or indexing program can
> read old data/metadata with new timestamp T and wrongly conclude
> that it read all changes up to time T.
> 
> It is true that "real" backup programs know that applications and
> the filesystem need to be quiesced before a backup, but everyday
> cloud storage sync programs and indexers cannot practically freeze
> the filesystem for their work.
> 

Right. That is still a known problem. For directory operations, the
i_rwsem keeps things consistent, but for regular files, it's possible
to see new timestamps alongside old file contents. That's a
problem since caching algorithms that watch for timestamp changes can
end up not seeing the new contents until the _next_ change occurs,
which might never happen.

It would be better to change the file write code to update the
timestamps after copying data to the pagecache. It would still be
possible in that case to see old attributes + new contents, but that's
preferable to the reverse for callers that are watching for changes to
attributes.
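
Roughly, the idea would be to reorder the buffered write path so that
the timestamps are only bumped once the data has landed in the
pagecache. A purely illustrative sketch against a recent tree (not a
tested patch; error handling, locking and the O_DIRECT path are
ignored):

#include <linux/fs.h>
#include <linux/uio.h>

static ssize_t example_buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct file *file = iocb->ki_filp;
	ssize_t written;

	/* Copy the new data into the pagecache first ... */
	written = generic_perform_write(iocb, from);
	if (written <= 0)
		return written;

	/*
	 * ... and only then bump ctime/mtime. An observer that sees the
	 * new timestamp is then guaranteed to also see the new data;
	 * old attributes + new contents is still possible, but not the
	 * reverse.
	 */
	file_update_time(file);

	return written;
}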

Would fixing that help your use-case at all?

> For the HSM prototype, we track changes to a filesystem during
> a given time period by handling pre-modify vfs events and recording
> the file handles of changed objects.
> 
> sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> for in-flight syscalls that may still be modifying user-visible in-core
> data/metadata, without blocking new syscalls.
> 
> The method described in the HSM prototype [3] uses this API
> to persist the state that all the changes until time T were "observed".
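
If I'm reading that right, the core of the mechanism is roughly the
pattern below. This is only a sketch of what is being described; the
per-sb srcu_struct field ("s_write_srcu") and the function names are
made up for illustration and are not the prototype's actual code:

#include <linux/fs.h>
#include <linux/srcu.h>

/*
 * "s_write_srcu" is a hypothetical per-sb srcu_struct; mainline
 * struct super_block has no such member.
 */
static int example_want_write(struct super_block *sb)
{
	/*
	 * Write syscalls enter the srcu read side for the scope of
	 * {mnt,file}_{want,drop}_write(); they are never blocked here.
	 */
	return srcu_read_lock(&sb->s_write_srcu);
}

static void example_drop_write(struct super_block *sb, int idx)
{
	srcu_read_unlock(&sb->s_write_srcu, idx);
}

static void example_sb_write_barrier(struct super_block *sb)
{
	/*
	 * Wait only for writers that entered the read side before this
	 * point; writers arriving later proceed without blocking.
	 */
	synchronize_srcu(&sb->s_write_srcu);
}

I.e. the barrier is just an SRCU grace period over in-flight write
syscalls; it makes no claim about what has reached the disk.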
> 
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
> 
> That's a good question. It is a bit hard to explain, but I will try.
> 
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
> 
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem prior to the modification, and
> those metadata records are strictly ordered by the filesystem before
> the actual change.
> 
> The vfs write barrier makes it possible to partition the change tracking
> records into overlapping time periods in a way that allows the *consumer*
> of the changes to consume them in a "crash consistent" manner,
> because:
> 
> 1. All the in-core changes recorded before the barrier are fully
>     observable after the barrier
> 2. All the in-core changes that started after the barrier will be recorded
>     for a future change query
> 
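
To make sure I follow, the consumer-side flow would then be something
like the sketch below. Only sb_write_barrier() comes from the proposal
(its signature is assumed from the description above); the epoch
helpers are hypothetical and gloss over where the kernel/userspace
boundary sits:

#include <linux/fs.h>

/* Assumed from the proposal: the (internal so far) vfs API. */
void sb_write_barrier(struct super_block *sb);

/* Hypothetical helpers for this example only. */
static void example_start_new_epoch(struct super_block *sb);
static void example_consume_previous_epoch(struct super_block *sb);

static void example_consume_changes(struct super_block *sb)
{
	/* Start recording change events into a new epoch ... */
	example_start_new_epoch(sb);

	/*
	 * ... then wait for write syscalls that were already in flight
	 * and may still be producing records in the previous epoch.
	 * New writers are not blocked; they simply land in the new epoch.
	 */
	sb_write_barrier(sb);

	/*
	 * Per points 1 and 2 above, everything recorded in the previous
	 * epoch is now fully observable, so it can be consumed (backed
	 * up, indexed, ...) without missing changes that raced with the
	 * epoch switch.
	 */
	example_consume_previous_epoch(sb);
}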
> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>




