On Thu 23-01-25 13:14:11, Jeff Layton wrote:
> On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > Hi all,
> > > >
> > > > I would like to present the idea of vfs write barriers that was
> > > > proposed by Jan and prototyped for the use of fanotify HSM change
> > > > tracking events [1].
> > > >
> > > > The historical records state that I had mentioned the idea briefly
> > > > at the end of my talk in LSFMM 2023 [2], but we did not really have
> > > > a lot of time to discuss its wider implications at the time.
> > > >
> > > > The vfs write barriers are implemented by taking a per-sb srcu
> > > > read side lock for the scope of {mnt,file}_{want,drop}_write().
> > > >
> > > > This could be used by users - in the case of the prototype, an HSM
> > > > service - to wait for all in-flight write syscalls, without
> > > > blocking new write syscalls as the stricter fsfreeze() does.
> > > >
> > > > This ability to wait for in-flight write syscalls is used by the
> > > > prototype to implement a crash-consistent change tracking
> > > > method [3] without the need to use the heavy fsfreeze() hammer.
> > >
> > > How does this provide any guarantee at all? It doesn't order or
> > > wait for physical IOs in any way, so writeback can be active on a
> > > file and writing data from both sides of a syscall write "barrier".
> > > i.e. there is no coherency between what is on disk, the cmtime of
> > > the inode and the write barrier itself.
> > >
> > > Freeze is an actual physical write barrier. A very heavy handed
> > > physical write barrier, yes, but it has very well defined and
> > > bounded physical data persistence semantics.
> >
> > Yes. Freeze is a "write barrier to persistent storage".
> > This is not what "vfs write barrier" is about.
> > I will try to explain better.
> >
> > Some syscalls modify the data/metadata of filesystem objects in
> > memory (a.k.a. "in-core") and some syscalls query in-core
> > data/metadata of filesystem objects.
> >
> > It is often the case that in-core data/metadata readers are not
> > fully synchronized with in-core data/metadata writers, and in-core
> > data and metadata are often not modified atomically w.r.t. the
> > in-core data/metadata readers.
> > Even related metadata attributes are often not modified atomically
> > w.r.t. their readers (e.g. statx()).
> >
> > When it comes to "observing changes", multigrain ctime/mtime has
> > improved things a lot for observing a change in ctime/mtime since
> > last sampled and for observing an order of ctime/mtime changes
> > on different inodes, but it hasn't changed the fact that ctime/mtime
> > changes can be observed *before* the respective metadata/data
> > changes can be observed.
> >
> > An example problem is that a naive backup or indexing program can
> > read old data/metadata with new timestamp T and wrongly conclude
> > that it has read all changes up to time T.
> >
> > It is true that "real" backup programs know that applications and
> > filesystems need to be quiesced before backup, but day-to-day cloud
> > storage sync programs and indexers cannot practically freeze the
> > filesystem for their work.
> >
>
> Right. That is still a known problem. For directory operations, the
> i_rwsem keeps things consistent, but for regular files, it's possible
> to see new timestamps alongside old file contents. That's a problem
> since caching algorithms that watch for timestamp changes can end up
> not seeing the new contents until the _next_ change occurs, which
> might not ever happen.
>
> It would be better to change the file write code to update the
> timestamps after copying data to the pagecache. It would still be
> possible in that case to see old attributes + new contents, but
> that's preferable to the reverse for callers that are watching for
> changes to attributes.
>
> Would fixing that help your use-case at all?
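
For context, my reading of the prototype is that the mechanism boils down
to roughly the shape below. This is only a sketch - the s_write_barrier_srcu
field and the helper names are made up here for illustration and are not
the actual prototype code:

#include <linux/fs.h>
#include <linux/srcu.h>

/*
 * Sketch only: assumes struct super_block grows a new member
 *	struct srcu_struct s_write_barrier_srcu;
 * All names here are illustrative, not the prototype's actual API.
 */

/* Write side: entered/exited in the scope of {mnt,file}_{want,drop}_write(). */
static inline int sb_enter_write_section(struct super_block *sb)
{
	return srcu_read_lock(&sb->s_write_barrier_srcu);
}

static inline void sb_exit_write_section(struct super_block *sb, int idx)
{
	srcu_read_unlock(&sb->s_write_barrier_srcu, idx);
}

/*
 * Barrier side, e.g. on behalf of an HSM service: returns once every
 * write syscall that was already inside its write section when we were
 * called has left it.  New write syscalls are never blocked, unlike
 * with fsfreeze().
 */
void vfs_write_barrier(struct super_block *sb)
{
	synchronize_srcu(&sb->s_write_barrier_srcu);
}

The point of using SRCU is exactly that synchronize_srcu() waits only for
read-side sections (write syscalls, in this usage) that started before it
was called, which is what gives the "wait for in-flight writes without
freezing new ones" semantics described above.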

I think Amir wanted to make here a point in the other direction: i.e., if
the application did:

* sample inode timestamp
* vfs_write_barrier()
* read file data

then it is *guaranteed* it will never see old data & new timestamp and
hence the caching problem is solved. No need to update the timestamp after
the write. Now I agree updating timestamps after write is much nicer from
a usability POV (given how common the pattern above is) but this is just a
simple example demonstrating possible uses for vfs_write_barrier(). A
rough userspace sketch of this consumer pattern is appended at the end of
this mail.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
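
For illustration, the consumer-side pattern above might look roughly like
this in userspace. Note that no interface actually exposes the barrier to
userspace today - request_write_barrier() below is a pure placeholder for
whatever ends up providing it (e.g. a future fanotify command), and the
rest is a minimal sketch, not a real backup program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Placeholder: ask the kernel to wait until all write syscalls that were
 * in flight on the given filesystem at the time of the call have finished.
 * No such syscall/ioctl exists today.
 */
extern int request_write_barrier(int mount_fd);

static int sync_one_file(int mount_fd, const char *path,
			 struct statx_timestamp *synced_up_to)
{
	struct statx stx;
	char buf[4096];
	ssize_t n;
	int fd;

	/* 1) sample the inode timestamp */
	if (statx(AT_FDCWD, path, 0, STATX_MTIME, &stx))
		return -1;

	/* 2) wait for any write that may still be modifying the data */
	if (request_write_barrier(mount_fd))
		return -1;

	/*
	 * 3) read the data: every write covered by the timestamp sampled
	 * in step 1 has completed by now, so we can never record a new
	 * timestamp together with old data.
	 */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	while ((n = read(fd, buf, sizeof(buf))) > 0)
		;	/* hand the data to the backup/indexer here */
	close(fd);
	if (n < 0)
		return -1;

	/* safe to claim "synced up to stx_mtime" for this file */
	*synced_up_to = stx.stx_mtime;
	return 0;
}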