On Thu, Jan 23, 2025 at 7:14 PM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > Hi all,
> > > >
> > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > >
> > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > its wider implications at the time.
> > > >
> > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > >
> > > > This could be used by users - in the case of the prototype - an HSM service -
> > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > as the stricter fsfreeze() does.
> > > >
> > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > implement a crash consistent change tracking method [3] without the
> > > > need to use the heavy fsfreeze() hammer.
> > >
> > > How does this provide any guarantee at all? It doesn't order or
> > > wait for physical IOs in any way, so writeback can be active on a
> > > file and writing data from both sides of a syscall write "barrier".
> > > i.e. there is no coherency between what is on disk, the cmtime of
> > > the inode and the write barrier itself.
> > >
> > > Freeze is an actual physical write barrier. A very heavy handed
> > > physical write barrier, yes, but it has very well defined and
> > > bounded physical data persistence semantics.
> >
> > Yes. Freeze is a "write barrier to persistent storage".
> > This is not what "vfs write barrier" is about.
> > I will try to explain better.
> >
> > Some syscalls modify the data/metadata of filesystem objects in memory
> > (a.k.a. "in-core") and some syscalls query in-core data/metadata
> > of filesystem objects.
> >
> > It is often the case that in-core data/metadata readers are not fully
> > synchronized with in-core data/metadata writers, and in-core data and
> > metadata are often not modified atomically w.r.t. the in-core
> > data/metadata readers.
> > Even related metadata attributes are often not modified atomically
> > w.r.t. their readers (e.g. statx()).
> >
> > When it comes to "observing changes", multigrain ctime/mtime has
> > improved things a lot for observing a change in ctime/mtime since
> > last sampled and for observing an order of ctime/mtime changes
> > on different inodes, but it hasn't changed the fact that ctime/mtime
> > changes can be observed *before* the respective metadata/data
> > changes can be observed.
> >
> > An example problem is that a naive backup or indexing program can
> > read old data/metadata with new timestamp T and wrongly conclude
> > that it read all changes up to time T.
> >
> > It is true that "real" backup programs know that applications and
> > filesystems need to be quiesced before backup, but actual
> > day to day cloud storage sync programs and indexers cannot
> > practically freeze the filesystem for their work.
> >
>
> Right. That is still a known problem. For directory operations, the
> i_rwsem keeps things consistent, but for regular files, it's possible
> to see new timestamps alongside old file contents. That's a
> problem since caching algorithms that watch for timestamp changes can
> end up not seeing the new contents until the _next_ change occurs,
> which might not ever happen.
>
> It would be better to change the file write code to update the
> timestamps after copying data to the pagecache.
> It would still be possible in that case to see old attributes + new
> contents, but that's preferable to the reverse for callers that are
> watching for changes to attributes.
>

Yes, I remember this was discussed.
I think it may make sense to update before and after copying data to
the page cache?

> Would fixing that help your use-case at all?
>

I don't think it would, because my use case is not about querying
the change status of a single inode.
If a post-change timestamp update helps, I don't see how.

Thanks,
Amir.
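
For illustration only, here is a toy userspace model of the stale-read
example quoted above (new timestamp T observed with old data) and of what
waiting for in-flight writes would change for a scanner. It is a sketch
under stated assumptions, not kernel code and not the prototype's
mechanism: Inode, write_steps, naive_scan and barrier_scan are all
hypothetical names, and the "wait" is simulated in a single thread by
running the pending writer to completion.

```python
# Toy single-threaded model; every name here is hypothetical.

class Inode:
    """In-core state of one file: a timestamp and its contents."""

    def __init__(self):
        self.mtime = 0
        self.data = "old"


def write_steps(inode, new_mtime, new_data):
    """Model a write syscall as two separately observable steps.

    The ordering matches the one discussed in the thread: the timestamp
    is updated first and the data is copied afterwards.
    """
    inode.mtime = new_mtime
    yield "mtime updated"
    inode.data = new_data
    yield "data copied"


def naive_scan(inode):
    """Sample timestamp and contents with no synchronisation at all."""
    return inode.mtime, inode.data


def barrier_scan(inode, in_flight):
    """Wait for the writes that were already in flight, then sample.

    A real waiter would block; in this model the wait is simulated by
    driving each pending writer to completion. Writes that start later
    are not waited for.
    """
    for writer in list(in_flight):
        for _ in writer:
            pass
    return inode.mtime, inode.data


inode = Inode()
writer = write_steps(inode, new_mtime=5, new_data="new")
next(writer)  # the writer has bumped mtime but not yet copied the data

naive_result = naive_scan(inode)                 # new mtime, old data
barrier_result = barrier_scan(inode, [writer])   # new mtime, new data
```

The naive scan is the case where a program wrongly concludes that it has
read all changes up to time T; the second scan only samples once the
write that was in flight at that point has finished.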