On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
>
> That's a good question. A bit hard to explain but I will try.
>
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
>
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem, prior to the modification, and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

Uh, ok. I've read the docco and I think I understand what the
prototype you've pointed me at is doing.

It is using a separate chunk of the filesystem as a database to
persist change records for data in the filesystem. It is doing this
by creating an empty(?) file per change record in a per time period
(T) directory instance. i.e.

write() -> pre-modify -> fanotify -> userspace HSM ->
	create file in dir T named "<filehandle-other-stuff>"

And then you're relying on the filesystem to make that directory
entry T/<filehandle-other-stuff> stable before the data the
pre-modify record was generated for ever gets written.

IOWs, you're specifically relying on *all unrelated metadata changes
in the filesystem* having strict global ordering *and* being
persisted before any data written after the metadata was created is
persisted.
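For illustration only, that record-creation step might look something
like the userspace sketch below. The function name, the hex-handle
naming and the directory layout are my guesses at the shape of the
mechanism, not the prototype's actual code; the point is simply that
each change record is nothing but a new directory entry.

```python
import os

def record_change(period_dir: str, file_handle: bytes) -> str:
    """Persist one change record as an empty file T/<handle-hex>.

    Hypothetical sketch: the record carries no data; its directory
    entry alone marks "this file was modified during period T".
    """
    os.makedirs(period_dir, exist_ok=True)
    record = os.path.join(period_dir, file_handle.hex())
    # Create (or reuse) the empty marker file; the directory entry
    # itself is the change record the HSM relies on.
    fd = os.open(record, os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)
    return record
```

Nothing here orders the record against the data write; that ordering
is exactly what the scheme is delegating to the filesystem's metadata
journalling behaviour.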
Sure, this might work right now on XFS because the journalling
implementation -currently- provides global metadata ordering and
data/metadata ordering based on IO completion to submission
ordering. However, we do not guarantee that XFS will -always- have
this behaviour. This is an *implementation detail*, not a guaranteed
behaviour we will preserve for all time. i.e. we reserve the right
to change how we do unrelated metadata and data/metadata ordering
internally.

This reminds me of how applications observed that ext3 ordered mode
didn't require fsync to guarantee the data got written before the
metadata, so they elided the fsync() because it was really expensive
on ext3. i.e. they started relying on a specific filesystem
implementation detail for "correct crash consistency behaviour",
without understanding that it -only worked on ext3- and broke crash
consistency behaviour on all other filesystems.

That was *bad*, and it took a long time to get the message across
that applications *must* use fsync() for correct crash consistency
behaviour...

What you are describing for your prototype HSM to provide crash
consistent change tracking really seems to me like it is reliant on
the side effects of specific filesystem implementation choices, not
a behaviour that all filesystems guarantee. i.e. not all filesystems
provide strict global metadata ordering semantics, and some fs
maintainers are on record explicitly stating that they will not
provide or guarantee them. e.g. ext4, especially with fast commits
enabled, will not provide global strictly ordered metadata
semantics. btrfs doesn't provide such a guarantee, either.

> I would love to discuss the merits and pitfalls of this method, but
> the main thing I wanted to get feedback on is whether anyone finds
> the described vfs API useful for anything other than the change
> tracking system that I described.
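For contrast, the only ordering that every filesystem guarantees is
the one the application constructs itself with fsync(). A hedged
sketch of what "record durable before data" has to look like
portably (hypothetical helper, not the prototype's code; note it
fsyncs both the record file and its containing directory):

```python
import os

def record_then_modify(record_path: str, data_path: str,
                       data: bytes) -> None:
    """Make the change record durable *before* the tracked data write,
    using only ordering that all filesystems guarantee."""
    fd = os.open(record_path, os.O_CREAT | os.O_WRONLY, 0o600)
    os.fsync(fd)   # record contents (here, nothing) durable first...
    os.close(fd)
    dfd = os.open(os.path.dirname(record_path) or ".", os.O_RDONLY)
    os.fsync(dfd)  # ...and the directory entry pointing at it
    os.close(dfd)
    with open(data_path, "wb") as f:
        f.write(data)  # only now perform the tracked modification
```

This is what "applications *must* use fsync()" means in practice, and
it is exactly the per-change cost the prototype is trying to avoid by
leaning on implementation-specific journal ordering instead.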
If my understanding is correct, then this HSM prototype change
tracking mechanism seems like a fragile, unsupportable architecture.
I don't think we should be trying to add new VFS infrastructure to
make it work, because I think the underlying behaviours it requires
from filesystems are simply not guaranteed to exist.

-Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx