On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > This proposed write barrier does not seem capable of providing any
> > > sort of physical data or metadata/data write ordering guarantees, so
> > > I'm a bit lost in how it can be used to provide reliable "crash
> > > consistent change tracking" when there is no relationship between
> > > the data/metadata in memory and data/metadata on disk...
> >
> > That's a good question. A bit hard to explain but I will try.
> >
> > The short answer is that the vfs write barrier does *not* by itself
> > provide the guarantee for "crash consistent change tracking".
> >
> > In the prototype, the "crash consistent change tracking" guarantee
> > is provided by the fact that the change records are recorded as
> > metadata in the same filesystem, prior to the modification and
> > those metadata records are strictly ordered by the filesystem before
> > the actual change.
>
> Uh, ok.
>
> I've read the docco and I think I understand what the prototype
> you've pointed me at is doing.
>
> It is using a separate chunk of the filesystem as a database to
> persist change records for data in the filesystem. It is doing this
> by creating an empty(?) file per change record in a per time
> period (T) directory instance.
>
> i.e.
>
> write()
>  -> pre-modify
>    -> fanotify
>      -> userspace HSM
>        -> create file in dir T named "<filehandle-other-stuff>"
>
> And then you're relying on the filesystem to make that directory
> entry T/<filehandle-other-stuff> stable before the data the
> pre-modify record was generated for ever gets written.
>

Yes.

> IOWs, you're specifically relying on *all unrelated metadata changes
> in the filesystem* having strict global ordering *and* being
> persisted before any data written after the metadata was created
> is persisted.
>
> Sure, this might work right now on XFS because the journalling
> implementation -currently- provides global metadata ordering and
> data/metadata ordering based on IO completion to submission
> ordering.
>

Yes.

> However, we do not guarantee that XFS will -always- have this
> behaviour. This is an *implementation detail*, not a guaranteed
> behaviour we will preserve for all time. i.e. we reserve the right
> to change how we do unrelated metadata and data/metadata ordering
> internally.
>

Yes, that's why it's a prototype, but it's a userspace prototype.
The requirements from the kernel API would not change if the userspace
server used independent NVRAM to store the change records.

> This reminds me of how applications observed that ext3 ordered mode
> didn't require fsync to guarantee the data got written before the
> metadata, so they elided the fsync() because it was really expensive
> on ext3. i.e. they started relying on a specific filesystem
> implementation detail for "correct crash consistency behaviour",
> without understanding that it -only worked on ext3- and broke crash
> consistency behaviour on all other filesystems. That was *bad*, and
> it took a long time to get the message across that applications
> *must* use fsync() for correct crash consistency behaviour...

I am familiar with that episode.

>
> What you are describing for your prototype HSM to provide crash
> consistent change tracking really seems to me like it is reliant
> on the side effects of specific filesystem implementation choices,
> not a behaviour that all filesystems guarantee.
>
> i.e. not all filesystems provide strict global metadata ordering
> semantics, and some fs maintainers are on record explicitly stating
> that they will not provide or guarantee them. e.g. ext4, especially
> with fast commits enabled, will not provide global strictly ordered
> metadata semantics. btrfs also doesn't provide such a guarantee,
> either.
>

Right.
We once made a proposal to formalize this contract [1], but it's a bit
off topic.

> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other than the change tracking
> > system that I described.
>
> If my understanding is correct, then this HSM prototype change
> tracking mechanism seems like a fragile, unsupportable architecture.
> I don't think we should be trying to add new VFS infrastructure to
> make it work, because I think the underlying behaviours it requires
> from filesystems are simply not guaranteed to exist.
>

That's a valid opinion.

Do you have an idea for a better design for fs agnostic change tracking?

I mean, sure, we can re-implement DMAPI in specific filesystems, but I
don't think anyone would like that.

IMO the metadata ordering contract is a technical matter that could be fixed.

I still hold the opinion that the order of in-core changes w.r.t. readers
is a problem regardless of persistence to disk, but I may need to come up
with more compelling use cases to demonstrate this problem.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@xxxxxxxxxxxxxx/
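
For illustration only, here is a minimal userspace sketch of the "create
file in dir T" step discussed above, written so that it does not depend on
any filesystem-specific metadata ordering: the record file and its
containing directory are both fsync()ed before the function returns, so the
caller only permits the modification once the directory entry is on stable
storage. The function name record_change(), the directory layout and the
record name are hypothetical and are not taken from the prototype; the
obvious cost of this approach is a synchronous flush per change record.

```python
import os


def record_change(period_dir: str, record_name: str) -> str:
    """Create an empty change record in period_dir and make it durable.

    Hypothetical helper: names and layout are assumptions for illustration.
    """
    path = os.path.join(period_dir, record_name)

    # Create the empty record file; an already existing record is reused.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.fsync(fd)  # flush the new inode
    finally:
        os.close(fd)

    # fsync() on the file does not necessarily persist its directory entry,
    # so the containing directory is flushed explicitly as well.
    dir_fd = os.open(period_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)

    return path
```

A pre-modify handler would call something like record_change(t_dir, name)
and allow the write to proceed only after it returns.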