Re: [LSF/MM/BPF TOPIC] vfs write barriers

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 29 Jan 2025 02:39:56 +0100

On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > This proposed write barrier does not seem capable of providing any
> > > sort of physical data or metadata/data write ordering guarantees, so
> > > I'm a bit lost in how it can be used to provide reliable "crash
> > > consistent change tracking" when there is no relationship between
> > > the data/metadata in memory and data/metadata on disk...
> >
> > That's a good question. A bit hard to explain but I will try.
> >
> > The short answer is that the vfs write barrier does *not* by itself
> > provide the guarantee for "crash consistent change tracking".
> >
> > In the prototype, the "crash consistent change tracking" guarantee
> > is provided by the fact that the change records are recorded as
> > metadata in the same filesystem, prior to the modification and
> > those metadata records are strictly ordered by the filesystem before
> > the actual change.
>
> Uh, ok.
>
> I've read the docco and I think I understand what the prototype
> you've pointed me at is doing.
>
> It is using a separate chunk of the filesystem as a database to
> persist change records for data in the filesystem. It is doing this
> by creating an empty(?) file per change record in a per time
> period (T) directory instance.
>
> i.e.
>
> write()
>  -> pre-modify
>   -> fanotify
>    -> userspace HSM
>     -> create file in dir T named "<filehandle-other-stuff>"
>
> And then you're relying on the filesystem to make that directory
> entry T/<filehandle-other-stuff> stable before the data the
> pre-modify record was generated for ever gets written.
>

Yes.
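
To make that last step concrete, here is a minimal sketch of a userspace
server creating one change record. It is an illustration only, not code
from the prototype: the function name and the record name "fh-0001" are
hypothetical. The prototype relies on filesystem ordering for this step;
the explicit fsync() calls below show the variant that does not assume
any ordering between unrelated metadata and data.

```python
import os
import tempfile


def record_change_intent(period_dir: str, record_name: str) -> str:
    """Create an empty change record and persist it before returning.

    Hypothetical helper: period_dir stands in for the per-period
    directory T, record_name for "<filehandle-other-stuff>".
    """
    path = os.path.join(period_dir, record_name)

    # Create the (empty) record file and flush its inode to stable storage.
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

    # fsync the parent directory so the new directory entry is durable too.
    dir_fd = os.open(period_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return path


if __name__ == "__main__":
    # A temporary directory stands in for dir T in this sketch.
    with tempfile.TemporaryDirectory() as period_dir:
        print(record_change_intent(period_dir, "fh-0001"))
```

Only after this returns would the server let the pre-modify event
proceed, so the record is on disk before the data change can be.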

> IOWs, you're specifically relying on *all unrelated metadata changes
> in the filesystem* having strict global ordering *and* being
> persisted before any data written after the metadata was created
> is persisted.
>
> Sure, this might work right now on XFS because the journalling
> implementation -currently- provides global metadata ordering and
> data/metadata ordering based on IO completion to submission
> ordering.
>

Yes.

> However, we do not guarantee that XFS will -always- have this
> behaviour. This is an *implementation detail*, not a guaranteed
> behaviour we will preserve for all time. i.e. we reserve the right
> to change how we do unrelated metadata and data/metadata ordering
> internally.
>

Yes, that's why it's a prototype, but it's a userspace prototype.
The requirements from the kernel API would not change if the userspace
server used an independent NVRAM to store the change records.

> This reminds me of how applications observed that ext3 ordered mode
> didn't require fsync to guarantee the data got written before the
> metadata, so they elided the fsync() because it was really expensive
> on ext3. i.e. they started relying on a specific filesystem
> implementation detail for "correct crash consistency behaviour",
> without understanding that it -only worked on ext3- and gave broken
> crash consistency behaviour on all other filesystems. That was *bad*, and
> it took a long time to get the message across that applications
> *must* use fsync() for correct crash consistency behaviour...

I am familiar with that episode.
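
For readers who are not, the pattern at issue is the usual
write-new-file-then-rename update. A minimal sketch follows; the
function name and the ".tmp" suffix are made up for illustration. The
fsync() before the rename is the call that applications elided on ext3.

```python
import os


def replace_file_durably(path: str, data: bytes) -> None:
    """Replace path with data, syncing the data before the rename.

    Intended so that after a crash the file holds either the old or
    the new content, without relying on filesystem-specific ordering.
    """
    tmp_path = path + ".tmp"  # hypothetical temp-name convention

    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the kernel
        os.fsync(f.fileno())   # the step that must not be skipped

    # Atomically switch the name over to the fully written file.
    os.replace(tmp_path, path)

    # Persist the rename itself by syncing the containing directory.
    parent = os.path.dirname(os.path.abspath(path))
    dir_fd = os.open(parent, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```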

>
> What you are describing for your prototype HSM to provide crash
> consistent change tracking really seems to me like it is reliant
> on the side effects of specific filesystem implementation choices,
> not a behaviour that all filesystems guarantee.
>
> i.e. not all filesystems provide strict global metadata ordering
> semantics, and some fs maintainers are on record explicitly stating
> that they will not provide or guarantee them. e.g. ext4, especially
> with fast commits enabled, will not provide global strictly ordered
> metadata semantics. btrfs doesn't provide such a guarantee
> either.
>

Right. We did once make a proposal to formalize this contract [1],
but it's a bit off topic.

> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other than the change tracking
> > system that I described.
>
> If my understanding is correct, then this HSM prototype change
> tracking mechanism seems like a fragile, unsupportable architecture.
> I don't think we should be trying to add new VFS infrastructure to
> make it work, because I think the underlying behaviours it requires
> from filesystems are simply not guaranteed to exist.
>

That's a valid opinion.

Do you have an idea for a better design for fs agnostic change tracking?

I mean, sure, we can re-implement DMAPI in a specific fs, but I don't think
anyone would like that.

IMO the metadata ordering contract is a technical matter that could be fixed.

I still hold the opinion that the in-core changes order w.r.t readers
is a problem regardless of persistence to disk, but I may need to come
up with more compelling use cases to demonstrate this problem.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@xxxxxxxxxxxxxx/