On Tue, Feb 11, 2025 at 10:12 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> > On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > >
> > > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > This proposed write barrier does not seem capable of providing any
> > > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > > consistent change tracking" when there is no relationship between
> > > > > the data/metadata in memory and data/metadata on disk...
> > > >
> > > > That's a good question. A bit hard to explain but I will try.
> > > >
> > > > The short answer is that the vfs write barrier does *not* by itself
> > > > provide the guarantee for "crash consistent change tracking".
> > > >
> > > > In the prototype, the "crash consistent change tracking" guarantee
> > > > is provided by the fact that the change records are recorded as
> > > > as metadata in the same filesystem, prior to the modification and
> > > > those metadata records are strictly ordered by the filesystem before
> > > > the actual change.
> > >
> > > Uh, ok.
> > >
> > > I've read the docco and I think I understand what the prototype
> > > you've pointed me at is doing.
> > >
> > > It is using a separate chunk of the filesystem as a database to
> > > persist change records for data in the filesystem. It is doing this
> > > by creating an empty(?) file per change record in a per time
> > > period (T) directory instance.
> > >
> > > i.e.
> > >
> > > write()
> > >  -> pre-modify
> > >   -> fanotify
> > >    -> userspace HSM
> > >     -> create file in dir T named "<filehandle-other-stuff>"
> > >
> > > And then you're relying on the filesystem to make that directory
> > > entry T/<filehandle-other-stuff> stable before the data the
> > > pre-modify record was generated for ever gets written.
> > >
> >
> > Yes.
> >
> > > IOWs, you've specifically relying on *all unrelated metadata changes
> > > in the filesystem* having strict global ordering *and* being
> > > persisted before any data written after the metadata was created
> > > is persisted.
> > >
> > > Sure, this might work right now on XFS because the journalling
> > > implementation -currently- provides global metadata ordering and
> > > data/metadata ordering based on IO completion to submission
> > > ordering.
> > >
> >
> > Yes.
>
> [....]
>
> > > > I would love to discuss the merits and pitfalls of this method, but the
> > > > main thing I wanted to get feedback on is whether anyone finds the
> > > > described vfs API useful for anything other that the change tracking
> > > > system that I described.
> > >
> > > If my understanding is correct, then this HSM prototype change
> > > tracking mechanism seems like a fragile, unsupportable architecture.
> > > I don't think we should be trying to add new VFS infrastructure to
> > > make it work, because I think the underlying behaviours it requires
> > > from filesystems are simply not guaranteed to exist.
> > >
> >
> > That's a valid opinion.
> >
> > Do you have an idea for a better design for fs agnostic change tracking?
>
> Store your HSM metadata in a database on a different storage device
> and only signal the pre-modification notification as complete once
> the database has completed it's update transaction.
>

Yes, naturally.
This was exactly my point in saying that on-disk persistence is
completely orthogonal to the purpose for which sb_write_barrier()
API is being proposed.
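
To make the "record first, ack second" ordering concrete, here is a
minimal userspace sketch. It is not code from the HSM prototype: the
function names, the table layout and the use of SQLite are all
assumptions made for illustration, and the database path is assumed to
live on a different storage device than the tracked filesystem.

```python
import sqlite3


def open_changes_db(path):
    """Open (or create) the change-intent database.

    Assumption: `path` is on separate storage from the tracked filesystem.
    """
    db = sqlite3.connect(path)
    # Ask SQLite to sync to stable storage on every commit.
    db.execute("PRAGMA synchronous = FULL")
    db.execute(
        "CREATE TABLE IF NOT EXISTS intents ("
        " file_handle TEXT NOT NULL,"
        " period INTEGER NOT NULL,"
        " PRIMARY KEY (file_handle, period))"
    )
    db.commit()
    return db


def handle_pre_modify(db, file_handle, period):
    """Hypothetical pre-modify handler: persist the intent, then ack.

    Returns True (the "ack") only after the transaction has committed,
    so the modification is never allowed before the record is stable.
    """
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            "INSERT OR IGNORE INTO intents VALUES (?, ?)",
            (file_handle, period),
        )
    return True
```

Recording the same file handle twice in one period is a no-op, which
matches the "one record per file per time period" idea described above.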

> > I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> > anyone would like that.
>
> DMAPI pre-modification notifications didn't rely on side effects of
> filesystem behaviour for correctness.

Neither does fanotify.
My HSM prototype is relying on some XFS side effects.
A production HSM using the same fanotify API could store changes in a db
on another fs or on persistent memory.

> The HSM had to guarantee that
> it's recording of events were stable before it allowed the
> modification to be done.

No change in methodology here.

> Lots of dmapi modification notifications
> used pre- and post- event notifications so the HSM could keep track
> of modifications that were in flight at any given point in time.
>

OK, now we are talking about the relevant point.
Persistently "recording" an intent to change on pre- is fine.
"Notifying" the application that the change has been done in pre- is
racy, because the application may wrongly believe that it has already
consumed the notified/recorded change.

Complementing every single pre- event with a matching post- event
is one possible solution and I think Jan and I discussed it as well.

sb_write_barrier() is a much easier API for HSM, because HSM rarely
needs to consume a single change; it is much more likely to consume
a large batch of changes. The sb_write_barrier() API is therefore a
much more efficient way of getting the same guarantee that
"All the changes recorded with pre- events are observable".

> That way the HSM recovery process knew after a crash which files it
> needed to go look at to determine if the operation in progress had
> completed or not once the system came back up....
>

Yes, exactly what we need and what sb_write_barrier() helps to achieve.

> > IMO The metadata ordering contract is a technical matter that could be fixed.
> >
> > I still hold the opinion that the in-core changes order w.r.t readers
> > is a problem regardless of persistence to disk, but I may need to come
> > up with more compelling use cases to demonstrate this problem.
>
> IIRC, the XFS DMAPI implementation solved that problem by blocking
> read notifications whilst there was a pending modification
> notification outstanding. The problem with the Linux DMAPI
> implementation of this (one of the show stoppers that prevented
> merge) was that it held a rwsem across syscall contexts to provide
> this functionality.....
>

sb_write_barrier() allows HSM to achieve the same end result without
holding a rwsem across syscall contexts.
It's literally SRCU instead of the DMAPI rwsem. Not more, not less:

sb_start_write_srcu()         --> notify change intent -->         HSM record to changes db
                              <-- ack change intent recorded <--
... make in-core changes ...  <-- wait for changes in-flight <--   sb_write_barrier()
sb_end_write_srcu()           --> ack changes in-flight -->
                              <-- persist recorded changes <--     syncfs()
persist in-core changes       --> ack persist changes -->          HSM notify change consumers

Thanks,
Amir.
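
For readers who want to play with the semantics of the flow above, here
is a rough userspace analogy. It is not the kernel implementation and it
is not real SRCU: it uses a plain lock and condition variable, and the
names WriteBarrier, start_write(), end_write() and barrier() are
hypothetical stand-ins for sb_start_write_srcu(), sb_end_write_srcu()
and sb_write_barrier(). The only property it tries to show is that the
barrier waits for writers that started before it, and not for writers
that start afterwards.

```python
import threading


class WriteBarrier:
    """Two-slot in-flight counter, loosely modelled on the SRCU idea."""

    def __init__(self):
        self._cond = threading.Condition()
        self._barrier_lock = threading.Lock()  # serializes barrier() callers
        self._epoch = 0
        self._in_flight = [0, 0]

    def start_write(self):
        """Enter a write section; returns the slot to pass to end_write()."""
        with self._cond:
            idx = self._epoch & 1
            self._in_flight[idx] += 1
            return idx

    def end_write(self, idx):
        """Leave a write section and wake any waiting barrier."""
        with self._cond:
            self._in_flight[idx] -= 1
            if self._in_flight[idx] == 0:
                self._cond.notify_all()

    def barrier(self):
        """Wait for every writer that started before this call."""
        with self._barrier_lock:
            with self._cond:
                old = self._epoch & 1
                self._epoch += 1  # new writers now use the other slot
                while self._in_flight[old]:
                    self._cond.wait()
```

In this analogy the HSM would call barrier() once per batch, after which
every change whose intent was recorded before the call is observable,
instead of pairing each pre- event with its own post- event.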