Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

Amir Goldstein <amir73il@xxxxxxxxx> · Wed, 25 Oct 2023 13:41:46 +0300

On Wed, Oct 25, 2023 at 11:05 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > >
> > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > The problem is the first read request after a modification has been
> > > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > > committing that persistent inode metadata update, it calls
> > > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > > i_version and tells XFS to persist it.
> > > > >
> > > > > Could we perhaps just have a mode where we don't increment i_version
> > > > > for just atime updates?
> > > > >
> > > > > Maybe we don't even need a mode, and could just decide that atime
> > > > > updates aren't i_version updates at all?
> > > >
> > > > We do that already - in memory atime updates don't bump i_version at
> > > > all. The issue is the rare persistent atime update requests that
> > > > still happen - they are the ones that trigger an i_version bump on
> > > > XFS, and one of the relatime heuristics tickles this specific issue.
> > > >
> > > > If we push the problematic persistent atime updates to be in-memory
> > > > updates only, then the whole problem with i_version goes away....
> > > >
> > > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > > anybody actually *want* atime updates with no actual other changes to
> > > > > be version events?
> > > >
> > > > Well, yes, there was. That's why we defined i_version in the on disk
> > > > format this way well over a decade ago. It was part of some deep
> > > > dark magical HSM beans that allowed the application to combine
> > > > multiple scans for different inode metadata changes into a single
> > > > pass. atime changes were one of the things it needed to know about
> > > > for tiering and space scavenging purposes....
> > > >
> > >
> > > But if this is such an ancient mystical program, why do we have to
> > > keep this XFS behavior in the present?
> > > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > > a few years back?
>
> Drop the attitude, Amir.
>
> That "ancient mystical program" is this:
>
> https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088
>

Sorry for the attitude, Dave. I somehow got the impression that you
were talking about a hypothetical old program that may be out of use.
I believe that Jeff and Linus got the same impression...

> Yup, that product is backed by a proprietary descendant of the Irix
> XFS code base that is DMAPI enabled and still in use today. It's
> called HPE XFS these days....
>

What do you mean?
Do you mean that the HPE product uses patched XFS?
If so, why is that an upstream concern?

Upstream xfs does preserve di_dmstate and di_dmevmask, but it does
not change those fields when a file changes.

So if mounting an HPE XFS disk with an upstream kernel is not
going to record DMAPI state changes, does it matter if upstream
xfs does not update di_changecount on atime change?

Maybe I did not understand the situation w.r.t. HPE XFS.

> > > I mean, I understand that you do not want to change the behavior of
> > > i_version update without an opt-in config or mount option - let the distro
> > > make that choice.
> > > But calling this an "on-disk format change" is a very long stretch.
>
> Telling the person who created, defined and implemented the on disk
> format that they don't know what constitutes a change of that
> on-disk format seems kinda Dunning-Kruger to me....
>

OK. I will choose my words more carefully:

I still do not understand, from everything that you have told us
so far, including the mention of the specific product above,
why not updating di_changecount on atime update constitutes
an on-disk format change and not a runtime behavior change.

You also did not address my comment that, as far as I can tell from
reading the code, xfs_repair does not update di_changecount on any
inode changes.

> There are *lots* of ways that di_changecount is now incompatible
> with the VFS change counter. That's now defined as "i_version should
> only change when [cm]time is changed".
>
> di_changecount is defined to be a count of the number of changes
> made to the attributes of the inode.  It's not just atime at issue
> here - we bump di_changecount when we make any inode change, including
> background work that does not otherwise change timestamps. e.g.
> allocation at writeback time, unwritten extent conversion, on-disk
> EOF extension at IO completion, removal of speculative
> pre-allocation beyond EOF, etc.
>

I see.
Does xfs update ctime on all those inode block map changes?

> IOWs, di_changecount was never defined as a linux "i_version"
> counter, regardless of the fact we were originally able to implement
> i_version with it - all extra bumps to di_changecount were not
> important to the users of i_version for about a decade.
>
> Unfortunately, the new i_version definition is very much
> incompatible with the existing di_changecount definition and that's
> the underlying problem here. i.e. the problem is not that we bump
> i_version on atime, it's that di_changecount is now completely
> incompatible with the new i_version change semantics.
>
> To implement the new i_version semantics exactly, we need to add a
> new field to the inode to hold this information.
> If we change the on disk format like this, then the atime
> problems go away because the new field would not get updated on
> atime updates. We'd still be bumping di_changecount on atime
> updates, though, because that's what is required by the on-disk
> format.
>

I fully agree with you that we should avoid an on-disk format change.
This is exactly the reason that I'm insisting on clarifying exactly
how this semantic change of di_changecount is going to break
existing applications that run on an upstream kernel.
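
For reference, the interaction described at the top of this thread
(relatime sees mtime > atime, a persistent atime update follows, and
inode_maybe_inc_iversion(force = false) bumps i_version because
I_VERSION_QUERIED is set) can be modelled in userspace roughly as
below. This is only a simplified sketch, not kernel code: every
model_* name is hypothetical, the queried flag is assumed to live in
the low bit of the counter, the relatime check is reduced to
whole-second comparisons with an assumed 24 hour fallback, and there
is no atomicity or on-disk logging.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_QUERIED   1ULL  /* low bit: counter was read since last bump */
#define MODEL_INCREMENT 2ULL  /* counter value is stored in the upper bits */
#define MODEL_DAY       (24 * 60 * 60)

/* Hypothetical stand-in for the few inode fields that matter here. */
struct model_inode {
	int64_t  atime, mtime, ctime;  /* whole seconds, for simplicity */
	uint64_t i_version;            /* (counter << 1) | queried flag */
};

/* Reading the counter marks it as queried and returns its value. */
static uint64_t model_query_iversion(struct model_inode *inode)
{
	inode->i_version |= MODEL_QUERIED;
	return inode->i_version >> 1;
}

/* Bump only if forced, or if somebody queried since the last bump. */
static bool model_maybe_inc_iversion(struct model_inode *inode, bool force)
{
	if (!force && !(inode->i_version & MODEL_QUERIED))
		return false;
	inode->i_version = (inode->i_version & ~MODEL_QUERIED) + MODEL_INCREMENT;
	return true;
}

/* relatime: update atime if it is not newer than mtime/ctime, or a day old. */
static bool model_relatime_need_update(const struct model_inode *inode,
				       int64_t now)
{
	return inode->mtime >= inode->atime ||
	       inode->ctime >= inode->atime ||
	       now - inode->atime >= MODEL_DAY;
}

/* A read at time 'now'; returns true if it ended up bumping i_version. */
static bool model_read(struct model_inode *inode, int64_t now)
{
	if (!model_relatime_need_update(inode, now))
		return false;
	inode->atime = now;  /* the persistent atime update */
	return model_maybe_inc_iversion(inode, false);
}

/* First read after a modification, with the counter already queried. */
void model_demo(void)
{
	struct model_inode ino = { .atime = 100, .mtime = 200, .ctime = 200 };

	assert(model_query_iversion(&ino) == 0); /* observer samples counter */
	assert(model_read(&ino, 300));           /* atime update bumps it */
	assert(model_query_iversion(&ino) == 1); /* observer sees a "change" */
	assert(!model_read(&ino, 301));          /* later reads stay quiet */
}
```

In this model only the first read after a modification bumps the
counter, and only when something queried the counter beforehand.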

Thanks,
Amir.