Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

Dave Chinner <david@xxxxxxxxxxxxx> · Wed, 25 Oct 2023 19:05:25 +1100

On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > 
> > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > 
> > > > > The problem is the first read request after a modification has been
> > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > committing that persistent inode metadata update, it calls
> > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > i_version and tells XFS to persist it.
> > > > 
> > > > Could we perhaps just have a mode where we don't increment i_version
> > > > for just atime updates?
> > > > 
> > > > Maybe we don't even need a mode, and could just decide that atime
> > > > updates aren't i_version updates at all?
> > > 
> > > We do that already - in memory atime updates don't bump i_version at
> > > all. The issue is the rare persistent atime update requests that
> > > still happen - they are the ones that trigger an i_version bump on
> > > XFS, and one of the relatime heuristics tickle this specific issue.
> > > 
> > > If we push the problematic persistent atime updates to be in-memory
> > > updates only, then the whole problem with i_version goes away....
> > > 
> > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > anybody actually *want* atime updates with no actual other changes to
> > > > be version events?
> > > 
> > > Well, yes, there was. That's why we defined i_version in the on disk
> > > format this way well over a decade ago. It was part of some deep
> > > dark magical HSM beans that allowed the application to combine
> > > multiple scans for different inode metadata changes into a single
> > > pass. atime changes was one of the things it needed to know about
> > > for tiering and space scavenging purposes....
> > > 
> > 
> > But if this is such an ancient mystical program, why do we have to
> > keep this XFS behavior in the present?
> > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > a few years back?

Drop the attitude, Amir.

That "ancient mystical program" is this:

https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088

Yup, that product is backed by a proprietary descendant of the Irix
XFS code base that is DMAPI enabled and still in use today. It's
called HPE XFS these days....

> > I mean, I understand that you do not want to change the behavior of
> > i_version update without an opt-in config or mount option - let the distro
> > make that choice.
> > But calling this an "on-disk format change" is a very long stretch.

Telling the person who created, defined and implemented the on disk
format that they don't know what constitutes a change of that
on-disk format seems kinda Dunning-Kruger to me....

There are *lots* of ways that di_changecount is now incompatible
with the VFS change counter. The VFS counter is now defined as
"i_version should only change when [cm]time is changed".

di_changecount is defined to be a count of the number of changes
made to the attributes of the inode.  It's not just atime at issue
here - we bump di_changecount when we make any inode change, including
background work that does not otherwise change timestamps. e.g.
allocation at writeback time, unwritten extent conversion, on-disk
EOF extension at IO completion, removal of speculative
pre-allocation beyond EOF, etc.

IOWs, di_changecount was never defined as a Linux "i_version"
counter, regardless of the fact we were originally able to implement
i_version with it - all extra bumps to di_changecount were not
important to the users of i_version for about a decade.

Unfortunately, the new i_version definition is very much
incompatible with the existing di_changecount definition and that's
the underlying problem here. i.e. the problem is not that we bump
i_version on atime, it's that di_changecount is now completely
incompatible with the new i_version change semantics.

To implement the new i_version semantics exactly, we need to add a
new field to the inode to hold this information.
If we change the on disk format like this, then the atime
problems go away because the new field would not get updated on
atime updates. We'd still be bumping di_changecount on atime
updates, though, because that's what is required by the on-disk
format.

I'm really trying to avoid changing the on-disk format unless it
is absolutely necessary. If we can get the in-memory timestamp
updates to avoid tripping di_changecount updates then the atime
problems go away.

If we can get [cm]time sufficiently fine grained that we don't need
i_version, then we can turn off i_version in XFS and di_changecount
ends up being entirely internal. That's what was attempted with
generic multi-grain timestamps, but that hasn't worked.

Another option is for XFS to play its own internal tricks with
[cm]time granularity and turn off i_version. e.g. limit external
timestamp visibility to 1us and use the remaining dozen bits of the
ns field to hold a change counter for updates within a single coarse
timer tick. This guarantees the timestamp changes within a coarse
tick for the purposes of change detection, but we don't expose those
bits to applications so applications that compare timestamps across
inodes won't get things back to front like was happening with the
multi-grain timestamps....
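
As a rough illustration of that idea, here is a userspace sketch. The
assumptions are mine, not an existing XFS implementation: the type and
function names are hypothetical, and the hidden counter is kept in the
sub-microsecond decimal digits (0..999) of the nanosecond field rather
than in a fixed number of bits, so the stored value stays a valid
nanosecond count.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_USEC 1000u
#define NSEC_PER_SEC  1000000000u

/* Hypothetical timestamp: nsec = microseconds * 1000 + hidden counter. */
struct model_ts {
    int64_t  sec;
    uint32_t nsec;
};

/* What applications would see: sub-microsecond digits masked off. */
struct model_ts ts_external(struct model_ts stored)
{
    stored.nsec -= stored.nsec % NSEC_PER_USEC;
    return stored;
}

/* The hidden per-tick change counter held in the low digits. */
uint32_t ts_hidden_count(struct model_ts stored)
{
    return stored.nsec % NSEC_PER_USEC;
}

/* Record a change that happened at coarse clock time `now`. */
struct model_ts ts_update(struct model_ts stored, struct model_ts now)
{
    struct model_ts visible_now = ts_external(now);
    struct model_ts visible_old = ts_external(stored);

    assert(now.nsec < NSEC_PER_SEC);

    if (visible_now.sec == visible_old.sec &&
        visible_now.nsec == visible_old.nsec) {
        /*
         * Same visible timestamp as the last update: bump the hidden
         * counter so the stored value still changes. It saturates at
         * 999, a limitation of this sketch.
         */
        if (ts_hidden_count(stored) < NSEC_PER_USEC - 1)
            stored.nsec++;
        return stored;
    }

    /* The visible time moved on: restart the hidden counter at zero. */
    return visible_now;
}
```

Two updates within the same coarse tick produce different stored
values but the same ts_external() value, so change detection works
while cross-inode timestamp comparisons only ever see the truncated
time.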

Another option is to work around the visible symptoms of the
semantic mismatch between i_version and di_changecount. The only
visible symptom we currently know about is the atime vs i_version
issue.  If people are happy for us to simply ignore VFS atime
guidelines (i.e. ignore relatime/lazytime) and do completely our own
stuff with timestamp update deferral, then that also solves the
immediate issues.

> > Does xfs_repair guarantee that changes of atime, or any inode changes
> > for that matter, update i_version? No, it does not.
> > So IMO, "atime does not update i_version" is not an "on-disk format change",
> > it is a runtime behavior change, just like lazytime is.
> 
> This would certainly be my preference. I don't want to break any
> existing users though.

That's why I'm trying to get some kind of consensus on what
rules and/or atime configurations people are happy for me to break
to make it look to users like there's a viable working change
attribute being supplied by XFS without needing to change the on
disk format.

> Perhaps this ought to be a mkfs option? Existing XFS filesystems could
> still behave with the legacy behavior, but we could make mkfs.xfs build
> filesystems by default that work like NFS requires.

If we require mkfs to set a flag to change behaviour, then we're
talking about making an explicit on-disk format change to select the
optional behaviour. That's precisely what I want to avoid.

-Dave.

-- 
Dave Chinner
david@xxxxxxxxxxxxx