Re: [PATCH RFC 2/9] timekeeping: new interfaces for multigrain timestamp handing

Jeff Layton <jlayton@xxxxxxxxxx> · Wed, 25 Oct 2023 08:25:35 -0400

On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote:
> On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote:
> > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote:
> > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > 
> > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote:
> > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > > > > > 
> > > > > > The problem is the first read request after a modification has been
> > > > > > made. That is causing relatime to see mtime > atime and triggering
> > > > > > an atime update. XFS sees this, does an atime update, and in
> > > > > > committing that persistent inode metadata update, it calls
> > > > > > inode_maybe_inc_iversion(force = false) to check if an iversion
> > > > > > update is necessary. The VFS sees I_VERSION_QUERIED, and so it bumps
> > > > > > i_version and tells XFS to persist it.
> > > > > 
> > > > > Could we perhaps just have a mode where we don't increment i_version
> > > > > for just atime updates?
> > > > > 
> > > > > Maybe we don't even need a mode, and could just decide that atime
> > > > > updates aren't i_version updates at all?
> > > > 
> > > > We do that already - in memory atime updates don't bump i_version at
> > > > all. The issue is the rare persistent atime update requests that
> > > > still happen - they are the ones that trigger an i_version bump on
> > > > XFS, and one of the relatime heuristics tickle this specific issue.
> > > > 
> > > > If we push the problematic persistent atime updates to be in-memory
> > > > updates only, then the whole problem with i_version goes away....
> > > > 
> > > > > Yes, yes, it's obviously technically a "inode modification", but does
> > > > > anybody actually *want* atime updates with no actual other changes to
> > > > > be version events?
> > > > 
> > > > Well, yes, there was. That's why we defined i_version in the on disk
> > > > format this way well over a decade ago. It was part of some deep
> > > > dark magical HSM beans that allowed the application to combine
> > > > multiple scans for different inode metadata changes into a single
> > > > pass. atime changes was one of the things it needed to know about
> > > > for tiering and space scavenging purposes....
> > > > 
> > > 
> > > But if this is such an ancient mystical program, why do we have to
> > > keep this XFS behavior in the present?
> > > BTW, is this the same HSM whose DMAPI ioctls were deprecated
> > > a few years back?
> 
> Drop the attitude, Amir.
> 
> That "ancient mystical program" is this:
> 
> https://buy.hpe.com/us/en/enterprise-solutions/high-performance-computing-solutions/high-performance-computing-storage-solutions/hpc-storage-solutions/hpe-data-management-framework-7/p/1010144088
> 
> Yup, that product is backed by a proprietary descendent of the Irix
> XFS code base XFS that is DMAPI enabled and still in use today. It's
> called HPE XFS these days....
> 
> > > I mean, I understand that you do not want to change the behavior of
> > > i_version update without an opt-in config or mount option - let the distro
> > > make that choice.
> > > But calling this an "on-disk format change" is a very long stretch.
> 
> Telling the person who created, defined and implemented the on disk
> format that they don't know what constitutes a change of that
> on-disk format seems kinda Dunning-Kruger to me....
> 
> There are *lots* of ways that di_changecount is now incompatible
> with the VFS change counter. That's now defined as "i_version should
> only change when [cm]time is changed".
> 
> di_changecount is defined to be a count of the number of changes
> made to the attributes of the inode.  It's not just atime at issue
> here - we bump di_changecount when make any inode change, including
> background work that does not otherwise change timestamps. e.g.
> allocation at writeback time, unwritten extent conversion, on-disk
> EOF extension at IO completion, removal of speculative
> pre-allocation beyond EOF, etc.
> 
> IOWs, di_changecount was never defined as a linux "i_version"
> counter, regardless of the fact we originally we able to implement
> i_version with it - all extra bumps to di_changecount were not
> important to the users of i_version for about a decade.
> 
> Unfortunately, the new i_version definition is very much
> incompatible with the existing di_changecount definition and that's
> the underlying problem here. i.e. the problem is not that we bump
> i_version on atime, it's that di_changecount is now completely
> incompatible with the new i_version change semantics.
> 
> To implement the new i_version semantics exactly, we need to add a
> new field to the inode to hold this information.
> If we change the on disk format like this, then the atime
> problems go away because the new field would not get updated on
> atime updates. We'd still be bumping di_changecount on atime
> updates, though, because that's what is required by the on-disk
> format.
> 
> I'm really trying to avoid changing the on-disk format unless it
> is absolutely necessary. If we can get the in-memory timestamp
> updates to avoid tripping di_changecount updates then the atime
> problems go away.
> 
> If we can get [cm]time sufficiently fine grained that we don't need
> i_version, then we can turn off i_version in XFS and di_changecount
> ends up being entirely internal. That's what was attempted with
> generic multi-grain timestamps, but that hasn't worked.
> 
> Another options is for XFS to play it's own internal tricks with
> [cm]time granularity and turn off i_version. e.g. limit external
> timestamp visibility to 1us and use the remaining dozen bits of the
> ns field to hold a change counter for updates within a single coarse
> timer tick. This guarantees the timestamp changes within a coarse
> tick for the purposes of change detection, but we don't expose those
> bits to applications so applications that compare timestamps across
> inodes won't get things back to front like was happening with the
> multi-grain timestamps....
> 
> Another option is to work around the visible symptoms of the
> semantic mismatch between i_version and di_changecount. The only
> visible symptom we currently know about is the atime vs i_version
> issue.  If people are happy for us to simply ignore VFS atime
> guidelines (i.e. ignore realtime/lazytime) and do completely our own
> stuff with timestamp update deferal, then that also solve the
> immediate issues.
> 
> > > Does xfs_repair guarantee that changes of atime, or any inode changes
> > > for that matter, update i_version? No, it does not.
> > > So IMO, "atime does not update i_version" is not an "on-disk format change",
> > > it is a runtime behavior change, just like lazytime is.
> > 
> > This would certainly be my preference. I don't want to break any
> > existing users though.
> 
> That's why I'm trying to get some kind of consensus on what
> rules and/or atime configurations people are happy for me to break
> to make it look to users like there's a viable working change
> attribute being supplied by XFS without needing to change the on
> disk format.
> 

I agree that the only bone of contention is whether to count atime
updates against the change attribute. I think we have consensus that all
in-kernel users do _not_ want atime updates counted against the change
attribute. The only real question is these "legacy" users of
di_changecount.

> > Perhaps this ought to be a mkfs option? Existing XFS filesystems could
> > still behave with the legacy behavior, but we could make mkfs.xfs build
> > filesystems by default that work like NFS requires.
> 
> If we require mkfs to set a flag to change behaviour, then we're
> talking about making an explicit on-disk format change to select the
> optional behaviour. That's precisely what I want to avoid.
> 

Right. The on-disk di_changecount would have a (subtly) different
meaning at that point.

It's not a change that requires drastic retooling though. If we were to
do this, we wouldn't need to grow the on-disk inode. Booting to an older
kernel would cause the behavior to revert. That's sub-optimal, but not
fatal.

What I don't quite understand is how these tools are accessing
di_changecount?

XFS only accesses the di_changecount to propagate the value to and from
the i_version, and there is nothing besides NFSD and IMA that queries
the i_version value in-kernel. So, this must be done via some sort of
userland tool that is directly accessing the block device (or some 3rd
party kernel module).

In earlier discussions you alluded to some repair and/or analysis tools
that depended on this counter. I took a quick look in xfsprogs, but I
didn't see anything there. Is there a library or something that these
tools use to get at this value?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>