Re: replacement i_version counter for xfs

"Darrick J. Wong" <djwong@xxxxxxxxxx> · Wed, 25 Jan 2023 08:32:58 -0800

On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote:
> On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote:
> > On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote:
> > > A few months ago, I posted a patch to make xfs not bump its i_version
> > > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning
> > > that xfs would need to replace it with an entirely new field as the
> > > existing counter is used for other purposes and its semantics are set in
> > > stone.
> > > 
> > > Has anything been done toward that end?
> > 
> > No, because we don't have official specification of the behaviour
> > the nfsd subsystem requires merged into the kernel yet.
> > 
> 
> Ok. Hopefully that will be addressed in v6.3.
> 
> > > Should I file a bug report or something?
> > 
> > There's nothing we can really do until the new specification is set
> > in stone. Filing a bug report won't change anything material.
> > 
> > As it is, I'm guessing that you desire the behaviour to be as you
> > described in the iversion patchset you just posted. That is
> > effectively:
> > 
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear larger to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > 
> > i.e. the definition is changing from *any* metadata or data change
> > to *explicit* metadata/data changes, right? i.e. it should only
> > change when ctime changes?
> > 
> 
> Yes.
> 
> > IIUC the rest of the justification for i_version is that ctime might
> > lack the timestamp granularity to disambiguate sub-timestamp
> > granularity changes, so i_version is needed to bridge that gap.
> > 
> > Given that XFS has nanosecond timestamp resolution in the on-disk
> > format, both i_version and ctime changes are journalled, and
> > ctime/i_version will always change at exactly the same time in the
> > same transactions, there are no inherent sub-timestamp granularity
> > problems with ctime within XFS. Any deficiency in ctime resolution
> > comes solely from the granularity of the VFS inode timestamp
> > functions.
> > 
> > And so if current_time() was to provide fine-grained nanosecond
> > timestamp resolution for exported XFS filesystems (i.e. use
> > ktime_get_real_ts64() conditionally), then it seems to me that the
> > nfsd i_version function becomes completely redundant.
> > 
> > i.e. we are pretty much guaranteed that ctime on exported
> > filesystems will always be different for explicit modifications to
> > the same inode, and hence we can just use ctime as the version
> > change identifier without needing any on-disk format changes at all.
> > 
> > And we can optimise away that overhead when the filesystem is not
> > exported by just using the coarse timestamps because there is no
> > need for sub-timer-tick disambiguation of single file
> > modifications....
> > 
> 
> Ok, so conditional on (maybe) a per fstype flag, and whether the
> filesystem is exported?
> 
> It's not trivial to tell whether something is exported though. We
> typically only do that sort of checking within nfsd. That involves an
> upcall into mountd, at a minimum.
> 
> I don't think you want to be plumbing calls to exportfs into xfs for
> this. It may be simpler to just add a new on-disk counter and be done
> with it.

Simpler for you, maybe.  Ondisk format changes are a PITA to evaluate
and come with a long support burden.  We'd also have to write
xfs-specific testcases to ensure that the counter updates according to
specification.

Poking the kernel to provide sub-jiffies timestamp granularity when
required stays within the existing ondisk format, can be added to any
filesystem with sufficient timestamp granularity, and can be the subject
of a generic/ vfs test.

I also wonder if it's even necessary to use ktime_get_real_ts64 in all
cases -- can we sample the coarse granularity timestamp, and only go for
the higher resolution one if the first matches the ctime?

> > Hence it appears to me that with the new i_version specification
> > that there's an avenue out of this problem entirely that is "nfsd
> > needs to use ctime, not i_version". This solution seems generic
> > enough that filesystems with existing on-disk nanosecond timestamp
> > granularity would no longer need explicit on-disk support for the
> > nfsd i_version functionality, yes?
> > 
> 
> Pretty much.
> 
> My understanding has always been that it's not the on-disk format that's
> the limiting factor, but the resolution of in-kernel timestamp sources.
> If ktime_get_real_ts64 has real ns granularity, then that should be
> sufficient (at least for the moment). I'm unclear on the performance
> implications with such a change though.

I bet you can find some arm board or something with a terrible
clocksource that will take forever to produce high resolution timestamps
and get it wrong.

> You had also mentioned a while back that there was some desire for
> femtosecond resolution on timestamps. Does that change the calculus here
> at all? Note that the i_version is not subject to any timestamp
> granularity issues.

I personally don't care to go enlarge xfs timestamps even further to
support sub-ns resolution, but I see the theoretical argument for
needing them on an 8GHz Intel i9-13900KS(paceheater)...

> If you want nfsd to start using the ctime for i_version with xfs, then
> you can just turn off the SB_I_IVERSION flag. You will need to do some
> work though to keep your "special" i_version that also counts atime
> updates working once you turn that off. You'll probably want to do that
> anyway though since the semantics for xfs's version counter are
> different from everyone else's.
> 
> If this is what you choose to do for xfs, then the question becomes: who
> is going to do that timestamp rework?
> 
> Note that there are two other lingering issues with i_version. Neither
> of these are xfs-specific, but they may inform the changes you want to
> make there:
> 
> 1/ the ctime and i_version can roll backward on a crash.
> 
> 2/ the ctime and i_version are both currently updated before write data
> is copied to the pagecache. It would be ideal if that were done
> afterward instead. (FWIW, I have some draft patches for btrfs and ext4
> for this, but they need a lot more testing.)

You might also want some means for xfs to tell the vfs that it already
did the timestamp update (because, say, we had to allocate blocks).
I wonder what people will say when we have to run a transaction before
the write to peel off suid bits and another one after to update ctime.

--D

> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>