Re: replacement i_version counter for xfs

Jeff Layton <jlayton@xxxxxxxxxx> · Wed, 25 Jan 2023 12:58:08 -0500

On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote:
> On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote:
> > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote:
> > > On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote:
> > > > A few months ago, I posted a patch to make xfs not bump its i_version
> > > > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning
> > > > that xfs would need to replace it with an entirely new field as the
> > > > existing counter is used for other purposes and its semantics are set in
> > > > stone.
> > > > 
> > > > Has anything been done toward that end?
> > > 
> > > No, because we don't have official specification of the behaviour
> > > the nfsd subsystem requires merged into the kernel yet.
> > > 
> > 
> > Ok. Hopefully that will be addressed in v6.3.
> > 
> > > > Should I file a bug report or something?
> > > 
> > > There's nothing we can really do until the new specification is set
> > > in stone. Filing a bug report won't change anything material.
> > > 
> > > As it is, I'm guessing that you desire the behaviour to be as you
> > > described in the iversion patchset you just posted. That is
> > > effectively:
> > > 
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear larger to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > 
> > > i.e. the definition is changing from *any* metadata or data change
> > > to *explicit* metadata/data changes, right? i.e. it should only
> > > change when ctime changes?
> > > 
> > 
> > Yes.
> > 
> > > IIUC the rest of the justification for i_version is that ctime might
> > > lack the timestamp granularity to disambiguate sub-timestamp
> > > granularity changes, so i_version is needed to bridge that gap.
> > > 
> > > Given that XFS has nanosecond timestamp resolution in the on-disk
> > > format, both i_version and ctime changes are journalled, and
> > > ctime/i_version will always change at exactly the same time in the
> > > same transactions, there are no inherent sub-timestamp granularity
> > > problems with ctime within XFS. Any deficiency in ctime resolution
> > > comes solely from the granularity of the VFS inode timestamp
> > > functions.
> > > 
> > > And so if current_time() was to provide fine-grained nanosecond
> > > timestamp resolution for exported XFS filesystems (i.e. use
> > > ktime_get_real_ts64() conditionally), then it seems to me that the
> > > nfsd i_version function becomes completely redundant.
> > > 
> > > i.e. we are pretty much guaranteed that ctime on exported
> > > filesystems will always be different for explicit modifications to
> > > the same inode, and hence we can just use ctime as the version
> > > change identifier without needing any on-disk format changes at all.
> > > 
> > > And we can optimise away that overhead when the filesystem is not
> > > exported by just using the coarse timestamps because there is no
> > > need for sub-timer-tick disambiguation of single file
> > > modifications....
> > > 
> > 
> > Ok, so conditional on (maybe) a per fstype flag, and whether the
> > filesystem is exported?
> > 
> > It's not trivial to tell whether something is exported though. We
> > typically only do that sort of checking within nfsd. That involves an
> > upcall into mountd, at a minimum.
> > 
> > I don't think you want to be plumbing calls to exportfs into xfs for
> > this. It may be simpler to just add a new on-disk counter and be done
> > with it.
> 
> Simpler for you, maybe.  Ondisk format changes are a PITA to evaluate
> and come with a long support burden.  We'd also have to write
> xfs-specific testcases to ensure that the counter updates according to
> specification.
> 
> Poking the kernel to provide sub-jiffies timestamp granularity when
> required stays within the existing ondisk format, can be added to any
> filesystem with sufficient timestamp granularity, and can be the subject
> of a generic/ vfs test.
> 
> I also wonder if it's even necessary to use ktime_get_real_ts64 in all
> cases -- can we sample the coarse granularity timestamp, and only go for
> the higher resolution one if the first matches the ctime?
> 
> > > Hence it appears to me that with the new i_version specification
> > > that there's an avenue out of this problem entirely that is "nfsd
> > > needs to use ctime, not i_version". This solution seems generic
> > > enough that filesystems with existing on-disk nanosecond timestamp
> > > granularity would no longer need explicit on-disk support for the
> > > nfsd i_version functionality, yes?
> > > 
> > 
> > Pretty much.
> > 
> > My understanding has always been that it's not the on-disk format that's
> > the limiting factor, but the resolution of in-kernel timestamp sources.
> > If ktime_get_real_ts64 has real ns granularity, then that should be
> > sufficient (at least for the moment). I'm unclear on the performance
> > implications with such a change though.
> 
> I bet you can find some arm board or something with a terrible
> clocksource that will take forever to produce high resolution timestamps
> and get it wrong.
> 
> > You had also mentioned a while back that there was some desire for
> > femtosecond resolution on timestamps. Does that change the calculus here
> > at all? Note that the i_version is not subject to any timestamp
> > granularity issues.
> 
> I personally don't care to go enlarge xfs timestamps even further to
> support sub-ns resolution, but I see the theoretical argument for
> needing them on an 8GHz Intel i9-13900KS(paceheater)...
> 
> > If you want nfsd to start using the ctime for i_version with xfs, then
> > you can just turn off the SB_I_IVERSION flag. You will need to do some
> > work though to keep your "special" i_version that also counts atime
> > updates working once you turn that off. You'll probably want to do that
> > anyway though since the semantics for xfs's version counter are
> > different from everyone else's.
> > 
> > If this is what you choose to do for xfs, then the question becomes: who
> > is going to do that timestamp rework?
> > 
> > Note that there are two other lingering issues with i_version. Neither
> > of these are xfs-specific, but they may inform the changes you want to
> > make there:
> > 
> > 1/ the ctime and i_version can roll backward on a crash.
> > 
> > 2/ the ctime and i_version are both currently updated before write data
> > is copied to the pagecache. It would be ideal if that were done
> > afterward instead. (FWIW, I have some draft patches for btrfs and ext4
> > for this, but they need a lot more testing.)
> 
> You might also want some means for xfs to tell the vfs that it already
> did the timestamp update (because, say, we had to allocate blocks).
> I wonder what people will say when we have to run a transaction before
> the write to peel off suid bits and another one after to update ctime.
> 

That's a great question! There is a related one too once I started
looking at this in more detail:

Most filesystems end up updating the timestamp via a the call to
file_update_time in __generic_file_write_iter. Today, that's called very
early in the function and if it fails, the write fails without changing
anything.

What do we do now if the write succeeds, but update_time fails? We don't
want to return an error on the write() since the data did get copied in.
Ignoring it seems wrong too though. There could even be some way to
exploit that by changing the contents while holding the timestamp and
version constant.

At this point I'm leaning toward leaving the ctime and i_version to be
updated before the write, and just bumping the i_version a second time
after. In most cases the second bump will end up being a no-op, unless
an i_version query races in between.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>