On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote: > > > A few months ago, I posted a patch to make xfs not bump its i_version > > > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning > > > that xfs would need to replace it with an entirely new field as the > > > existing counter is used for other purposes and its semantics are set in > > > stone. > > > > > > Has anything been done toward that end? > > > > No, because we don't have official specification of the behaviour > > the nfsd subsystem requires merged into the kernel yet. > > > > Ok. Hopefully that will be addressed in v6.3. > > > > Should I file a bug report or something? > > > > There's nothing we can really do until the new specification is set > > in stone. Filing a bug report won't change anything material. > > > > As it is, I'm guessing that you desire the behaviour to be as you > > described in the iversion patchset you just posted. That is > > effectively: > > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > - * appear different to observers if there was a change to the inode's data or > > - * metadata since it was last queried. > > + * appear larger to observers if there was an explicit change to the inode's > > + * data or metadata since it was last queried. > > > > i.e. the definition is changing from *any* metadata or data change > > to *explicit* metadata/data changes, right? i.e. it should only > > change when ctime changes? > > > > Yes. > > > IIUC the rest of the justification for i_version is that ctime might > > lack the timestamp granularity to disambiguate sub-timestamp > > granularity changes, so i_version is needed to bridge that gap. > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > format, both i_version and ctime changes are journalled, and > > ctime/i_version will always change at exactly the same time in the > > same transactions, there are no inherent sub-timestamp granularity > > problems with ctime within XFS. Any deficiency in ctime resolution > > comes solely from the granularity of the VFS inode timestamp > > functions. > > > > And so if current_time() was to provide fine-grained nanosecond > > timestamp resolution for exported XFS filesystems (i.e. use > > ktime_get_real_ts64() conditionally), then it seems to me that the > > nfsd i_version function becomes completely redundant. > > > > i.e. we are pretty much guaranteed that ctime on exported > > filesystems will always be different for explicit modifications to > > the same inode, and hence we can just use ctime as the version > > change identifier without needing any on-disk format changes at all. > > > > And we can optimise away that overhead when the filesystem is not > > exported by just using the coarse timestamps because there is no > > need for sub-timer-tick disambiguation of single file > > modifications.... > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > filesystem is exported? > > It's not trivial to tell whether something is exported though. We > typically only do that sort of checking within nfsd. That involves an > upcall into mountd, at a minimum. > > I don't think you want to be plumbing calls to exportfs into xfs for > this. It may be simpler to just add a new on-disk counter and be done > with it. Simpler for you, maybe. Ondisk format changes are a PITA to evaluate and come with a long support burden. We'd also have to write xfs-specific testcases to ensure that the counter updates according to specification. Poking the kernel to provide sub-jiffies timestamp granularity when required stays within the existing ondisk format, can be added to any filesystem with sufficient timestamp granularity, and can be the subject of a generic/ vfs test. I also wonder if it's even necessary to use ktime_get_real_ts64 in all cases -- can we sample the coarse granularity timestamp, and only go for the higher resolution one if the first matches the ctime? > > Hence it appears to me that with the new i_version specification > > that there's an avenue out of this problem entirely that is "nfsd > > needs to use ctime, not i_version". This solution seems generic > > enough that filesystems with existing on-disk nanosecond timestamp > > granularity would no longer need explicit on-disk support for the > > nfsd i_version functionality, yes? > > > > Pretty much. > > My understanding has always been that it's not the on-disk format that's > the limiting factor, but the resolution of in-kernel timestamp sources. > If ktime_get_real_ts64 has real ns granularity, then that should be > sufficient (at least for the moment). I'm unclear on the performance > implications with such a change though. I bet you can find some arm board or something with a terrible clocksource that will take forever to produce high resolution timestamps and get it wrong. > You had also mentioned a while back that there was some desire for > femtosecond resolution on timestamps. Does that change the calculus here > at all? Note that the i_version is not subject to any timestamp > granularity issues. I personally don't care to go enlarge xfs timestamps even further to support sub-ns resolution, but I see the theoretical argument for needing them on an 8GHz Intel i9-13900KS(paceheater)... > If you want nfsd to start using the ctime for i_version with xfs, then > you can just turn off the SB_I_IVERSION flag. You will need to do some > work though to keep your "special" i_version that also counts atime > updates working once you turn that off. You'll probably want to do that > anyway though since the semantics for xfs's version counter are > different from everyone else's. > > If this is what you choose to do for xfs, then the question becomes: who > is going to do that timestamp rework? > > Note that there are two other lingering issues with i_version. Neither > of these are xfs-specific, but they may inform the changes you want to > make there: > > 1/ the ctime and i_version can roll backward on a crash. > > 2/ the ctime and i_version are both currently updated before write data > is copied to the pagecache. It would be ideal if that were done > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > for this, but they need a lot more testing.) You might also want some means for xfs to tell the vfs that it already did the timestamp update (because, say, we had to allocate blocks). I wonder what people will say when we have to run a transaction before the write to peel off suid bits and another one after to update ctime. --D > > -- > Jeff Layton <jlayton@xxxxxxxxxx>