On Thu, 2022-08-18 at 11:07 +1000, Dave Chinner wrote: > On Wed, Aug 17, 2022 at 08:02:23AM -0400, Jeff Layton wrote: > > On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote: > > > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote: > > > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote: > > > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote: > > > > > > The i_version in xfs_trans_log_inode is bumped for any inode update, > > > > > > including atime-only updates due to reads. We don't want to record those > > > > > > in the i_version, as they don't represent "real" changes. Remove that > > > > > > callsite. > > > > > > > > > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the > > > > > > i_version and turn on XFS_ILOG_CORE if it happens. In > > > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being > > > > > > updated. > > > > > > > > > > What about operations that don't touch the mtime but change the file > > > > > metadata anyway? There are a few of those, like the blockgc garbage > > > > > collector, deduperange, and the defrag tool. > > > > > > > > > > > > > Do those change the c/mtime at all? > > > > > > > > It's possible we're missing some places that should change the i_version > > > > as well. We may need some more call sites. > > > > > > > > > Zooming out a bit -- what does i_version signal, concretely? I thought > > > > > it was used by nfs (and maybe ceph?) to signal to clients that the file > > > > > on the server has moved on, and the client needs to invalidate its > > > > > caches. I thought afs had a similar generation counter, though it's > > > > > only used to cache file data, not metadata? Does an i_version change > > > > > cause all of them to invalidate caches, or is there more behavior I > > > > > don't know about? > > > > > > > > > > > > > For NFS, it indicates a change to the change attribute indicates that > > > > there has been a change to the data or metadata for the file. atime > > > > changes due to reads are specifically exempted from this, but we do bump > > > > the i_version if someone (e.g.) changes the atime via utimes(). > > > > > > We have relatime behaviour to optimise away unnecessary atime > > > updates on reads. Trying to explicitly exclude i_version from atime > > > updates in one filesystem just because NFS doesn't need that > > > information seems .... misguided. The -on disk- i_version > > > field behaviour is defined by the filesystem implementation, not the > > > NFS requirements. > > > > -o relatime does not fix this. > > So why not fix -o relatime to handle this? That way the fix works > for all filesystems and doesn't require hacking around what the VFS > has told us to do in every filesystem. > > i.e. the VFS told us to update atime, we updated atime and that is a > persistent metadata change to the inode. Hence a filesystem with a > persistent change counter has to bump the change counter because > we've been asked by the VFS to make a persistent metadata change. > > If you want atime updates to not make persistent changes to on disk > metadata, then change the relatime implementation so that it doesn't > ask the filesystem to update the on-disk atime. > > Essentially, what I'm hearing is that NFS wants atime updates to > behave like lazytime, not like relatime. With lazytime, atime always > gets updated in memory, but it is not written back to the filesystem > until a timeout or some other modification is made to the inode or > file data. THe filesystem doesn't bump iversion until the timestamp > gets written back in the lazytime case. > > IOWs, we already have a mechanism in the kernel for making atime > updates behave exactly as NFS wants: -o lazytime. > No, that won't help. Both lazytime and relatime don't help anything since they don't address the fundamental problem, which is that the i_version changes due to atime updates. They only affect when the atime gets updated (or when it goes to disk). > > > > The NFS client will generally invalidate its caches for the inode when > > > > it notices a change attribute change. > > > > > > > > FWIW, AFS may not meet this standard since it doesn't generally > > > > increment the counter on metadata changes. It may turn out that we don't > > > > want to expose this to the AFS client due to that (or maybe come up with > > > > some way to indicate this difference). > > > > > > In XFS, we've defined the on-disk i_version field to mean > > > "increments with any persistent inode data or metadata change", > > > regardless of what the high level applications that use i_version > > > might actually require. > > > > > > That some network filesystem might only need a subset of the > > > metadata to be covered by i_version is largely irrelevant - if we > > > don't cover every persistent inode metadata change with i_version, > > > then applications that *need* stuff like atime change notification > > > can't be supported. > > > > > > > > Does that mean that we should bump i_version for any file data or > > > > > attribute that could be queried or observed by userspace? In which case > > > > > I suppose this change is still correct, even if it relaxes i_version > > > > > updates from "any change to the inode whatsoever" to "any change that > > > > > would bump mtime". Unless FIEMAP is part of "attributes observed by > > > > > userspace". > > > > > > > > > > (The other downside I can see is that now we have to remember to bump > > > > > timestamps for every new file operation we add, unlike the current code > > > > > which is centrally located in xfs_trans_log_inode.) > > > > > > > > > > > > > The main reason for the change attribute in NFS was that NFSv3 is > > > > plagued with cache-coherency problems due to coarse-grained timestamp > > > > granularity. It was conceived as a way to indicate that the inode had > > > > changed without relying on timestamps. > > > > > > Yes, and the most important design consideration for a filesystem is > > > that it -must be persistent-. The constraints on i_version are much > > > stricter than timestamps, and they are directly related to how the > > > filesystem persists metadata changes, not how metadata is changed or > > > accessed in memory. > > > > > > > In practice, we want to bump the i_version counter whenever the ctime or > > > > mtime would be changed. > > > > > > What about O_NOCMTIME modifications? What about lazytime > > > filesystems? These explicilty avoid or delay persisten c/mtime > > > updates, and that means bumping i_version only based on c/mtime > > > updates cannot be relied on. i_version is supposed to track user > > > visible data and metadata changes, *not timestamp updates*. > > > > I was speaking more about the sorts of activity that should result in > > the i_version being changed, not about tracking timestamp updates. IOW, > > if some activity would cause the mtime or ctime to change, then we want > > to also bump the i_version. > > > > Specifically, for NOCMTIME, I think we'd still want the i_version to > > change since that option is about timestamps and not i_version. > > Exactly my point: this is what XFS currently does. It is also what > your proposed changes break by tying i_version updates to c/mtime > updates. > > > > Hence, I don't think that trying to modify how filesystems persist > > > and maintain i_version coherency because NFS "doesn't need i_version > > > to cover atime updates" is the wrong approach. On-disk i_version > > > coherency has to work for more than just one NFS implementation > > > (especially now i_version will be exported to userspace!). > > > Persistent atime updates are already optimised away by relatime, and > > > so I think that any further atime filtering is largely a NFS > > > application layer problem and not something that should be solved by > > > changing the on-disk definition of back end filesystem structure > > > persistence. > > > > > > > Fair enough. xfs is not really in my wheelhouse so take this as a patch > > that helps illustrate the problem, rather than a serious submission. > > > > There are two consumers of the i_version counter today: the kernel NFS > > server and IMA. Having the i_version reflect atime updates due to reads > > harms both of those use-cases with unneeded cache invalidations on NFS > > and extra measurements on IMA. It would also be problematic for userland > > NFS servers such as ganesha if we end up exposing this to userland. > > So you're wanting define an exact behaviour for atime vs I_VERSION > where atime doesn't bump iversion. > > However, the definition we baked into the XFS on-disk format is that > the on-disk iversion filed changes for every persistent change made > to the inode. That includes atime updates. We did this for two > reasons - the first is so that we could support any application that > needed change detection, and the second was that knowing how many > changes have occurred to an inode is extremely useful for forensic > purposes (especially given the ability to use O_NOCMTIME to modify > file data). > > > atime updates are really a special case when it comes to metadata (and I > > maintain that they are one of the worst ideas in POSIX). The way you're > > choosing to define i_version doesn't really work properly for any > > current or projected use case. I'd like to see that changed. > > We chose to do that a decade ago, knowing that it is the > responsibility of the VFS to avoid unnecessary atime updates, not > the responsibility of the filesystem. That was the whole point of > introducing the relatime functionality: fix the problem at the VFS, > not have to work around generic atime behaviour in every filesystem. > > > If the concern is fragility of the code going forward, then maybe we can > > go with a different approach. Would it be possible to just have > > xfs_trans_log_inode skip bumping the i_version when the log transaction > > is _only_ for an atime update due to reads? Maybe we could add a new > > XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate > > codepaths? > > No, I don't think this is something we should be hacking around in > individual filesystems. If the VFS tells us we should be updating > atime, we should be updating it and bumping persistent change > counters because *that's what the VFS asked us to do*. > > IOWs, if NFS wants atime to be considered "in memory only" as per > the lazytime behaviour, then that behaviour needs to be > supported/changed at the VFS, not at the individual fileystem level. > > You could add a subset of SB_LAZYTIME functionality just for atime > updates, handle that entirely within the existing lazytime > infrastructure. THis means the VFS supports non-persistent, > best-effort, non-persistent atime updates directly. The VFS will not > ack filesystems to persist atime updates the moment they are made, > it will tell the filesystem via ->dirty_inode() that it needs to > persist the in-memory timestamps. > > This gives all the filesystems the same atime behaviour. If the VFS > is going to expose the persistent change counter to userspace, we > need to have consistent, VFS enforced rules on what is coverred by > the change counter. If atime is not covered by the ipersistent > change counter, then the VFS must not ask the filesystems to persist > atime changes every time atime changes. > > Then all filesystems will avoid on-disk atime updates as much as > possible, whilst still gathering them correctly when other > modifications are made. > > Until we've got a definition of what this persistent change counter > actually describes and guarantees for userspace, changing filesystem > implementations is premature. And if it is decided that atime is not > going to be considered a "persistent change" by itself, then that > behaviour needs to be encoded at the VFS, not in individual > filesystems.... > > Cheers, > > Dave. -- Jeff Layton <jlayton@xxxxxxxxxx>