Re: [PATCH] xfs: fix i_version handling in xfs

Dave Chinner <david@xxxxxxxxxxxxx> · Thu, 18 Aug 2022 11:07:27 +1000

On Wed, Aug 17, 2022 at 08:02:23AM -0400, Jeff Layton wrote:
> On Wed, 2022-08-17 at 08:42 +1000, Dave Chinner wrote:
> > On Tue, Aug 16, 2022 at 11:58:06AM -0400, Jeff Layton wrote:
> > > On Tue, 2022-08-16 at 08:43 -0700, Darrick J. Wong wrote:
> > > > On Tue, Aug 16, 2022 at 09:17:36AM -0400, Jeff Layton wrote:
> > > > > The i_version in xfs_trans_log_inode is bumped for any inode update,
> > > > > including atime-only updates due to reads. We don't want to record those
> > > > > in the i_version, as they don't represent "real" changes. Remove that
> > > > > callsite.
> > > > > 
> > > > > In xfs_vn_update_time, if S_VERSION is flagged, then attempt to bump the
> > > > > i_version and turn on XFS_ILOG_CORE if it happens. In
> > > > > xfs_trans_ichgtime, update the i_version if the mtime or ctime are being
> > > > > updated.
> > > > 
> > > > What about operations that don't touch the mtime but change the file
> > > > metadata anyway?  There are a few of those, like the blockgc garbage
> > > > collector, deduperange, and the defrag tool.
> > > > 
> > > 
> > > Do those change the c/mtime at all?
> > > 
> > > It's possible we're missing some places that should change the i_version
> > > as well. We may need some more call sites.
> > > 
> > > > Zooming out a bit -- what does i_version signal, concretely?  I thought
> > > > it was used by nfs (and maybe ceph?) to signal to clients that the file
> > > > on the server has moved on, and the client needs to invalidate its
> > > > caches.  I thought afs had a similar generation counter, though it's
> > > > only used to cache file data, not metadata?  Does an i_version change
> > > > cause all of them to invalidate caches, or is there more behavior I
> > > > don't know about?
> > > > 
> > > 
> > > For NFS, it indicates a change to the change attribute indicates that
> > > there has been a change to the data or metadata for the file. atime
> > > changes due to reads are specifically exempted from this, but we do bump
> > > the i_version if someone (e.g.) changes the atime via utimes(). 
> > 
> > We have relatime behaviour to optimise away unnecessary atime
> > updates on reads.  Trying to explicitly exclude i_version from atime
> > updates in one filesystem just because NFS doesn't need that
> > information seems ....  misguided.  The -on disk- i_version
> > field behaviour is defined by the filesystem implementation, not the
> > NFS requirements.
> 
> -o relatime does not fix this.

So why not fix -o relatime to handle this? That way the fix works
for all filesystems and doesn't require hacking around what the VFS
has told us to do in every filesystem.

i.e. the VFS told us to update atime, we updated atime and that is a
persistent metadata change to the inode. Hence a filesystem with a
persistent change counter has to bump the change counter because
we've been asked by the VFS to make a persistent metadata change.

If you want atime updates to not make persistent changes to on disk
metadata, then change the relatime implementation so that it doesn't
ask the filesystem to update the on-disk atime.

Essentially, what I'm hearing is that NFS wants atime updates to
behave like lazytime, not like relatime. With lazytime, atime always
gets updated in memory, but it is not written back to the filesystem
until a timeout or some other modification is made to the inode or
file data. THe filesystem doesn't bump iversion until the timestamp
gets written back in the lazytime case.

IOWs, we already have a mechanism in the kernel for making atime
updates behave exactly as NFS wants: -o lazytime.

> > > The NFS client will generally invalidate its caches for the inode when
> > > it notices a change attribute change.
> > > 
> > > FWIW, AFS may not meet this standard since it doesn't generally
> > > increment the counter on metadata changes. It may turn out that we don't
> > > want to expose this to the AFS client due to that (or maybe come up with
> > > some way to indicate this difference).
> > 
> > In XFS, we've defined the on-disk i_version field to mean
> > "increments with any persistent inode data or metadata change",
> > regardless of what the high level applications that use i_version
> > might actually require.
> > 
> > That some network filesystem might only need a subset of the
> > metadata to be covered by i_version is largely irrelevant - if we
> > don't cover every persistent inode metadata change with i_version,
> > then applications that *need* stuff like atime change notification
> > can't be supported.
> > 
> > > > Does that mean that we should bump i_version for any file data or
> > > > attribute that could be queried or observed by userspace?  In which case
> > > > I suppose this change is still correct, even if it relaxes i_version
> > > > updates from "any change to the inode whatsoever" to "any change that
> > > > would bump mtime".  Unless FIEMAP is part of "attributes observed by
> > > > userspace".
> > > > 
> > > > (The other downside I can see is that now we have to remember to bump
> > > > timestamps for every new file operation we add, unlike the current code
> > > > which is centrally located in xfs_trans_log_inode.)
> > > > 
> > > 
> > > The main reason for the change attribute in NFS was that NFSv3 is
> > > plagued with cache-coherency problems due to coarse-grained timestamp
> > > granularity. It was conceived as a way to indicate that the inode had
> > > changed without relying on timestamps.
> > 
> > Yes, and the most important design consideration for a filesystem is
> > that it -must be persistent-. The constraints on i_version are much
> > stricter than timestamps, and they are directly related to how the
> > filesystem persists metadata changes, not how metadata is changed or
> > accessed in memory.
> > 
> > > In practice, we want to bump the i_version counter whenever the ctime or
> > > mtime would be changed.
> > 
> > What about O_NOCMTIME modifications? What about lazytime
> > filesystems? These explicilty avoid or delay persisten c/mtime
> > updates, and that means bumping i_version only based on c/mtime
> > updates cannot be relied on. i_version is supposed to track user
> > visible data and metadata changes, *not timestamp updates*.
> 
> I was speaking more about the sorts of activity that should result in
> the i_version being changed, not about tracking timestamp updates. IOW,
> if some activity would cause the mtime or ctime to change, then we want
> to also bump the i_version.
> 
> Specifically, for NOCMTIME, I think we'd still want the i_version to
> change since that option is about timestamps and not i_version.

Exactly my point: this is what XFS currently does. It is also what
your proposed changes break by tying i_version updates to c/mtime
updates.

> > Hence, I don't think that trying to modify how filesystems persist
> > and maintain i_version coherency because NFS "doesn't need i_version
> > to cover atime updates" is the wrong approach. On-disk i_version
> > coherency has to work for more than just one NFS implementation
> > (especially now i_version will be exported to userspace!). 
> > Persistent atime updates are already optimised away by relatime, and
> > so I think that any further atime filtering is largely a NFS
> > application layer problem and not something that should be solved by
> > changing the on-disk definition of back end filesystem structure
> > persistence.
> > 
> 
> Fair enough. xfs is not really in my wheelhouse so take this as a patch
> that helps illustrate the problem, rather than a serious submission.
> 
> There are two consumers of the i_version counter today: the kernel NFS
> server and IMA. Having the i_version reflect atime updates due to reads
> harms both of those use-cases with unneeded cache invalidations on NFS
> and extra measurements on IMA. It would also be problematic for userland
> NFS servers such as ganesha if we end up exposing this to userland.

So you're wanting define an exact behaviour for atime vs I_VERSION
where atime doesn't bump iversion.

However, the definition we baked into the XFS on-disk format is that
the on-disk iversion filed changes for every persistent change made
to the inode. That includes atime updates. We did this for two
reasons - the first is so that we could support any application that
needed change detection, and the second was that knowing how many
changes have occurred to an inode is extremely useful for forensic
purposes (especially given the ability to use O_NOCMTIME to modify
file data).

> atime updates are really a special case when it comes to metadata (and I
> maintain that they are one of the worst ideas in POSIX). The way you're
> choosing to define i_version doesn't really work properly for any
> current or projected use case. I'd like to see that changed.

We chose to do that a decade ago, knowing that it is the
responsibility of the VFS to avoid unnecessary atime updates, not
the responsibility of the filesystem. That was the whole point of
introducing the relatime functionality: fix the problem at the VFS,
not have to work around generic atime behaviour in every filesystem.

> If the concern is fragility of the code going forward, then maybe we can
> go with a different approach. Would it be possible to just have
> xfs_trans_log_inode skip bumping the i_version when the log transaction
> is _only_ for an atime update due to reads? Maybe we could add a new
> XFS_ILOG_ATIME_UPDATE flag or something and set it the appropriate
> codepaths?

No, I don't think this is something we should be hacking around in
individual filesystems. If the VFS tells us we should be updating
atime, we should be updating it and bumping persistent change
counters because *that's what the VFS asked us to do*.

IOWs, if NFS wants atime to be considered "in memory only" as per
the lazytime behaviour, then that behaviour needs to be
supported/changed at the VFS, not at the individual fileystem level.

You could add a subset of SB_LAZYTIME functionality just for atime
updates, handle that entirely within the existing lazytime
infrastructure. THis means the VFS supports non-persistent,
best-effort, non-persistent atime updates directly. The VFS will not
ack filesystems to persist atime updates the moment they are made,
it will tell the filesystem via ->dirty_inode() that it needs to
persist the in-memory timestamps.

This gives all the filesystems the same atime behaviour. If the VFS
is going to expose the persistent change counter to userspace, we
need to have consistent, VFS enforced rules on what is coverred by
the change counter. If atime is not covered by the ipersistent
change counter, then the VFS must not ask the filesystems to persist
atime changes every time atime changes.

Then all filesystems will avoid on-disk atime updates as much as
possible, whilst still gathering them correctly when other
modifications are made.

Until we've got a definition of what this persistent change counter
actually describes and guarantees for userspace, changing filesystem
implementations is premature. And if it is decided that atime is not
going to be considered a "persistent change" by itself, then that
behaviour needs to be encoded at the VFS, not in individual
filesystems....

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx