On Tue, 2023-10-31 at 09:37 +1100, Dave Chinner wrote: > On Fri, Oct 27, 2023 at 06:35:58AM -0400, Jeff Layton wrote: > > On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote: > > > On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote: > > > > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote: > > > > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote: > > > > In earlier discussions you alluded to some repair and/or analysis tools > > > > that depended on this counter. > > > > > > Yes, and one of those "tools" is *me*. > > > > > > I frequently look at the di_changecount when doing forensic and/or > > > failure analysis on filesystem corpses. SOE analysis, relative > > > modification activity, etc all give insight into what happened to > > > the filesystem to get it into the state it is currently in, and > > > di_changecount provides information no other metadata in the inode > > > contains. > > > > > > > I took a quick look in xfsprogs, but I > > > > didn't see anything there. Is there a library or something that these > > > > tools use to get at this value? > > > > > > xfs_db is the tool I use for this, such as: > > > > > > $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast > > > v3.change_count = 35 > > > $ > > > > > > The root inode in this filesystem has a change count of 35. The root > > > inode has 32 dirents in it, which means that no entries have ever > > > been removed or renamed. This sort of insight into the past history > > > of inode metadata is largely impossible to get any other way, and > > > it's been the difference between understanding failure and having no > > > clue more than once. > > > > > > Most block device parsing applications simply write their own > > > decoder that walks the on-disk format. That's pretty trivial to do, > > > developers can get all the information needed to do this from the > > > on-disk format specification documentation we keep on kernel.org... > > > > > > > Fair enough. I'm not here to tell you that you guys that you need to > > change how di_changecount works. If it's too valuable to keep it > > counting atime-only updates, then so be it. > > > > If that's the case however, and given that the multigrain timestamp work > > is effectively dead, then I don't see an alternative to growing the on- > > disk inode. Do you? > > Yes, I do see alternatives. That's what I've been trying > (unsuccessfully) to describe and get consensus on. I feel like I'm > being ignored and rail-roaded here, because nobody is even > acknowledging that I'm proposing alternatives and keeps insisting > that the only solution is a change of on-disk format. > > So, I'll summarise the situation *yet again* in the hope that this > time I won't get people arguing about atime vs i-version and what > constitutes an on-disk format change because that goes nowhere and > does nothing to determine which solution might be acceptible. > > The basic situation is this: > > If XFS can ignore relatime or lazytime persistent updates for given > situations, then *we don't need to make periodic on-disk updates of > atime*. This makes the whole problem of "persistent atime update bumps > i_version" go away because then we *aren't making persistent atime > updates* except when some other persistent modification that bumps > [cm]time occurs. > > But I don't want to do this unconditionally - for systems not > running anything that samples i_version we want relatime/lazytime > to behave as they are supposed to and do periodic persistent updates > as per normal. Principle of least surprise and all that jazz. > > So we really need an indication for inodes that we should enable this > mode for the inode. I have asked if we can have per-operation > context flag to trigger this given the needs for io_uring to have > context flags for timestamp updates to be added. > > I have asked if we can have an inode flag set by the VFS or > application code for this. e.g. a flag set by nfsd whenever it accesses a > given inode. > > I have asked if this inode flag can just be triggered if we ever see > I_VERSION_QUERIED set or statx is used to retrieve a change cookie, > and whether this is a reliable mechanism for setting such a flag. > Ok, so to make sure I understand what you're proposing: This would be a new inode flag that would be set in conjunction with I_VERSION_QUERIED (but presumably is never cleared)? When XFS sees this flag set, it would skip sending the atime to disk. Given that you want to avoid on-disk changes, I assume this flag will not be stored on disk. What happens after the NFS server reboots? Consider: 1/ NFS server queries for the i_version and we set the I_NO_ATIME_UPDATES_ON_DISK flag (or whatever) in conjunction with I_VERSION_QUERIED. Some atime updates occur and the i_version isn't bumped (as you'd expect). 2/ The server then reboots. 3/ Server comes back up, and some local task issues a read against the inode. I_NO_ATIME_UPDATES_ON_DISK never had a chance to be set after the reboot, so that atime update ends up incrementing the i_version counter. 4/ client cache invalidation occurs even though there was no write to the file This might reduce some of the spurious i_version bumps, but I don't see how it can eliminate them entirely. > I have suggested mechanisms for using masked off bits of timestamps > to encode sub-timestamp granularity change counts and keep them > invisible to userspace and then not using i_version at all for XFS. > This avoids all the problems that the multi-grain timestamp > infrastructure exposed due to variable granularity of user visible > timestamps and ordering across inodes with different granularity. > This is potentially a general solution, too. > I don't really understand this at all, but trying to do anything with fine-grained timestamps will just run into a lot of the same problems we hit with the multigrain work. If you still see this as a path forward, maybe you can describe it more detail? > So, yeah, there are *lots* of ways we can solve this problem without > needing to change on-disk formats. > -- Jeff Layton <jlayton@xxxxxxxxxx>