On Thu, 2023-10-26 at 13:20 +1100, Dave Chinner wrote: > On Wed, Oct 25, 2023 at 08:25:35AM -0400, Jeff Layton wrote: > > On Wed, 2023-10-25 at 19:05 +1100, Dave Chinner wrote: > > > On Tue, Oct 24, 2023 at 02:40:06PM -0400, Jeff Layton wrote: > > > > On Tue, 2023-10-24 at 10:08 +0300, Amir Goldstein wrote: > > > > > On Tue, Oct 24, 2023 at 6:40 AM Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > > > > > > > > On Mon, Oct 23, 2023 at 02:18:12PM -1000, Linus Torvalds wrote: > > > > > > > On Mon, 23 Oct 2023 at 13:26, Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > > > Does xfs_repair guarantee that changes of atime, or any inode changes > > > > > for that matter, update i_version? No, it does not. > > > > > So IMO, "atime does not update i_version" is not an "on-disk format change", > > > > > it is a runtime behavior change, just like lazytime is. > > > > > > > > This would certainly be my preference. I don't want to break any > > > > existing users though. > > > > > > That's why I'm trying to get some kind of consensus on what > > > rules and/or atime configurations people are happy for me to break > > > to make it look to users like there's a viable working change > > > attribute being supplied by XFS without needing to change the on > > > disk format. > > > > > > > I agree that the only bone of contention is whether to count atime > > updates against the change attribute. I think we have consensus that all > > in-kernel users do _not_ want atime updates counted against the change > > attribute. The only real question is these "legacy" users of > > di_changecount. > > Please stop refering to "legacy users" of di_changecount. Whether > there are users or not is irrelevant - it is defined by the current > on-disk format specification, and as such there may be applications > we do not know about making use of the current behaviour. > > It's like a linux syscall - we can't remove them because there may > be some user we don't know about still using that old syscall. We > simply don't make changes that can potentially break user > applications like that. > > The on disk format is the same - there is software out that we don't > know about that expects a certain behaviour based on the > specification. We don't break the on disk format by making silent > behavioural changes - we require a feature flag to indicate > behaviour has changed so that applications can take appropriate > actions with stuff they don't understand. > > The example for this is the BIGTIME timestamp format change. The on > disk inode structure is physically unchanged, but the contents of > the timestamp fields are encoded very differently. Sure, the older > kernels can read the timestamp data without any sort of problem > occurring, except for the fact the timestamps now appear to be > completely corrupted. > > Changing the meaning of ithe contents of di_changecount is no > different. It might look OK and nothing crashes, but nothing can be > inferred from the value in the field because we don't know how it > has been modified. > > Hence we can't just change the meaning, encoding or behaviour of an > on disk field that would result in existing kernels and applications > doing the wrong thing with that field (either read or write) without > adding a feature flag to indicate what behaviour that field should > have. > > > > > Perhaps this ought to be a mkfs option? Existing XFS filesystems could > > > > still behave with the legacy behavior, but we could make mkfs.xfs build > > > > filesystems by default that work like NFS requires. > > > > > > If we require mkfs to set a flag to change behaviour, then we're > > > talking about making an explicit on-disk format change to select the > > > optional behaviour. That's precisely what I want to avoid. > > > > > > > Right. The on-disk di_changecount would have a (subtly) different > > meaning at that point. > > > > It's not a change that requires drastic retooling though. If we were to > > do this, we wouldn't need to grow the on-disk inode. Booting to an older > > kernel would cause the behavior to revert. That's sub-optimal, but not > > fatal. > > See above: redefining the contents, behaviour or encoding of an on > disk field is a change of the on-disk format specification. > > The rules for on disk format changes that we work to were set in > place long before I started working on XFS. They are sane, well > thought out rules that have stood the test of time and massive new > feature introductions (CRCs, reflink, rmap, etc). And they only work > because we don't allow anyone to bend them for convenience, short > cuts or expediting their pet project. > > > What I don't quite understand is how these tools are accessing > > di_changecount? > > As I keep saying: this is largely irrelevant to the problem at hand. > > > XFS only accesses the di_changecount to propagate the value to and from > > the i_version, > > Yes. XFS has a strong separation between on-disk structures and > in-memory values, and i_version is simply the in-memory field we use > to store the current di_changecount value. We force bump i_version > every time we modify the inode core regardless of whether anyone has > queried i_version because that's what di_changecount requires. i.e. > the filesystem controls the contents of i_version, not the VFS. > > Now that NFS is using a proper abstraction (i.e. vfs_statx()) to get > the change cookie, we really don't need to expose di_changecount in > i_version at all - we could simply copy an internal di_changecount > value into the statx cookie field in xfs_vn_getattr() and there > would be almost no change of behaviour from the perspective of NFS > and IMA at all. > > > and there is nothing besides NFSD and IMA that queries > > the i_version value in-kernel. So, this must be done via some sort of > > userland tool that is directly accessing the block device (or some 3rd > > party kernel module). > > Yup, both of those sort of applications exist. e.g. the DMAPI kernel > module allows direct access to inode metadata through a custom > bulkstat formatter implementation - it returns different information > comapred to the standard XFS one in the upstream kernel. > > > In earlier discussions you alluded to some repair and/or analysis tools > > that depended on this counter. > > Yes, and one of those "tools" is *me*. > > I frequently look at the di_changecount when doing forensic and/or > failure analysis on filesystem corpses. SOE analysis, relative > modification activity, etc all give insight into what happened to > the filesystem to get it into the state it is currently in, and > di_changecount provides information no other metadata in the inode > contains. > > > I took a quick look in xfsprogs, but I > > didn't see anything there. Is there a library or something that these > > tools use to get at this value? > > xfs_db is the tool I use for this, such as: > > $ sudo xfs_db -c "sb 0" -c "a rootino" -c "p v3.change_count" /dev/mapper/fast > v3.change_count = 35 > $ > > The root inode in this filesystem has a change count of 35. The root > inode has 32 dirents in it, which means that no entries have ever > been removed or renamed. This sort of insight into the past history > of inode metadata is largely impossible to get any other way, and > it's been the difference between understanding failure and having no > clue more than once. > > Most block device parsing applications simply write their own > decoder that walks the on-disk format. That's pretty trivial to do, > developers can get all the information needed to do this from the > on-disk format specification documentation we keep on kernel.org... > Fair enough. I'm not here to tell you that you guys that you need to change how di_changecount works. If it's too valuable to keep it counting atime-only updates, then so be it. If that's the case however, and given that the multigrain timestamp work is effectively dead, then I don't see an alternative to growing the on- disk inode. Do you? -- Jeff Layton <jlayton@xxxxxxxxxx>