On Wed, 24 Aug 2022, Jeff Layton wrote:
> On Wed, 2022-08-24 at 08:24 +1000, NeilBrown wrote:
> > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > On Tue, 2022-08-23 at 21:38 +1000, NeilBrown wrote:
> > > > On Tue, 23 Aug 2022, Jeff Layton wrote:
> > > > > So, we can refer to that and simply say:
> > > > >
> > > > > "If the function updates the mtime or ctime on the inode, then the
> > > > > i_version should be incremented. If only the atime is being updated,
> > > > > then the i_version should not be incremented. The exception to this rule
> > > > > is explicit atime updates via utimes() or similar mechanism, which
> > > > > should result in the i_version being incremented."
> > > >
> > > > Is that exception needed?  utimes() updates ctime.
> > > >
> > > > https://man7.org/linux/man-pages/man2/utimes.2.html
> > > >
> > > > doesn't say that, but
> > > >
> > > > https://pubs.opengroup.org/onlinepubs/007904875/functions/utimes.html
> > > >
> > > > does, as does the code.
> > > >
> > >
> > > Oh, good point! I think we can leave that out. Even better!
> >
> > Further, implicit mtime updates (file_update_time()) also update ctime.
> > So all you need is
> >    If the function updates the ctime, then i_version should be
> >    incremented.
> >
> > and I have to ask - why not just use the ctime?  Why have another number
> > that is parallel?
> >
> > Timestamps are updated at HZ (ktime_get_coarse) which is at most every
> > millisecond.
> > xfs stores nanosecond resolution, so about 20 bits are currently wasted.
> > We could put a counter like i_version in there that only increments
> > after it is viewed, then we can get all the precision we need but with
> > exactly ctime semantics.
> >
> > The 64-bit change-id could comprise
> >    35 bits of seconds (nearly a millennium)
> >    16 bits of sub-seconds (just in case a higher precision time was wanted
> >       one day)
> >    13 bits of counter. - 8192 changes per tick
>
> We'd need a "seen" flag too, so maybe only 4096 changes per tick...

The "seen" flag does not need to be visible to NFSv4.  Nor does it need
to appear on storage.  Though it may still be easier to include it
with the counter bits.

>
> >
> > The value exposed in i_ctime would hide the counter and just show the
> > timestamp portion of what the filesystem stores.  This would ensure we
> > never get changes on different files that happen in one order leaving
> > timestamps with the reversed order (the timestamps could be the same,
> > but that is expected).
> >
> > This scheme could be made to handle a sustained update rate of 1
> > increment every 8 nanoseconds (if the counter were allowed to overflow
> > into unused bits of the sub-second field).  This is one every 24 CPU
> > cycles.  Incrementing a counter and making it visible to all CPUs can
> > probably be done in 24 cycles.  Accessing it and setting the "seen" flag
> > as well might just fit with faster memory.  Getting any other useful
> > work done while maintaining that rate on a single file seems unlikely.
>
> This is an interesting idea.
>
> So, for NFSv4 you'd just mask off the counter bits (and "seen" flag) to
> get the ctime, and for the change attribute we'd just mask off the
> "seen" flag and put it all in there.

Obviously it isn't just NFSv4 that needs the ctime, it is also the
vfs...
I imagine that the counter would be separate in the in-memory inode.  It
would be split out when read from storage, and merged in when written to
storage.

>
> Implementing that for all filesystems would be a huge project though.
> If we were implementing the i_version counter from scratch, I'd
> probably do something along these lines. Given that we already have
> an existing i_version counter, would there be any real benefit to
> pursuing this avenue instead?

i_version is currently only supported by btrfs, ext4, and xfs.  Plus
cephfs which has its own internal ideas.
So "all filesystems" isn't needed.  Let's just start with xfs.

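To make the masking discussed above concrete, here is a rough userspace
sketch of one possible 64-bit layout.  It assumes the bit counts from
this thread (35 bits of seconds, 16 of sub-seconds, 12 of counter and 1
"seen" bit) and places the "seen" flag in bit 0, which is only my
assumption.  Every name below (the CHG_* macros and chg_* helpers) is
made up for illustration; none of it is existing kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths, taken from the numbers proposed in the thread. */
#define CHG_SEEN_BITS    1
#define CHG_COUNT_BITS   12
#define CHG_SUBSEC_BITS  16
#define CHG_SEC_BITS     35

_Static_assert(CHG_SEEN_BITS + CHG_COUNT_BITS + CHG_SUBSEC_BITS +
               CHG_SEC_BITS == 64, "fields must fill exactly 64 bits");

/* Assumed ordering: seconds on top, then sub-seconds, counter, seen flag. */
#define CHG_COUNT_SHIFT  CHG_SEEN_BITS
#define CHG_SUBSEC_SHIFT (CHG_COUNT_SHIFT + CHG_COUNT_BITS)
#define CHG_SEC_SHIFT    (CHG_SUBSEC_SHIFT + CHG_SUBSEC_BITS)

#define CHG_MASK(bits)   ((UINT64_C(1) << (bits)) - 1)
#define CHG_SEEN_MASK    UINT64_C(1)
#define CHG_LOW_MASK     CHG_MASK(CHG_SUBSEC_SHIFT)  /* counter + seen flag */

/* Combine the separate fields into one 64-bit value. */
static inline uint64_t chg_pack(uint64_t sec, uint32_t subsec,
                                uint32_t count, int seen)
{
    assert(sec <= CHG_MASK(CHG_SEC_BITS));
    assert(subsec <= CHG_MASK(CHG_SUBSEC_BITS));
    assert(count <= CHG_MASK(CHG_COUNT_BITS));
    return (sec << CHG_SEC_SHIFT) |
           ((uint64_t)subsec << CHG_SUBSEC_SHIFT) |
           ((uint64_t)count << CHG_COUNT_SHIFT) |
           (seen ? CHG_SEEN_MASK : 0);
}

/* Timestamp view: hide both the counter and the "seen" flag. */
static inline uint64_t chg_ctime_bits(uint64_t v)
{
    return v & ~CHG_LOW_MASK;
}

/* Change-attribute view: hide only the "seen" flag. */
static inline uint64_t chg_change_attr(uint64_t v)
{
    return v & ~CHG_SEEN_MASK;
}

/* Field accessors, mostly useful for checking the layout. */
static inline uint64_t chg_seconds(uint64_t v)
{
    return v >> CHG_SEC_SHIFT;
}

static inline uint32_t chg_counter(uint64_t v)
{
    return (uint32_t)((v >> CHG_COUNT_SHIFT) & CHG_MASK(CHG_COUNT_BITS));
}
```

With this ordering, two values that share a timestamp differ only in
their low 13 bits, so chg_ctime_bits() gives the same result for both
while chg_change_attr() still tells them apart.
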
All we need is for xfs to store in ->i_version a value that meets the
semantics that we specify for ->i_version.
So we need to change xfs to use somewhere else to store its internal
counter that is used for forensics, and then arrange that ->i_version
stores the ctime combined with a counter that resets whenever the ctime
changes.

I think most of this would be done in xfs_vn_update_time(), but probably
some changes would be needed in iversion.h to provide useful support.

If ext4's current use of i_version provides the semantics that we need,
there would be no need to change it.  Ditto for btrfs.

NeilBrown
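
For illustration only, the update rule described above (a counter that
resets whenever the ctime changes and otherwise only advances after the
value has been viewed) might look roughly like this in userspace C.
This is a sketch under assumptions, not a proposal for the actual xfs
code: struct demo_inode, demo_update_ctime() and demo_query_change()
are hypothetical names, the caller passes in the coarse time, there is
no locking or atomicity, and the bit positions are one possible reading
of the layout from earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_COUNT_BITS   12                      /* assumed counter width */
#define DEMO_COUNT_MAX    ((1u << DEMO_COUNT_BITS) - 1)
#define DEMO_COUNT_SHIFT  1                       /* bit 0 left for "seen" */
#define DEMO_SUBSEC_SHIFT 13
#define DEMO_SEC_SHIFT    29

/* Hypothetical in-memory state; the counter is kept apart from the ctime. */
struct demo_inode {
    uint64_t ctime_sec;
    uint32_t ctime_nsec;   /* coarse timestamp, advances once per tick */
    uint32_t counter;      /* changes observed within the current tick */
    bool     seen;         /* has the current value been handed out?   */
};

/* Called for every change that would update the ctime. */
static inline void demo_update_ctime(struct demo_inode *ino,
                                     uint64_t now_sec, uint32_t now_nsec)
{
    if (ino->ctime_sec != now_sec || ino->ctime_nsec != now_nsec) {
        /* The timestamp moved on: the counter starts again from zero. */
        ino->ctime_sec = now_sec;
        ino->ctime_nsec = now_nsec;
        ino->counter = 0;
        ino->seen = false;
    } else if (ino->seen) {
        /* Same tick, and somebody saw the old value: make a new one. */
        assert(ino->counter < DEMO_COUNT_MAX);  /* overflow not handled */
        ino->counter++;
        ino->seen = false;
    }
    /* Same tick and nobody looked: nothing needs to change. */
}

/* Return the change attribute and remember that it has been viewed. */
static inline uint64_t demo_query_change(struct demo_inode *ino)
{
    /* Scale nanoseconds to 16-bit binary fractions of a second. */
    uint64_t subsec = ((uint64_t)ino->ctime_nsec << 16) / 1000000000u;

    ino->seen = true;
    return (ino->ctime_sec << DEMO_SEC_SHIFT) |
           (subsec << DEMO_SUBSEC_SHIFT) |
           ((uint64_t)ino->counter << DEMO_COUNT_SHIFT);
}
```

The point of the "seen" flag is visible in the last branch of
demo_update_ctime(): repeated changes within one tick cost nothing
extra unless a reader has actually sampled the value in between.
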