On Wed, 1 Nov 2023 at 00:16, Jan Kara <jack@xxxxxxx> wrote: > > OK, but is this compatible with the current XFS behavior? AFAICS currently > XFS sets sb->s_time_gran to 1 so timestamps currently stored on disk will > have some mostly random garbage in low bits of the ctime. I really *really* don't think we can use ctime as a "i_version" replacement. The whole fine-granularity patches were well-intentioned, but I do think they were broken. Note that we can't use ctime as a "i_version" replacement for other reasons too - you have filesystems like FAT - which people do want to export - that have a single-second (or is it 2s?) granularity in reality, even though they report a 1ns value in s_time_gran. But here's a suggestion that people may hate, but that might just work in practice: - get rid of i_version entirely - use the "known good" part of ctime as the upper bits of the change counter (and by "known good" I mean tv_sec - or possibly even "tv_sec / 2" if that dim FAT memory of mine is right) - make the rule be that ctime is *never* updated for atime updates (maybe that's already true, I didn't check - maybe it needs a new mount flag for nfsd) - have a per-inode in-memory and vfs-internal (entirely invisible to filesystems) "ctime modification counter" that is *NOT* a timestamp, and is *NOT* i_version - make the rule be that the "ctime modification counter" is always zero, *EXCEPT* if (a) I_VERSION_QUERIED is set AND (b) the ctime modification doesn't modify the "known good" part of ctime so how the "statx change cookie" ends up being "high bits tv_sec of ctime, low bits ctime modification cookie", and the end result of that is: - if all the reads happen after the last write (common case), then the low bits will be zero, because I_VERSION_QUERIED wasn't set when ctime was modified - if you do a write *after* a modification, the ctime cookie is guaranteed to change, because either the known good (sec/2sec) part of ctime is new, *or* the counter gets updated - if the nfs server reboots, the in-memory counter will be cleared again, and so the change cookie will cause client cache invalidations, but *only* for those "ctime changed in the same second _after_ somebody did a read". - any long-time caches of files that don't get modified are all fine, because they will have those low bits zero and depend on just the stable part of ctime that works across filesystems. So there should be no nasty thundering herd issues on long-lived caches on lots of clients if the server reboots, or atime updates every 24 hours or anything like that. and note that *NONE* of this requires any filesystem involvement (except for the rule of "no atime changes ever impact ctime", which may or may not already be true). The filesystem does *not* know about that modification counter, there's no new on-disk stable information. It's entirely possible that I'm missing something obvious, but the above sounds to me like the only time you'd have stale invalidations is really the (unusual) case of having writes after cached reads, and then a reboot. We'd get rid of "inode_maybe_inc_iversion()" entirely, and instead replace it with logic in inode_set_ctime_current() that basically does - if the stable part of ctime changes, clear the new 32-bit counter - if I_VERSION_QUERIED isn't set, clear the new 32-bit counter - otherwise, increment the new 32-bit counter and then the STATX_CHANGE_COOKIE code basically just returns (stable part of ctime << 32) + new 32-bit counter (and again, the "stable part of ctime" is either just tv_sec, or it's "tv_sec >> 1" or whatever). The above does not expose *any* changes to timestamps to users, and should work across a wide variety of filesystems, without requiring any special code from the filesystem itself. And now please all jump on me and say "No, Linus, that won't work, because XYZ". Because it is *entirely* possible that I missed something truly fundamental, and the above is completely broken for some obvious reason that I just didn't think of. Linus