On Wed, 2023-11-01 at 10:10 -1000, Linus Torvalds wrote: > On Wed, 1 Nov 2023 at 00:16, Jan Kara <jack@xxxxxxx> wrote: > > > > OK, but is this compatible with the current XFS behavior? AFAICS > > currently > > XFS sets sb->s_time_gran to 1 so timestamps currently stored on > > disk will > > have some mostly random garbage in low bits of the ctime. > > I really *really* don't think we can use ctime as a "i_version" > replacement. The whole fine-granularity patches were well- > intentioned, > but I do think they were broken. > > Note that we can't use ctime as a "i_version" replacement for other > reasons too - you have filesystems like FAT - which people do want to > export - that have a single-second (or is it 2s?) granularity in > reality, even though they report a 1ns value in s_time_gran. > > But here's a suggestion that people may hate, but that might just > work > in practice: > > - get rid of i_version entirely > > - use the "known good" part of ctime as the upper bits of the change > counter (and by "known good" I mean tv_sec - or possibly even "tv_sec > / 2" if that dim FAT memory of mine is right) > > - make the rule be that ctime is *never* updated for atime updates > (maybe that's already true, I didn't check - maybe it needs a new > mount flag for nfsd) > > - have a per-inode in-memory and vfs-internal (entirely invisible to > filesystems) "ctime modification counter" that is *NOT* a timestamp, > and is *NOT* i_version > > - make the rule be that the "ctime modification counter" is always > zero, *EXCEPT* if > (a) I_VERSION_QUERIED is set > AND > (b) the ctime modification doesn't modify the "known good" part > of ctime > > so how the "statx change cookie" ends up being "high bits tv_sec of > ctime, low bits ctime modification cookie", and the end result of > that > is: > > - if all the reads happen after the last write (common case), then > the low bits will be zero, because I_VERSION_QUERIED wasn't set when > ctime was modified > > - if you do a write *after* a modification, the ctime cookie is > guaranteed to change, because either the known good (sec/2sec) part > of > ctime is new, *or* the counter gets updated > > - if the nfs server reboots, the in-memory counter will be cleared > again, and so the change cookie will cause client cache > invalidations, > but *only* for those "ctime changed in the same second _after_ > somebody did a read". > > - any long-time caches of files that don't get modified are all > fine, > because they will have those low bits zero and depend on just the > stable part of ctime that works across filesystems. So there should > be > no nasty thundering herd issues on long-lived caches on lots of > clients if the server reboots, or atime updates every 24 hours or > anything like that. > > and note that *NONE* of this requires any filesystem involvement > (except for the rule of "no atime changes ever impact ctime", which > may or may not already be true). > > The filesystem does *not* know about that modification counter, > there's no new on-disk stable information. > > It's entirely possible that I'm missing something obvious, but the > above sounds to me like the only time you'd have stale invalidations > is really the (unusual) case of having writes after cached reads, and > then a reboot. > > We'd get rid of "inode_maybe_inc_iversion()" entirely, and instead > replace it with logic in inode_set_ctime_current() that basically > does > > - if the stable part of ctime changes, clear the new 32-bit counter > > - if I_VERSION_QUERIED isn't set, clear the new 32-bit counter > > - otherwise, increment the new 32-bit counter > > and then the STATX_CHANGE_COOKIE code basically just returns > > (stable part of ctime << 32) + new 32-bit counter > > (and again, the "stable part of ctime" is either just tv_sec, or it's > "tv_sec >> 1" or whatever). > > The above does not expose *any* changes to timestamps to users, and > should work across a wide variety of filesystems, without requiring > any special code from the filesystem itself. > > And now please all jump on me and say "No, Linus, that won't work, > because XYZ". > > Because it is *entirely* possible that I missed something truly > fundamental, and the above is completely broken for some obvious > reason that I just didn't think of. > My client writes to the file and immediately reads the ctime. A 3rd party client then writes immediately after my ctime read. A reboot occurs (maybe minutes later), then I re-read the ctime, and get the same value as before the 3rd party write. Yes, most of the time that is better than the naked ctime, but not across a reboot. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx