On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote:
> * Jeff Layton:
>
> > To do this we'd need 2 64-bit fields in the on-disk and in-memory
> > superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> > the filesystem would need to bump s_version_max by the significant
> > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> > to do that.
> >
> > Would there be a way to ensure that the new s_version_max value has made
> > it to disk? Bumping it by a large value and hoping for the best might be
> > ok for most cases, but there are always outliers, so it might be
> > worthwhile to make an i_version increment wait on that if necessary.
>
> How common are unclean shutdowns in practice? Do ex64/XFS/btrfs keep
> counters in the superblocks for journal replays that can be read easily?
>
> Several useful i_version applications could be negatively impacted by
> frequent i_version invalidation.
>

One would hope "not very often", but Oopses _are_ something that happens
occasionally, even in very stable environments, and it would be best if
what we're building can cope with them. Consider:

reader                          writer
----------------------------------------------------------
start with i_version 1
                                inode updated in memory, i_version++
query, get i_version 2

 <<< CRASH : update never makes it to disk, back at 1 after reboot >>>

query, get i_version 1
                                application restarts and redoes write, i_version at 2^40+1
query, get i_version 2^40+1

The main thing we have to avoid here is giving out an i_version that
represents two different states of the same inode. This should achieve
that.

Something else we should consider though is that with enough crashes on
a long-lived filesystem, the value could eventually wrap. I think we
should acknowledge that fact in advance, and plan to deal with it
(particularly if we're going to expose this to userland eventually).

Because of the "seen" flag, we have a 63 bit counter to play with.
Could we use a similar scheme to the one we use to handle when "jiffies"
wraps? Assume that we'd never compare two values that were more than
2^62 apart? We could add i_version_before/i_version_after macros to make
it simple to handle this.
--
Jeff Layton <jlayton@xxxxxxxxxx>
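
For illustration only, here is a rough userspace sketch of how
i_version_before/i_version_after could work on a 63 bit counter, in the
spirit of the wrap-safe comparisons used for jiffies. It assumes the
"seen" flag has already been shifted out of the raw value, and that two
values being compared are never more than 2^62 apart. The
I_VERSION_COUNTER_* names are made up for this example, and the helpers
are written as inline functions rather than macros; none of this is
existing kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names: the counter is 63 bits wide once the flag bit is gone. */
#define I_VERSION_COUNTER_BITS 63
#define I_VERSION_COUNTER_MASK ((UINT64_C(1) << I_VERSION_COUNTER_BITS) - 1)
#define I_VERSION_HALF_RANGE   (UINT64_C(1) << (I_VERSION_COUNTER_BITS - 1))

/* Values further apart than this cannot be ordered reliably. */
static_assert(I_VERSION_HALF_RANGE == (UINT64_C(1) << 62),
              "comparison window should be 2^62");

/*
 * Return true if counter value a is newer than b, allowing for wraparound.
 * The difference is taken modulo 2^63; a is "after" b when that forward
 * distance is non-zero and less than half the counter range.
 */
static inline bool i_version_after(uint64_t a, uint64_t b)
{
	uint64_t diff = (a - b) & I_VERSION_COUNTER_MASK;

	return diff != 0 && diff < I_VERSION_HALF_RANGE;
}

/* Return true if counter value a is older than b, allowing for wraparound. */
static inline bool i_version_before(uint64_t a, uint64_t b)
{
	return i_version_after(b, a);
}
```

With this, a value that has just wrapped (say 3) still compares as
"after" one taken shortly before the wrap (say 2^63 - 2), while equal
values are neither before nor after each other.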