On Tue, 13 Sep 2022, Dave Chinner wrote: > On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote: > > On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote: > > > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote: > > > Our goal is to ensure that after a crash, any *new* i_versions that we > > > give out or write to disk are larger than any that have previously been > > > given out. We can do that by ensuring that they're equal to at least > > > that old maximum. > > > > > > So think of the 64-bit value we're storing in the superblock as a > > > ceiling on i_version values across all the filesystem's inodes. Call it > > > s_version_max or something. We also need to know what the maximum was > > > before the most recent crash. Call that s_version_max_old. > > > > > > Then we could get correct behavior if we generated i_versions with > > > something like: > > > > > > i_version++; > > > if (i_version < s_version_max_old) > > > i_version = s_version_max_old; > > > if (i_version > s_version_max) > > > s_version_max = i_version + 1; > > > > > > But that last step makes this ludicrously expensive, because for this to > > > be safe across crashes we need to update that value on disk as well, and > > > we need to do that frequently. > > > > > > Fortunately, s_version_max doesn't have to be a tight bound at all. We > > > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at > > > a time. And recognize when we're running up against it way ahead of > > > time, so we only need to say "here's an updated value, could you please > > > make sure it gets to disk sometime in the next twenty minutes"? > > > (Numbers made up.) > > > > > > Sorry, that was way too many words. But I think something like that > > > could work, and make it very difficult to hit any hard limits, and > > > actually not be too complicated?? Unless I missed something. > > > > > > > That's not too many words -- I appreciate a good "for dummies" > > explanation! > > > > A scheme like that could work. It might be hard to do it without a > > spinlock or something, but maybe that's ok. Thinking more about how we'd > > implement this in the underlying filesystems: > > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory > > superblocks for ext4, xfs and btrfs. On the first mount after a crash, > > the filesystem would need to bump s_version_max by the significant > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need > > to do that. > > Why only increment on crash? If the filesystem has been unmounted, > then any cached data is -stale- and must be discarded. e.g. unmount, > run fsck which cleans up corrupt files but does not modify > i_version, then mount. Remote caches are now invalid, but i_version > may not have changed, so we still need the clean unmount-mount cycle > to invalidate caches. I disagree. We do need fsck to cause caches to be invalidated IF IT FOUND SOMETHING TO REPAIR, but not if the filesystem was truely clean. > > IOWs, what we want is a salted i_version value, with the filesystem > providing the unique per-mount salt that gets added to the > externally visible i_version values. I agree this is a simple approach. Possible the best. > > If that's the case, the salt doesn't need to be restricted to just > modifying the upper bits - as long as the salt increments > substantially and independently to the on-disk inode i_version then > we just don't care what bits of the superblock salt change from > mount to mount. > > For XFS we already have a unique 64 bit salt we could use for every > mount - clean or unclean - and guarantee it is larger for every > mount. It also gets substantially bumped by fsck, too. It's called a > Log Sequence Number and we use them to track and strictly order > every modification we write into the log. This is exactly what is > needed for a i_version salt, and it's already guaranteed to be > persistent. Invalidating the client cache on EVERY unmount/mount could impose unnecessary cost. Imagine a client that caches a lot of data (several large files) from a server which is expected to fail-over from one cluster node to another from time to time. Adding extra delays to a fail-over is not likely to be well received. I don't *know* this cost would be unacceptable, and I *would* like to leave it to the filesystem to decide how to manage its own i_version values. So maybe XFS can use the LSN for a salt. If people notice the extra cost, they can complain. Thanks, NeilBrown > > > Would there be a way to ensure that the new s_version_max value has made > > it to disk? > > Yes, but that's not really relevant to the definition of the salt: > we don't need to design the filesystem implementation of a > persistent per-mount salt value. All we need is to define the > behaviour of the salt (e.g. must always increase across a > umount/mount cycle) and then you can let the filesystem developers > worry about how to provide the required salt behaviour and it's > persistence. > > In the mean time, you can implement the salting and testing it by > using the system time to seed the superblock salt - that's good > enough for proof of concept, and as a fallback for filesystems that > cannot provide the required per-mount salt persistence.... > > > Bumping it by a large value and hoping for the best might be > > ok for most cases, but there are always outliers, so it might be > > worthwhile to make an i_version increment wait on that if necessary. > > Nothing should be able to query i_version until the filesystem is > fully recovered, mounted and the salt has been set. Hence no > application (kernel or userspace) should ever see an unsalted > i_version value.... > > -Dave. > -- > Dave Chinner > david@xxxxxxxxxxxxx >