On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote: > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote: > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote: > > > > > If the answer is that 'all values change', then why store the crash > > > > > counter in the inode at all? Why not just add it as an offset when > > > > > you're generating the user-visible change attribute? > > > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset) > > > > I had suggested just hashing the crash counter with the file system's > > on-disk i_version number, which is essentially what you are suggested. > > > > > > Yes, if we plan to ensure that all the change attrs change after a > > > > crash, we can do that. > > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that > > > > there wouldn't be more than 4k increments before one of them made it to > > > > disk. OTOH, maybe that can happen with teeny-tiny writes. > > > > > > Leave it up the to filesystem to decide. The VFS and/or NFSD should > > > have not have part in calculating the i_version. It should be entirely > > > in the filesystem - though support code could be provided if common > > > patterns exist across filesystems. > > > > Oh, *heck* no. This parameter is for the NFS implementation to > > decide, because it's NFS's caching algorithms which are at stake here. > > > > As a the file system maintainer, I had offered to make an on-disk > > "crash counter" which would get updated when the journal had gotten > > replayed, in addition to the on-disk i_version number. This will be > > available for the Linux implementation of NFSD to use, but that's up > > to *you* to decide how you want to use them. > > > > I was perfectly happy with hashing the crash counter and the i_version > > because I had assumed that not *that* much stuff was going to be > > cached, and so invalidating all of the caches in the unusual case > > where there was a crash was acceptable. After all it's a !@#?!@ > > cache. Caches sometimmes get invalidated. "That is the order of > > things." (as Ramata'Klan once said in "Rocks and Shoals") > > > > But if people expect that multiple TB's of data is going to be stored; > > that cache invalidation is unacceptable; and that a itsy-weeny chance > > of false negative failures which might cause data corruption might be > > acceptable tradeoff, hey, that's for the system which is providing > > caching semantics to determine. > > > > PLEASE don't put this tradeoff on the file system authors; I would > > much prefer to leave this tradeoff in the hands of the system which is > > trying to do the caching. > > > > Yeah, if we were designing this from scratch, I might agree with leaving > more up to the filesystem, but the existing users all have pretty much > the same needs. I'm going to plan to try to keep most of this in the > common infrastructure defined in iversion.h. > > Ted, for the ext4 crash counter, what wordsize were you thinking? I > doubt we'll be able to use much more than 32 bits so a larger integer is > probably not worthwhile. There are several holes in struct super_block > (at least on x86_64), so adding this field to the generic structure > needn't grow it. That said, now that I've taken a swipe at implementing this, I need more information than just the crash counter. We need to multiply the crash counter with a reasonable estimate of the maximum number of individual writes that could occur between an i_version being incremented and that value making it to the backing store. IOW, given a write that bumps the i_version to X, how many more write calls could race in before X makes it to the platter? I took a SWAG and said 4k in an earlier email, but I don't really have a way to know, and that could vary wildly with different filesystems and storage. What I'd like to see is this in struct super_block: u32 s_version_offset; ...and then individual filesystems can calculate: crash_counter * max_number_of_writes and put the correct value in there at mount time. -- Jeff Layton <jlayton@xxxxxxxxxx>