Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 16 Sep 2022 11:11:34 -0400

On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > If the answer is that 'all values change', then why store the crash
> > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > you're generating the user-visible change attribute?
> > > > > 
> > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > 
> > I had suggested just hashing the crash counter with the file system's
> > on-disk i_version number, which is essentially what you are suggesting.
> > 
> > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > crash, we can do that.
> > > > 
> > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > there wouldn't be more than 4k increments before one of them made it to
> > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > 
> > > Leave it up to the filesystem to decide.  The VFS and/or NFSD should
> > > not have a part in calculating the i_version.  It should be entirely
> > > in the filesystem - though support code could be provided if common
> > > patterns exist across filesystems.
> > 
> > Oh, *heck* no.  This parameter is for the NFS implementation to
> > decide, because it's NFS's caching algorithms which are at stake here.
> > 
> > As the file system maintainer, I had offered to make an on-disk
> > "crash counter" which would get updated when the journal had gotten
> > replayed, in addition to the on-disk i_version number.  This will be
> > available for the Linux implementation of NFSD to use, but that's up
> > to *you* to decide how you want to use them.
> > 
> > I was perfectly happy with hashing the crash counter and the i_version
> > because I had assumed that not *that* much stuff was going to be
> > cached, and so invalidating all of the caches in the unusual case
> > where there was a crash was acceptable.  After all it's a !@#?!@
> > cache.  Caches sometimes get invalidated.  "That is the order of
> > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > 
> > But if people expect that multiple TB's of data is going to be stored;
> > that cache invalidation is unacceptable; and that an itsy-weeny chance
> > of false negative failures which might cause data corruption might be
> > an acceptable tradeoff, hey, that's for the system which is providing
> > caching semantics to determine.
> > 
> > PLEASE don't put this tradeoff on the file system authors; I would
> > much prefer to leave this tradeoff in the hands of the system which is
> > trying to do the caching.
> > 
> 
> Yeah, if we were designing this from scratch, I might agree with leaving
> more up to the filesystem, but the existing users all have pretty much
> the same needs. I plan to try to keep most of this in the
> common infrastructure defined in iversion.h.
> 
> Ted, for the ext4 crash counter, what wordsize were you thinking? I
> doubt we'll be able to use much more than 32 bits so a larger integer is
> probably not worthwhile. There are several holes in struct super_block
> (at least on x86_64), so adding this field to the generic structure
> needn't grow it.

That said, now that I've taken a swipe at implementing this, I need more
information than just the crash counter. We need to multiply the crash
counter by a reasonable estimate of the maximum number of individual
writes that could occur between an i_version being incremented and that
value making it to the backing store.

IOW, given a write that bumps the i_version to X, how many more write
calls could race in before X makes it to the platter? I took a SWAG and
said 4k in an earlier email, but I don't really have a way to know, and
that could vary wildly with different filesystems and storage.

What I'd like to see is this in struct super_block:

	u32		s_version_offset;

...and then individual filesystems can calculate:

	crash_counter * max_number_of_writes

and put the correct value in there at mount time.
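
To make that concrete, here is a rough userspace sketch of how the pieces
might fit together. Everything except the s_version_offset field name is
hypothetical: struct sb_sketch stands in for struct super_block, the
sketch_*() helpers are made-up names, and adding the offset to i_version
follows the "i_version + (crash counter * offset)" idea quoted above
rather than any existing kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of struct super_block. */
struct sb_sketch {
	uint32_t	s_version_offset;
};

/*
 * Filesystem side, at mount time: scale the crash counter by the
 * filesystem's own estimate of how many i_version bumps could be lost
 * before one reaches the backing store. Both inputs are assumptions
 * supplied by the filesystem. The u32 product wraps modulo 2^32.
 */
static void sketch_set_version_offset(struct sb_sketch *sb,
				      uint32_t crash_counter,
				      uint32_t max_number_of_writes)
{
	assert(max_number_of_writes > 0);
	sb->s_version_offset = crash_counter * max_number_of_writes;
}

/*
 * Common-code side: the user-visible change attribute is the stored
 * i_version plus the per-superblock offset.
 */
static uint64_t sketch_change_attr(const struct sb_sketch *sb,
				   uint64_t i_version)
{
	return i_version + sb->s_version_offset;
}
```

With max_number_of_writes at my 4k guess: a client saw change attr 10
before the crash, but only i_version 7 had reached disk. After one crash
the offset is 4096, so the same inode reports 4103 and never reuses the
values 8..10. That only holds as long as fewer than 4096 increments were
lost, which is exactly the estimate I don't know how to make.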

-- 
Jeff Layton <jlayton@xxxxxxxxxx>