Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

Jeff Layton <jlayton@xxxxxxxxxx> · Fri, 09 Sep 2022 12:36:29 -0400

On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote:
> On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote:
> > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
> > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > > > Yeah, ok. That does make some sense. So we would mix this into the
> > > > i_version instead of the ctime when it was available. Preferably, we'd
> > > > mix that in when we store the i_version rather than adding it afterward.
> > > > 
> > > > Ted, how would we access this? Maybe we could just add a new (generic)
> > > > super_block field for this that ext4 (and other filesystems) could
> > > > populate at mount time?
> > > 
> > > Couldn't the filesystem just return an ino_version that already includes
> > > it?
> > > 
> > 
> > Yes. That's simple if we want to just fold it in during getattr. If we
> > want to fold that into the values stored on disk, then I'm a little less
> > clear on how that will work.
> > 
> > Maybe I need a concrete example of how that will work:
> > 
> > Suppose we have an i_version value X with the previous crash counter
> > already factored in that makes it to disk. We hand out a newer version
> > X+1 to a client, but that value never makes it to disk.
> > 
> > The machine crashes and comes back up, and we get a query for i_version
> > and it comes back as X. Fine, it's an old version. Now there is a write.
> > What do we do to ensure that the new value doesn't collide with X+1? 
> 
> I was assuming we could partition i_version's 64 bits somehow: e.g., top
> 16 bits store the crash counter.  You increment the i_version by: 1)
> replacing the top bits by the new crash counter, if it has changed, and
> 2) incrementing.
> 
> Do the numbers work out?  2^16 mounts after unclean shutdowns sounds
> like a lot for one filesystem, as does 2^48 changes to a single file,
> but people do weird things.  Maybe there's a better partitioning, or
> some more flexible way of maintaining an i_version that still allows you
> to identify whether a given i_version preceded a crash.
> 

We consume one bit to keep track of the "seen" flag, so it would be a
16+47 split. I assume that we'd also reset the version counter to 0 when
the crash counter changes? Maybe that doesn't matter as long as we don't
overflow into the crash counter.
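
To make the split concrete, here is a rough userspace sketch of the layout
being discussed. It is not the kernel's actual i_version code: the names
(iv_pack, iv_bump, etc.) are hypothetical, placing the "seen" flag in the
lowest bit is an assumption, and resetting the version counter to 0 on a crash
counter change is just one of the options floated above. Wrapping within the
47-bit field (instead of carrying into the crash counter) is also an
assumption.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical 16+47+1 layout (illustration only, not a kernel API):
 *   bits 63..48  crash counter (16 bits)
 *   bits 47..1   version counter (47 bits)
 *   bit  0       "seen" flag
 */
#define SEEN_FLAG     ((uint64_t)1)
#define COUNTER_SHIFT 1
#define COUNTER_BITS  47
#define COUNTER_MASK  ((((uint64_t)1) << COUNTER_BITS) - 1)
#define CRASH_SHIFT   (COUNTER_SHIFT + COUNTER_BITS)

/* Combine the three fields into one 64-bit value. */
static uint64_t iv_pack(uint16_t crash, uint64_t counter, int seen)
{
	assert(counter <= COUNTER_MASK);
	return ((uint64_t)crash << CRASH_SHIFT) |
	       (counter << COUNTER_SHIFT) |
	       (seen ? SEEN_FLAG : 0);
}

static uint16_t iv_crash(uint64_t v)
{
	return (uint16_t)(v >> CRASH_SHIFT);
}

static uint64_t iv_counter(uint64_t v)
{
	return (v >> COUNTER_SHIFT) & COUNTER_MASK;
}

/*
 * Bump the version on a change. If the stored crash counter differs from
 * the current one, replace it and restart the version counter from 0
 * (assumption). The increment is masked so it can never carry into the
 * crash counter bits. The "seen" flag is cleared in the new value.
 */
static uint64_t iv_bump(uint64_t old, uint16_t cur_crash)
{
	uint64_t counter = iv_counter(old);

	if (iv_crash(old) != cur_crash)
		counter = 0;
	counter = (counter + 1) & COUNTER_MASK;
	return iv_pack(cur_crash, counter, 0);
}
```

With this layout, the X / X+1 scenario from earlier works out: if X was stored
under crash counter 0 and X+1 was handed out but lost, the first write after
the crash yields a value with crash counter 1, which cannot equal the lost X+1.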

I'm not sure we can get away with 16 bits for the crash counter, as
it'll leave us subject to the version counter wrapping after a long
uptime.

If you increment a counter every nanosecond, how long until that counter
wraps? With 63 bits, that's 292 years (and change). With 16+47 bits,
that's less than two days. An 8+55 split would give us ~417 days, which
seems a bit more reasonable?
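
For anyone who wants to check the arithmetic, a minimal sketch follows. It
assumes the worst case described above (one increment per nanosecond); the
helper name wrap_days is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Days until a counter of the given width wraps, assuming it is
 * incremented once per nanosecond (a worst-case assumption).
 */
static double wrap_days(unsigned int bits)
{
	assert(bits < 64);

	/* 2^bits increments, i.e. 2^bits nanoseconds */
	double ticks = (double)((uint64_t)1 << bits);

	/* nanoseconds -> seconds -> days */
	return ticks / 1e9 / 86400.0;
}
```

Calling wrap_days(47) gives a bit over a day and a half, wrap_days(55) gives
roughly 417 days, and wrap_days(63) / 365.25 gives a little over 292 years.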

For NFS, we can probably live with even fewer bits in the crash counter.

If the crash counter changes, then that means the NFS server itself has
(likely) also crashed. The client will have to reestablish sockets,
reclaim, etc. It should get new attributes for the inodes it cares about
at that time.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>