Re: [man-pages RFC PATCH v4] statx, inode: document the new STATX_INO_VERSION field

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 22 Sep 2022 16:18:02 -0400

On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote:
> > > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote:
> > > > > How do we determine what that offset should be? Your last email
> > > > > suggested that there really is no limit to the number of i_version bumps
> > > > > that can happen in memory before one of them makes it to disk. What can
> > > > > we do to address that?
> > > > 
> > > > <shrug>
> > > > 
> > > > I'm just pointing out problems I see when defining this as behaviour
> > > > for on-disk format purposes. If we define it as part of the on-disk
> > > > format, then we have to be concerned about how it may be used
> > > > outside the scope of just the NFS server application. 
> > > > 
> > > > However, If NFS keeps this metadata and functionaly entirely
> > > > contained at the application level via xattrs, I really don't care
> > > > what algorithm NFS developers decides to use for their crash
> > > > sequencing. It's not my concern at this point, and that's precisely
> > > > why NFS should be using xattrs for this NFS specific functionality.
> > > > 
> > > 
> > > I get it: you'd rather not have to deal with what you see as an NFS
> > > problem, but I don't get how what you're proposing solves anything. We
> > > might be able to use that scheme to detect crashes, but that's only part
> > > of the problem (and it's a relatively simple part of the problem to
> > > solve, really).
> > > 
> > > Maybe you can clarify it for me:
> > > 
> > > Suppose we go with what you're saying and store some information in
> > > xattrs that allows us to detect crashes in some fashion. The server
> > > crashes and comes back up and we detect that there was a crash earlier.
> > > 
> > > What does nfsd need to do now to ensure that it doesn't hand out a
> > > duplicate change attribute? 
> > 
> > As I've already stated, the NFS server can hold the persistent NFS
> > crash counter value in a second xattr that it bumps whenever it
> > detects a crash and hence we take the local filesystem completely
> > out of the equation.  How the crash counter is then used by the nfsd
> > to fold it into the NFS protocol change attribute is a nfsd problem,
> > not a local filesystem problem.
> > 
> 
> Ok, assuming you mean put this in an xattr that lives at the root of the
> export? We only need this for IS_I_VERSION filesystems (btrfs, xfs, and
> ext4), and they all support xattrs so this scheme should work.
> 

I had a look at this today and it's not as straightforward as it
sounds. 

In particular, there is no guarantee that an export will not cross
filesystem boundaries. Also, nfsd and mountd are very much "demand
driven". We might not touch an exported filesystem at all if nothing
asks for it. Ensuring we can do something to every exported filesystem
after a crash is more difficult than it sounds.

So trying to do something with xattrs on the exported filesystems is
probably not what we want. It's also sort of janky since we do strive to
leave a "light footprint" on the exported filesystem.

Maybe we don't need that though. Chuck reminded me that nfsdcltrack
could be used here instead. We can punt this to userland!

nfsdcltrack could keep track of a global crash "salt", and feed that to
nfsd when it starts up. When starting a grace period, it can set a
RUNNING flag in the db. If it's set when the server starts, we know
there was a crash and can bump the crash counter. When nfsd is shutting
down cleanly, it can call sync() and then clear the flag (this may
require a new cld upcall cmd). We then mix that value into the change
attribute for IS_I_VERSION inodes.

That's probably good enough for nfsd, but if we wanted to present this
to userland via statx, we'd need a different mechanism. For now, I'm
going to plan to fix this up in nfsd and then we'll see where we are.

> > If you're worried about maximum number of writes outstanding vs
> > i_version bumps that are held in memory, then *bound the maximum
> > number of uncommitted i_version changes that the NFS server will
> > allow to build up in memory*. By moving the crash counter to being a
> > NFS server only function, the NFS server controls the entire
> > algorithm and it doesn't have to care about external 3rd party
> > considerations like local filesystems have to.
> > 
> 
> Yeah, this is the bigger consideration.
> 
> > e.g. The NFS server can track the i_version values when the NFSD
> > syncs/commits a given inode. The nfsd can sample i_version it when
> > calls ->commit_metadata or flushed data on the inode, and then when
> > it peeks at i_version when gathering post-op attrs (or any other
> > getattr op) it can decide that there is too much in-memory change
> > (e.g. 10,000 counts since last sync) and sync the inode.
> > 
> > i.e. the NFS server can trivially cap the maximum number of
> > uncommitted NFS change attr bumps it allows to build up in memory.
> > At that point, the NFS server has a bound "maximum write count" that
> > can be used in conjunction with the xattr based crash counter to
> > determine how the change_attr is bumped by the crash counter.
> 
> Well, not "trivially". This is the bit where we have to grow struct
> inode (or the fs-specific inode), as we'll need to know what the latest
> on-disk value is for the inode.
> 
> I'm leaning toward doing this on the query side. Basically, when nfsd
> goes to query the i_version, it'll check the delta between the current
> version and the latest one on disk. If it's bigger than X then we'd just
> return NFS4ERR_DELAY to the client.
> 
> If the delta is >X/2, maybe it can kick off a workqueue job or something
> that calls write_inode with WB_SYNC_ALL to try to get the thing onto the
> platter ASAP.

Still looking at this bit too. Probably we can just kick off a
WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best?
-- 
Jeff Layton <jlayton@xxxxxxxxxx>