On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote: > > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote: > > > > How do we determine what that offset should be? Your last email > > > > suggested that there really is no limit to the number of i_version bumps > > > > that can happen in memory before one of them makes it to disk. What can > > > > we do to address that? > > > > > > <shrug> > > > > > > I'm just pointing out problems I see when defining this as behaviour > > > for on-disk format purposes. If we define it as part of the on-disk > > > format, then we have to be concerned about how it may be used > > > outside the scope of just the NFS server application. > > > > > > However, If NFS keeps this metadata and functionaly entirely > > > contained at the application level via xattrs, I really don't care > > > what algorithm NFS developers decides to use for their crash > > > sequencing. It's not my concern at this point, and that's precisely > > > why NFS should be using xattrs for this NFS specific functionality. > > > > > > > I get it: you'd rather not have to deal with what you see as an NFS > > problem, but I don't get how what you're proposing solves anything. We > > might be able to use that scheme to detect crashes, but that's only part > > of the problem (and it's a relatively simple part of the problem to > > solve, really). > > > > Maybe you can clarify it for me: > > > > Suppose we go with what you're saying and store some information in > > xattrs that allows us to detect crashes in some fashion. The server > > crashes and comes back up and we detect that there was a crash earlier. > > > > What does nfsd need to do now to ensure that it doesn't hand out a > > duplicate change attribute? > > As I've already stated, the NFS server can hold the persistent NFS > crash counter value in a second xattr that it bumps whenever it > detects a crash and hence we take the local filesystem completely > out of the equation. How the crash counter is then used by the nfsd > to fold it into the NFS protocol change attribute is a nfsd problem, > not a local filesystem problem. > Ok, assuming you mean put this in an xattr that lives at the root of the export? We only need this for IS_I_VERSION filesystems (btrfs, xfs, and ext4), and they all support xattrs so this scheme should work. > If you're worried about maximum number of writes outstanding vs > i_version bumps that are held in memory, then *bound the maximum > number of uncommitted i_version changes that the NFS server will > allow to build up in memory*. By moving the crash counter to being a > NFS server only function, the NFS server controls the entire > algorithm and it doesn't have to care about external 3rd party > considerations like local filesystems have to. > Yeah, this is the bigger consideration. > e.g. The NFS server can track the i_version values when the NFSD > syncs/commits a given inode. The nfsd can sample i_version it when > calls ->commit_metadata or flushed data on the inode, and then when > it peeks at i_version when gathering post-op attrs (or any other > getattr op) it can decide that there is too much in-memory change > (e.g. 10,000 counts since last sync) and sync the inode. > > i.e. the NFS server can trivially cap the maximum number of > uncommitted NFS change attr bumps it allows to build up in memory. > At that point, the NFS server has a bound "maximum write count" that > can be used in conjunction with the xattr based crash counter to > determine how the change_attr is bumped by the crash counter. Well, not "trivially". This is the bit where we have to grow struct inode (or the fs-specific inode), as we'll need to know what the latest on-disk value is for the inode. I'm leaning toward doing this on the query side. Basically, when nfsd goes to query the i_version, it'll check the delta between the current version and the latest one on disk. If it's bigger than X then we'd just return NFS4ERR_DELAY to the client. If the delta is >X/2, maybe it can kick off a workqueue job or something that calls write_inode with WB_SYNC_ALL to try to get the thing onto the platter ASAP. -- Jeff Layton <jlayton@xxxxxxxxxx>