On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote: > On Thu 22-09-22 16:18:02, Jeff Layton wrote: > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > > e.g. The NFS server can track the i_version values when the > > > > NFSD > > > > syncs/commits a given inode. The nfsd can sample i_version it > > > > when > > > > calls ->commit_metadata or flushed data on the inode, and then > > > > when > > > > it peeks at i_version when gathering post-op attrs (or any > > > > other > > > > getattr op) it can decide that there is too much in-memory > > > > change > > > > (e.g. 10,000 counts since last sync) and sync the inode. > > > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > > uncommitted NFS change attr bumps it allows to build up in > > > > memory. > > > > At that point, the NFS server has a bound "maximum write count" > > > > that > > > > can be used in conjunction with the xattr based crash counter > > > > to > > > > determine how the change_attr is bumped by the crash counter. > > > > > > Well, not "trivially". This is the bit where we have to grow > > > struct > > > inode (or the fs-specific inode), as we'll need to know what the > > > latest > > > on-disk value is for the inode. > > > > > > I'm leaning toward doing this on the query side. Basically, when > > > nfsd > > > goes to query the i_version, it'll check the delta between the > > > current > > > version and the latest one on disk. If it's bigger than X then > > > we'd just > > > return NFS4ERR_DELAY to the client. > > > > > > If the delta is >X/2, maybe it can kick off a workqueue job or > > > something > > > that calls write_inode with WB_SYNC_ALL to try to get the thing > > > onto the > > > platter ASAP. > > > > Still looking at this bit too. Probably we can just kick off a > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the > > best? > > "Hope" is not a great assurance regarding data integrity ;) Anyway, > it > depends on how you imagine the "i_version on disk" is going to be > maintained. It could be maintained by NFSD inside > commit_inode_metadata() - > fetch current i_version value before asking filesystem for the sync > and by the > time commit_metadata() returns we know that value is on disk. If we > detect the > current - on_disk is > X/2, we call commit_inode_metadata() and we > are > done. It is not even *that* expensive because usually filesystems > optimize > away unnecessary IO when the inode didn't change since last time it > got > synced. > Note that these approaches requiring 3rd party help in order to track i_version integrity across filesystem crashes all make the idea of adding i_version to statx() a no-go. It is one thing for knfsd to add specialised machinery for integrity checking, but if all applications need to do so, then they are highly unlikely to want to adopt this attribute. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx