> -----Original Message----- > From: Jeff Layton [mailto:jlayton@xxxxxxxxxx] > Sent: Friday, September 23, 2022 6:50 AM > To: Trond Myklebust <trondmy@xxxxxxxxxxxxxxx>; jack@xxxxxxx > Cc: zohar@xxxxxxxxxxxxx; djwong@xxxxxxxxxx; brauner@xxxxxxxxxx; linux- > xfs@xxxxxxxxxxxxxxx; bfields@xxxxxxxxxxxx; linux-api@xxxxxxxxxxxxxxx; > neilb@xxxxxxx; david@xxxxxxxxxxxxx; fweimer@xxxxxxxxxx; linux- > kernel@xxxxxxxxxxxxxxx; chuck.lever@xxxxxxxxxx; linux-man@xxxxxxxxxxxxxxx; > linux-nfs@xxxxxxxxxxxxxxx; linux-ext4@xxxxxxxxxxxxxxx; tytso@xxxxxxx; > viro@xxxxxxxxxxxxxxxxxx; xiubli@xxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx; > adilger.kernel@xxxxxxxxx; lczerner@xxxxxxxxxx; ceph-devel@xxxxxxxxxxxxxxx; > linux-btrfs@xxxxxxxxxxxxxxx > Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new > STATX_INO_VERSION field > > On Fri, 2022-09-23 at 13:44 +0000, Trond Myklebust wrote: > > On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote: > > > On Thu 22-09-22 16:18:02, Jeff Layton wrote: > > > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote: > > > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote: > > > > > > e.g. The NFS server can track the i_version values when the > > > > > > NFSD syncs/commits a given inode. The nfsd can sample > > > > > > i_version it when calls ->commit_metadata or flushed data on > > > > > > the inode, and then when it peeks at i_version when gathering > > > > > > post-op attrs (or any other getattr op) it can decide that > > > > > > there is too much in-memory change (e.g. 10,000 counts since > > > > > > last sync) and sync the inode. > > > > > > > > > > > > i.e. the NFS server can trivially cap the maximum number of > > > > > > uncommitted NFS change attr bumps it allows to build up in > > > > > > memory. > > > > > > At that point, the NFS server has a bound "maximum write count" > > > > > > that > > > > > > can be used in conjunction with the xattr based crash counter > > > > > > to determine how the change_attr is bumped by the crash > > > > > > counter. > > > > > > > > > > Well, not "trivially". This is the bit where we have to grow > > > > > struct inode (or the fs-specific inode), as we'll need to know > > > > > what the latest on-disk value is for the inode. > > > > > > > > > > I'm leaning toward doing this on the query side. Basically, when > > > > > nfsd goes to query the i_version, it'll check the delta between > > > > > the current version and the latest one on disk. If it's bigger > > > > > than X then we'd just return NFS4ERR_DELAY to the client. > > > > > > > > > > If the delta is >X/2, maybe it can kick off a workqueue job or > > > > > something that calls write_inode with WB_SYNC_ALL to try to get > > > > > the thing onto the platter ASAP. > > > > > > > > Still looking at this bit too. Probably we can just kick off a > > > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the > > > > best? > > > > > > "Hope" is not a great assurance regarding data integrity ;) Anyway, > > > it depends on how you imagine the "i_version on disk" is going to be > > > maintained. It could be maintained by NFSD inside > > > commit_inode_metadata() - > > > fetch current i_version value before asking filesystem for the sync > > > and by the time commit_metadata() returns we know that value is on > > > disk. If we detect the current - on_disk is > X/2, we call > > > commit_inode_metadata() and we are done. It is not even *that* > > > expensive because usually filesystems optimize away unnecessary IO > > > when the inode didn't change since last time it got synced. > > > > > > > Note that these approaches requiring 3rd party help in order to track > > i_version integrity across filesystem crashes all make the idea of > > adding i_version to statx() a no-go. > > > > It is one thing for knfsd to add specialised machinery for integrity > > checking, but if all applications need to do so, then they are highly > > unlikely to want to adopt this attribute. > > > > > > Absolutely. That is the downside of this approach, but the priority here has > always been to improve nfsd. If we don't get the ability to present this info via > statx, then so be it. Later on, I suppose we can move that handling into the > kernel in some fashion if we decide it's worthwhile. > > That said, not having this in statx makes it more difficult to test i_version > behavior. Maybe we can add a generic ioctl for that in the interim? Having i_version in statx would be really nice for nfs-ganesha. I would consider doing the extra integrity stuff or we may in some cases be able to rely on the filesystem, but in any case, i_version would be an improvement over using ctime or mtime for a change attribute. Frank