On Wed, 2022-10-05 at 13:34 +0000, Trond Myklebust wrote: > On Wed, 2022-10-05 at 06:06 -0400, Jeff Layton wrote: > > On Tue, 2022-10-04 at 10:39 +1100, NeilBrown wrote: > > > On Fri, 30 Sep 2022, Jeff Layton wrote: > > > > Now that we can call into vfs_getattr to get the i_version field, > > > > use > > > > that facility to fetch it instead of doing it in > > > > nfsd4_change_attribute. > > > > > > > > Neil also pointed out recently that IS_I_VERSION directory > > > > operations > > > > are always logged, and so we only need to mitigate the rollback > > > > problem > > > > on regular files. Also, we don't need to factor in the ctime when > > > > reexporting NFS or Ceph. > > > > > > > > Set the STATX_VERSION (and BTIME) bits in the request when we're > > > > dealing > > > > with a v4 request. Then, instead of looking at IS_I_VERSION when > > > > generating the change attr, look at the result mask and only use > > > > it if > > > > STATX_VERSION is set. With this change, we can drop the > > > > fetch_iversion > > > > export operation as well. > > > > > > > > Move nfsd4_change_attribute into nfsfh.c, and change it to only > > > > factor > > > > in the ctime if it's a regular file and the fs doesn't advertise > > > > STATX_ATTR_VERSION_MONOTONIC. > > > > > > > > Signed-off-by: Jeff Layton <jlayton@xxxxxxxxxx> > > > > --- > > > > fs/nfs/export.c | 7 ------- > > > > fs/nfsd/nfs4xdr.c | 4 +++- > > > > fs/nfsd/nfsfh.c | 40 > > > > ++++++++++++++++++++++++++++++++++++++++ > > > > fs/nfsd/nfsfh.h | 29 +---------------------------- > > > > fs/nfsd/vfs.h | 7 ++++++- > > > > include/linux/exportfs.h | 1 - > > > > 6 files changed, 50 insertions(+), 38 deletions(-) > > > > > > > > diff --git a/fs/nfs/export.c b/fs/nfs/export.c > > > > index 01596f2d0a1e..1a9d5aa51dfb 100644 > > > > --- a/fs/nfs/export.c > > > > +++ b/fs/nfs/export.c > > > > @@ -145,17 +145,10 @@ nfs_get_parent(struct dentry *dentry) > > > > return parent; > > > > } > > > > > > > > -static u64 nfs_fetch_iversion(struct inode *inode) > > > > -{ > > > > - nfs_revalidate_inode(inode, NFS_INO_INVALID_CHANGE); > > > > - return inode_peek_iversion_raw(inode); > > > > -} > > > > - > > > > const struct export_operations nfs_export_ops = { > > > > .encode_fh = nfs_encode_fh, > > > > .fh_to_dentry = nfs_fh_to_dentry, > > > > .get_parent = nfs_get_parent, > > > > - .fetch_iversion = nfs_fetch_iversion, > > > > .flags = EXPORT_OP_NOWCC|EXPORT_OP_NOSUBTREECHK| > > > > EXPORT_OP_CLOSE_BEFORE_UNLINK|EXPORT_OP_REMOTE_FS > > > > > > > > > EXPORT_OP_NOATOMIC_ATTR, > > > > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > > > > index 1e9690a061ec..779c009314c6 100644 > > > > --- a/fs/nfsd/nfs4xdr.c > > > > +++ b/fs/nfsd/nfs4xdr.c > > > > @@ -2869,7 +2869,9 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, > > > > struct svc_fh *fhp, > > > > goto out; > > > > } > > > > > > > > - err = vfs_getattr(&path, &stat, STATX_BASIC_STATS, > > > > AT_STATX_SYNC_AS_STAT); > > > > + err = vfs_getattr(&path, &stat, > > > > + STATX_BASIC_STATS | STATX_BTIME | > > > > STATX_VERSION, > > > > + AT_STATX_SYNC_AS_STAT); > > > > if (err) > > > > goto out_nfserr; > > > > if (!(stat.result_mask & STATX_BTIME)) > > > > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c > > > > index a5b71526cee0..9168bc657378 100644 > > > > --- a/fs/nfsd/nfsfh.c > > > > +++ b/fs/nfsd/nfsfh.c > > > > @@ -634,6 +634,10 @@ void fh_fill_pre_attrs(struct svc_fh *fhp) > > > > stat.mtime = inode->i_mtime; > > > > stat.ctime = inode->i_ctime; > > > > stat.size = inode->i_size; > > > > + if (v4 && IS_I_VERSION(inode)) { > > > > + stat.version = > > > > inode_query_iversion(inode); > > > > + stat.result_mask |= STATX_VERSION; > > > > + } > > > > > > This is increasingly ugly. I wonder if it is justified at all... > > > > > > > I'm fine with dropping that. So if the getattrs fail, we should just > > not > > offer up pre/post attrs? > > > > > > } > > > > if (v4) > > > > fhp->fh_pre_change = > > > > nfsd4_change_attribute(&stat, inode); > > > > @@ -665,6 +669,8 @@ void fh_fill_post_attrs(struct svc_fh *fhp) > > > > if (err) { > > > > fhp->fh_post_saved = false; > > > > fhp->fh_post_attr.ctime = inode->i_ctime; > > > > + if (v4 && IS_I_VERSION(inode)) > > > > + fhp->fh_post_attr.version = > > > > inode_query_iversion(inode); > > > > > > ... ditto ... > > > > > > > } else > > > > fhp->fh_post_saved = true; > > > > if (v4) > > > > @@ -754,3 +760,37 @@ enum fsid_source fsid_source(const struct > > > > svc_fh *fhp) > > > > return FSIDSOURCE_UUID; > > > > return FSIDSOURCE_DEV; > > > > } > > > > + > > > > +/* > > > > + * We could use i_version alone as the change attribute. > > > > However, i_version > > > > + * can go backwards on a regular file after an unclean > > > > shutdown. On its own > > > > + * that doesn't necessarily cause a problem, but if i_version > > > > goes backwards > > > > + * and then is incremented again it could reuse a value that was > > > > previously > > > > + * used before boot, and a client who queried the two values > > > > might incorrectly > > > > + * assume nothing changed. > > > > + * > > > > + * By using both ctime and the i_version counter we guarantee > > > > that as long as > > > > + * time doesn't go backwards we never reuse an old value. If the > > > > filesystem > > > > + * advertises STATX_ATTR_VERSION_MONOTONIC, then this mitigation > > > > is not needed. > > > > + * > > > > + * We only need to do this for regular files as well. For > > > > directories, we > > > > + * assume that the new change attr is always logged to stable > > > > storage in some > > > > + * fashion before the results can be seen. > > > > + */ > > > > +u64 nfsd4_change_attribute(struct kstat *stat, struct inode > > > > *inode) > > > > +{ > > > > + u64 chattr; > > > > + > > > > + if (stat->result_mask & STATX_VERSION) { > > > > + chattr = stat->version; > > > > + > > > > + if (S_ISREG(inode->i_mode) && > > > > + !(stat->attributes & > > > > STATX_ATTR_VERSION_MONOTONIC)) { > > > > > > I would really rather that the fs got to make this decision. > > > If it can guarantee that the i_version is monotonic even over a > > > crash > > > (which is probably can for directory, and might need changes to do > > > for > > > files) then it sets STATX_ATTR_VERSION_MONOTONIC and nfsd trusts it > > > completely. > > > If it cannot, then it doesn't set the flag. > > > i.e. the S_ISREG() test should be in the filesystem, not in nfsd. > > > > > > > This sounds reasonable, but for one thing. > > > > From RFC 7862: > > > > While Section 5.4 of [RFC5661] discusses > > per-file system attributes, it is expected that the value of > > change_attr_type will not depend on the value of "homogeneous" and > > will only change in the event of a migration. > > > > The change_attr_type4 must be the same for all filehandles under a > > particular filesystem. > > > > If we do what you suggest though, then it's easily possible for the > > fs > > to set STATX_ATTR_VERSION_MONOTONIC on directories but not files. If > > we > > later want to allow nfsd to advertise a change_attr_type4, we won't > > be > > able to rely on the STATX_ATTR_VERSION_MONOTONIC to tell us how to > > fill > > that out. > > That will break clients. So no, that's not acceptable. > Yeah. This is why I mentioned that this flag would have been better advertised via fsinfo(), had that been a thing. One option is to just document that an fs must advertise the same flag value for all inodes. Alternately, we could allow the fs to set the STATX_ATTR_* flag with per-inode granularity, and for nfsd, just add a new change_attr_type() op to export_operations. Most filesystems would just have that return a hardcoded value, but an nfs reexport could just pass through whatever value it got from the server. -- Jeff Layton <jlayton@xxxxxxxxxx>