On Mon, Nov 16, 2020 at 10:29:29AM -0500, Jeff Layton wrote: > On Fri, 2020-11-13 at 17:26 -0500, bfields wrote: > > On Fri, Nov 13, 2020 at 09:50:50AM -0500, bfields wrote: > > > On Thu, Nov 12, 2020 at 11:05:57PM +0000, Daire Byrne wrote: > > > > So, I can't lay claim to identifying the exact optimisation/hack that > > > > improves the retention of the re-export server's client cache when > > > > re-exporting an NFSv3 server (which is then read by many clients). We > > > > were working with an engineer at the time who showed an interest in > > > > our use case and after we supplied a reproducer he suggested modifying > > > > the nfs/inode.c > > > > > > > > - if (!inode_eq_iversion_raw(inode, fattr->change_attr)) { > > > > + if (inode_peek_iversion_raw(inode) < fattr->change_attr) > > > > { > > > > > > > > His reasoning at the time was: > > > > > > > > "Fixes inode invalidation caused by read access. The least important > > > > bit is ORed with 1 and causes the inode version to differ from the one > > > > seen on the NFS share. This in turn causes unnecessary re-download > > > > impacting the performance significantly. This fix makes it only > > > > re-fetch file content if inode version seen on the server is newer > > > > than the one on the client." > > > > > > > > But I've always been puzzled by why this only seems to be the case > > > > when using knfsd to re-export the (NFSv3) client mount. Using multiple > > > > processes on a standard client mount never causes any similar > > > > re-validations. And this happens with a completely read-only share > > > > which is why I started to think it has something to do with atimes as > > > > that could perhaps still cause a "write" modification even when > > > > read-only? > > > > > > Ah-hah! So, it's inode_query_iversion() that's modifying a nfs inode's > > > i_version. That's a special thing that only nfsd would do. > > > > > > I think that's totally fixable, we'll just have to think a little about > > > how.... > > > > I wonder if something like this helps?--b. > > > > commit 0add88a9ccc5 > > Author: J. Bruce Fields <bfields@xxxxxxxxxx> > > Date: Fri Nov 13 17:03:04 2020 -0500 > > > > nfs: don't mangle i_version on NFS > > > > > > The i_version on NFS has pretty much opaque to the client, so we don't > > want to give the low bit any special interpretation. > > > > > > Define a new FS_PRIVATE_I_VERSION flag for filesystems that manage the > > i_version on their own. > > > > > > Signed-off-by: J. Bruce Fields <bfields@xxxxxxxxxx> > > > > diff --git a/fs/nfs/fs_context.c b/fs/nfs/fs_context.c > > index 29ec8b09a52d..9b8dd5b713a7 100644 > > --- a/fs/nfs/fs_context.c > > +++ b/fs/nfs/fs_context.c > > @@ -1488,7 +1488,8 @@ struct file_system_type nfs_fs_type = { > > .init_fs_context = nfs_init_fs_context, > > .parameters = nfs_fs_parameters, > > .kill_sb = nfs_kill_super, > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA, > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA| > > + FS_PRIVATE_I_VERSION, > > }; > > MODULE_ALIAS_FS("nfs"); > > EXPORT_SYMBOL_GPL(nfs_fs_type); > > @@ -1500,7 +1501,8 @@ struct file_system_type nfs4_fs_type = { > > .init_fs_context = nfs_init_fs_context, > > .parameters = nfs_fs_parameters, > > .kill_sb = nfs_kill_super, > > - .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA, > > + .fs_flags = FS_RENAME_DOES_D_MOVE|FS_BINARY_MOUNTDATA| > > + FS_PRIVATE_I_VERSION, > > }; > > MODULE_ALIAS_FS("nfs4"); > > MODULE_ALIAS("nfs4"); > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 21cc971fd960..c5bb4268228b 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -2217,6 +2217,7 @@ struct file_system_type { > > #define FS_HAS_SUBTYPE 4 > > #define FS_USERNS_MOUNT 8 /* Can be mounted by userns root */ > > #define FS_DISALLOW_NOTIFY_PERM 16 /* Disable fanotify permission events */ > > +#define FS_PRIVATE_I_VERSION 32 /* i_version managed by filesystem */ > > #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */ > > #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */ > > int (*init_fs_context)(struct fs_context *); > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h > > index 2917ef990d43..52c790a847de 100644 > > --- a/include/linux/iversion.h > > +++ b/include/linux/iversion.h > > @@ -307,6 +307,8 @@ inode_query_iversion(struct inode *inode) > > u64 cur, old, new; > > > > > > cur = inode_peek_iversion_raw(inode); > > + if (inode->i_sb->s_type->fs_flags & FS_PRIVATE_I_VERSION) > > + return cur; > > for (;;) { > > /* If flag is already set, then no need to swap */ > > if (cur & I_VERSION_QUERIED) { > > > It's probably more correct to just check the already-existing > SB_I_VERSION flag here So the check would be if (!IS_I_VERSION(inode)) return cur; ? > (though in hindsight a fstype flag might have made more sense). I_VERSION support can vary by superblock (for example, xfs supports it or not depending on on-disk format version). --b.