On Thu, Aug 26, 2021 at 04:03:04PM +1000, NeilBrown wrote: > > [[ Hi Bruce and Chuck, > I've rebased this patch on the earlier patch I sent which allows > me to use the name "fh_flags". I've also added a missing #include. > I've added the 'Acked-by' which Joesf provided earlier for the > btrfs-part. I don't have an 'ack' for the stat.h part, but no-one > has complained wither. > I think it is as ready as it can be, and am keen to know what you > think. > I'm not *very* keen on testing s_magic in nfsd code (though we > already have a couple of such tests in nfs3proc.c), but it does have > the advantage of ensuring that no other filesystem can use this > functionality without landing a patch in fs/nfsd/. > > Thanks for any review that you can provide, > NeilBrown > ]] This seems hairy, but *somebody* has hated every single solution proposed, so, argh, I don't know, maybe it's best. There was a ton of "but why can't we just..." in previous threads, could we include URLs for those and/or the lwn articles? E.g.: https://lore.kernel.org/linux-nfs/162742539595.32498.13687924366155737575.stgit@noble.brown/#b https://lwn.net/Articles/866709/ > BTRFS does not provide unique inode numbers across a filesystem. > It only provides unique inode numbers within a subvolume and > uses synthetic device numbers for different subvolumes to ensure > uniqueness for device+inode. > > nfsd cannot use these varying synthetic device numbers. If nfsd were to > synthesise different stable filesystem ids to give to the client, that > would cause subvolumes to appear in the mount table on the client, even > though they don't appear in the mount table on the server. Also, NFSv3 > doesn't support changing the filesystem id without a new explicit mount > on the client (this is partially supported in practice, but violates the > protocol specification and has problems in some edge cases). > > So currently, the roots of all subvolumes report the same inode number > in the same filesystem to NFS clients and tools like 'find' notice that > a directory has the same identity as an ancestor, and so refuse to > enter that directory. > > This patch allows btrfs (or any filesystem) to provide a 64bit number > that can be xored with the inode number to make the number more unique. > Rather than the client being certain to see duplicates, with this patch > it is possible but extremely rare. > > The number that btrfs provides is a swab64() version of the subvolume > identifier. This has most entropy in the high bits (the low bits of the > subvolume identifer), while the inode has most entropy in the low bits. > The result will always be unique within a subvolume, and will almost > always be unique across the filesystem. > > If an upgrade of the NFS server caused all inode numbers in an exportfs > BTRFS filesystem to appear to the client to change, the client may not > handle this well. The Linux client will cause any open files to become > 'stale'. If the mount point changed inode number, the whole mount would > become inaccessible. > > To avoid this, an unused byte in the filehandle (fh_auth) has been > repurposed as "fh_flags". The new behaviour of uniquifying inode number > is only activated when a new flag is set. > > NFSD will only set this flag in filehandles it reports if the filehandle > of the parent (provided by the client) contains the bit, or if > - the filehandle for the parent is not provided or is for a different > export and > - the filehandle refers to a BTRFS filesystem. > > Thus if you have a BTRFS filesystem originally mounted from a server > without this patch, the flag will never be set and the current behaviour > will continue. Only once you re-mount the filesystem (or the filesystem > is re-auto-mounted) will the inode numbers change. When that happens, > it is likely that the filesystem st_dev number seen on the client will > change anyway. > > Acked-by: Josef Bacik <josef@xxxxxxxxxxxxxx> (for BTFS change) s/BTFS/BTRFS/. > Signed-off-by: NeilBrown <neilb@xxxxxxx> > --- > fs/btrfs/inode.c | 4 ++++ > fs/nfsd/nfs3xdr.c | 15 ++++++++++++++- > fs/nfsd/nfs4xdr.c | 7 ++++--- > fs/nfsd/nfsfh.c | 14 ++++++++++++-- > fs/nfsd/nfsfh.h | 34 +++++++++++++++++++++++++++++++--- > fs/nfsd/xdr3.h | 2 ++ > include/linux/stat.h | 18 ++++++++++++++++++ > 7 files changed, 85 insertions(+), 9 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 06f9f167222b..d35e2a30f25f 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9195,6 +9195,10 @@ static int btrfs_getattr(struct user_namespace *mnt_userns, > generic_fillattr(&init_user_ns, inode, stat); > stat->dev = BTRFS_I(inode)->root->anon_dev; > > + if (BTRFS_I(inode)->root->root_key.objectid != BTRFS_FS_TREE_OBJECTID) > + stat->ino_uniquifier = > + swab64(BTRFS_I(inode)->root->root_key.objectid); > + > spin_lock(&BTRFS_I(inode)->lock); > delalloc_bytes = BTRFS_I(inode)->new_delalloc_bytes; > inode_bytes = inode_get_bytes(inode); > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c > index 3d37923afb06..5e2d5c352ecd 100644 > --- a/fs/nfsd/nfs3xdr.c > +++ b/fs/nfsd/nfs3xdr.c > @@ -340,6 +340,7 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > { > struct user_namespace *userns = nfsd_user_namespace(rqstp); > __be32 *p; > + u64 ino; > u64 fsid; > > p = xdr_reserve_space(xdr, XDR_UNIT * 21); > @@ -377,7 +378,8 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > p = xdr_encode_hyper(p, fsid); > > /* fileid */ > - p = xdr_encode_hyper(p, stat->ino); > + ino = nfsd_uniquify_ino(fhp, stat); > + p = xdr_encode_hyper(p, ino); > > p = encode_nfstime3(p, &stat->atime); > p = encode_nfstime3(p, &stat->mtime); > @@ -1151,6 +1153,17 @@ svcxdr_encode_entry3_common(struct nfsd3_readdirres *resp, const char *name, > if (xdr_stream_encode_item_present(xdr) < 0) > return false; > /* fileid */ > + if (!resp->dir_have_uniquifier) { > + struct kstat stat; > + if (fh_getattr(&resp->fh, &stat) == nfs_ok) > + resp->dir_ino_uniquifier = > + nfsd_ino_uniquifier(&resp->fh, &stat); > + else > + resp->dir_ino_uniquifier = 0; > + resp->dir_have_uniquifier = true; This took me a minute. So we're assuming the uniquifier stays the same across a directory and its children (because you can't hard link across subvolumes), and this code is just caching the uniquifier for use across the directory--is that right? > + } > + if (resp->dir_ino_uniquifier != ino) > + ino ^= resp->dir_ino_uniquifier; I guess this check (here and in nfsd_uniquify_ino) is just to prevent returning inode number zero? --b. > if (xdr_stream_encode_u64(xdr, ino) < 0) > return false; > /* name */ > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index a54b2845473b..6e31f6286e0b 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -3114,10 +3114,11 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > fhp->fh_handle.fh_size); > } > if (bmval0 & FATTR4_WORD0_FILEID) { > + u64 ino = nfsd_uniquify_ino(fhp, &stat); > p = xdr_reserve_space(xdr, 8); > if (!p) > goto out_resource; > - p = xdr_encode_hyper(p, stat.ino); > + p = xdr_encode_hyper(p, ino); > } > if (bmval0 & FATTR4_WORD0_FILES_AVAIL) { > p = xdr_reserve_space(xdr, 8); > @@ -3274,7 +3275,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > > p = xdr_reserve_space(xdr, 8); > if (!p) > - goto out_resource; > + goto out_resource; > /* > * Get parent's attributes if not ignoring crossmount > * and this is the root of a cross-mounted filesystem. > @@ -3284,7 +3285,7 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > err = get_parent_attributes(exp, &parent_stat); > if (err) > goto out_nfserr; > - ino = parent_stat.ino; > + ino = nfsd_uniquify_ino(fhp, &parent_stat); > } > p = xdr_encode_hyper(p, ino); > } > diff --git a/fs/nfsd/nfsfh.c b/fs/nfsd/nfsfh.c > index 7695c0f1eefe..18bb139f8bfe 100644 > --- a/fs/nfsd/nfsfh.c > +++ b/fs/nfsd/nfsfh.c > @@ -9,6 +9,7 @@ > */ > > #include <linux/exportfs.h> > +#include <linux/magic.h> > > #include <linux/sunrpc/svcauth_gss.h> > #include "nfsd.h" > @@ -173,7 +174,7 @@ static __be32 nfsd_set_fh_dentry(struct svc_rqst *rqstp, struct svc_fh *fhp) > > if (--data_left < 0) > return error; > - if (fh->fh_auth_type != 0) > + if ((fh->fh_flags & ~NFSD_FH_FLAG_ALL) != 0) > return error; > len = key_len(fh->fh_fsid_type) / 4; > if (len == 0) > @@ -532,6 +533,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry, > > struct inode * inode = d_inode(dentry); > dev_t ex_dev = exp_sb(exp)->s_dev; > + u8 flags = 0; > > dprintk("nfsd: fh_compose(exp %02x:%02x/%ld %pd2, ino=%ld)\n", > MAJOR(ex_dev), MINOR(ex_dev), > @@ -548,6 +550,14 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry, > /* If we have a ref_fh, then copy the fh_no_wcc setting from it. */ > fhp->fh_no_wcc = ref_fh ? ref_fh->fh_no_wcc : false; > > + if (ref_fh && ref_fh->fh_export == exp) { > + flags = ref_fh->fh_handle.fh_flags; > + } else { > + /* Set flags as needed */ > + if (exp->ex_path.mnt->mnt_sb->s_magic == BTRFS_SUPER_MAGIC) > + flags |= NFSD_FH_FLAG_INO_UNIQUIFY; > + } > + > if (ref_fh == fhp) > fh_put(ref_fh); > > @@ -565,7 +575,7 @@ fh_compose(struct svc_fh *fhp, struct svc_export *exp, struct dentry *dentry, > > fhp->fh_handle.fh_size = > key_len(fhp->fh_handle.fh_fsid_type) + 4; > - fhp->fh_handle.fh_auth_type = 0; > + fhp->fh_handle.fh_flags = flags; > > mk_fsid(fhp->fh_handle.fh_fsid_type, > fhp->fh_handle.fh_fsid, > diff --git a/fs/nfsd/nfsfh.h b/fs/nfsd/nfsfh.h > index f36234c474dc..bbc7ddd34143 100644 > --- a/fs/nfsd/nfsfh.h > +++ b/fs/nfsd/nfsfh.h > @@ -18,11 +18,17 @@ > * The file handle starts with a sequence of four-byte words. > * The first word contains a version number (1) and three descriptor bytes > * that tell how the remaining 3 variable length fields should be handled. > - * These three bytes are auth_type, fsid_type and fileid_type. > + * These three bytes are flags, fsid_type and fileid_type. > * > * All four-byte values are in host-byte-order. > * > - * The auth_type field is deprecated and must be set to 0. > + * The flags field (previously auth_type) can be used when nfsd behaviour > + * needs to change in a non-compatible way, usually for some specific > + * filesystem. Flags should only be set in filehandles for filesystems which > + * need them. > + * Current values: > + * 1 - BTRFS only. Cause stat->ino_uniquifier to be used to improve inode > + * number uniqueness. > * > * The fsid_type identifies how the filesystem (or export point) is > * encoded. > @@ -53,7 +59,7 @@ struct knfsd_fh { > __u32 fh_raw[NFS4_FHSIZE/4]; > struct { > __u8 fh_version; /* == 1 */ > - __u8 fh_auth_type; /* deprecated */ > + __u8 fh_flags; > __u8 fh_fsid_type; > __u8 fh_fileid_type; > __u32 fh_fsid[]; /* flexible-array member */ > @@ -131,6 +137,28 @@ enum fsid_source { > }; > extern enum fsid_source fsid_source(const struct svc_fh *fhp); > > +enum nfsd_fh_flags { > + NFSD_FH_FLAG_INO_UNIQUIFY = 1, /* BTRFS only */ > + > + NFSD_FH_FLAG_ALL = 1 > +}; > + > +static inline u64 nfsd_ino_uniquifier(const struct svc_fh *fhp, > + const struct kstat *stat) > +{ > + if (fhp->fh_handle.fh_flags & NFSD_FH_FLAG_INO_UNIQUIFY) > + return stat->ino_uniquifier; > + return 0; > +} > + > +static inline u64 nfsd_uniquify_ino(const struct svc_fh *fhp, > + const struct kstat *stat) > +{ > + u64 u = nfsd_ino_uniquifier(fhp, stat); > + if (u != stat->ino) > + return stat->ino ^ u; > + return stat->ino; > +} > > /* > * This might look a little large to "inline" but in all calls except > diff --git a/fs/nfsd/xdr3.h b/fs/nfsd/xdr3.h > index 933008382bbe..d9b6c8314bbb 100644 > --- a/fs/nfsd/xdr3.h > +++ b/fs/nfsd/xdr3.h > @@ -179,6 +179,8 @@ struct nfsd3_readdirres { > struct xdr_buf dirlist; > struct svc_fh scratch; > struct readdir_cd common; > + u64 dir_ino_uniquifier; > + bool dir_have_uniquifier; > unsigned int cookie_offset; > struct svc_rqst * rqstp; > > diff --git a/include/linux/stat.h b/include/linux/stat.h > index fff27e603814..0f3f74d302f8 100644 > --- a/include/linux/stat.h > +++ b/include/linux/stat.h > @@ -46,6 +46,24 @@ struct kstat { > struct timespec64 btime; /* File creation time */ > u64 blocks; > u64 mnt_id; > + /* > + * BTRFS does not provide unique inode numbers within a filesystem, > + * depending on a synthetic 'dev' to provide uniqueness. > + * NFSd cannot make use of this 'dev' number so clients often see > + * duplicate inode numbers. > + * For BTRFS, 'ino' is unlikely to use the high bits until the filesystem > + * has created a great many inodes. > + * It puts another number in ino_uniquifier which: > + * - has most entropy in the high bits > + * - is different precisely when 'dev' is different > + * - is stable across unmount/remount > + * NFSd can xor this with 'ino' to get a substantially more unique > + * number for reporting to the client. > + * The ino_uniquifier for a directory can reasonably be applied > + * to inode numbers reported by the readdir filldir callback. > + * It is NOT currently exported to user-space. > + */ > + u64 ino_uniquifier; > }; > > #endif > -- > 2.32.0 > >