On Tue, 22 Jun 2021, Wang Yugui wrote: > > > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down(). > > btrfs subvol crossmnt begin to work, although buggy. > > some subvol is crossmnt-ed, some subvol is yet not, and some dir is > wrongly crossmnt-ed > > 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin > to happen. > > This is the current patch based on 5.10.44. > At least nfsd_follow_up() is buggy. > I don't think the approach you are taking makes sense. Let me explain why. The problem is that applications on the NFS client can see different files or directories on the same (apparent) filesystem with the same inode number. Most application won't care and NFS itself doesn't get confused by the duplicate inode numbers, but 'find' and similar programs (probably 'tar' for example) do get upset. This happens because BTRFS reuses inode numbers in subvols which it presents to the kernel as all part of the one filesystem (or at least, all part of the one mount point). NFSD only sees one filesystem, and so reports the same filesystem-id (fsid) for all objects. The NFS client then sees that the fsid is the same and tells applications that the objects are all in the one filesystem. To fix this, we need to make sure that nfsd reports a different fsid for objects in different subvols. There are two obvious ways to do this. One is to teach nfsd to recognize btrfs subvolumes exactly like separate filesystems (as nfsd already ensure each filesystem gets its own fsid). This is the approach of my first suggestion. It requires changing nfsd_mountpoint() and follow_up() and any other code that is aware of different filesytems. As I mentioned, it also requires changing mountd to be able to extract a list of subvols from btrfs because they don't appear in /proc/mounts. As you might know an NFS filehandle has 3 parts: a header, a filesystem identifier, and an inode identifier. This approach would involve giving different subvols different filesystem identifiers in the filehandle. This, it turns out is a very big change - bigger than I at first imagined. The second obvious approach is to leave the filehandles unchanged and to continue to treat an entire btrfs filesystem as a single filesystem EXCEPT when reporting the fsid to the NFS client. All we *really* need to do is make sure the client sees a different fsid when it enters a part of the filesystem which re-uses inode numbers. This is what my latest patch did. Your patch seems to combine ideas from both approaches. It includes my code to replace the fsid, but also intercepts follow_up etc. This cannot be useful. As I noted when I posted it, there is a problem with my patch. I now understand that problem. When NFS sees that fsid change it needs to create 2 inodes for that directory. One inode will be in the parent filesystem and will be marked as an auto-mount point so that any lookup below that directory will trigger an internal mount. The other inode is the root of the child filesystem. It gets mounted on the first inode. With normal filesystem mounts, there really is an inode in the parent filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID attribute. This fileid will be different from all other inode numbers in the parent filesystem. With BTRFS there is no inode in the parent volume (as far as I know) so there is nothing useful to return for MOUNTED_ON_FILEID. This results in NFS using the same inode number for the inode in the parent filesystem as the inode in the child filesystem. For btrfs, this will be 256. As there is already an inode in the parent filesystem with inum 256, 'find' complains. The following patch addresses this by adding code to nfsd when it determines MOUINTD_ON_FILEID to choose an number that should be unused in btrfs. With this change, 'find' seems to work correctly with NFSv4 mounts of btrfs. This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID attribute - strictly speaking, the NFSv3 protocol doesn't support crossing mount points, though the Linux implementation does allow it. So this patch works and, I think, is the best we can do in terms of functionality. I don't like the details of the implementation though. It requires NFSD to know too much about BTRFS internals. I think I would like btrfs to make it clear where a subvol started, maybe by setting DCACHE_MOUNTED on the dentry. This flag is only a hint, not a promise of anything, so other code should get confused. This would have nfsd calling vfs_statfs quite so often .... maybe that isn't such a big deal. More importantly, there needs to be some way for NFSD to find an inode number to report for the MOUNTED_ON_FILEID. This needs to be a number not used elsewhere in the filesystem. It might be safe to use the same fileid for all subvols (as my patch currently does), but we would need to confirm that 'find' and 'tar' don't complain about that or mishandle it. If it is safe to use the same fileid, then a new field in the superblock to store it might work. If a different fileid is needed, the we might need a new field in 'struct kstatfs', so vfs_statfs can report it. Anyway, here is my current patch. It includes support for NFSv3 as well as NFSv4. NeilBrown diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c index 9421dae22737..790a3357525d 100644 --- a/fs/nfsd/export.c +++ b/fs/nfsd/export.c @@ -15,6 +15,7 @@ #include <linux/slab.h> #include <linux/namei.h> #include <linux/module.h> +#include <linux/statfs.h> #include <linux/exportfs.h> #include <linux/sunrpc/svc_xprt.h> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen) int err; struct auth_domain *dom = NULL; struct svc_export exp = {}, *expp; + struct kstatfs statfs; int an_int; if (mesg[mlen-1] != '\n') @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen) err = kern_path(buf, 0, &exp.ex_path); if (err) goto out1; + err = vfs_statfs(&exp.ex_path, &statfs); + if (err) + goto out3; + exp.ex_fsid64 = statfs.f_fsid; exp.ex_client = dom; exp.cd = cd; @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem) new->ex_anon_uid = item->ex_anon_uid; new->ex_anon_gid = item->ex_anon_gid; new->ex_fsid = item->ex_fsid; + new->ex_fsid64 = item->ex_fsid64; new->ex_devid_map = item->ex_devid_map; item->ex_devid_map = NULL; new->ex_uuid = item->ex_uuid; diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h index ee0e3aba4a6e..d3eb9a599918 100644 --- a/fs/nfsd/export.h +++ b/fs/nfsd/export.h @@ -68,6 +68,7 @@ struct svc_export { kuid_t ex_anon_uid; kgid_t ex_anon_gid; int ex_fsid; + __kernel_fsid_t ex_fsid64; unsigned char * ex_uuid; /* 16 byte fsid */ struct nfsd4_fs_locations ex_fslocs; uint32_t ex_nflavors; diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c index 0a5ebc52e6a9..f11ba3434fd6 100644 --- a/fs/nfsd/nfs3xdr.c +++ b/fs/nfsd/nfs3xdr.c @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, case FSIDSOURCE_FSID: fsid = (u64)fhp->fh_export->ex_fsid; break; - case FSIDSOURCE_UUID: + case FSIDSOURCE_UUID: { + struct kstatfs statfs; + fsid = ((u64 *)fhp->fh_export->ex_uuid)[0]; fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1]; + if (fh_getstafs(fhp, &statfs) == 0 && + (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] || + statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1])) + /* looks like a btrfs subvol */ + fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1]; break; + } default: fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev); } diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c index 7abeccb975b2..5f614d1b362e 100644 --- a/fs/nfsd/nfs4xdr.c +++ b/fs/nfsd/nfs4xdr.c @@ -42,6 +42,7 @@ #include <linux/sunrpc/svcauth_gss.h> #include <linux/sunrpc/addr.h> #include <linux/xattr.h> +#include <linux/btrfs_tree.h> #include <uapi/linux/xattr.h> #include "idmap.h" @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, if (err) goto out_nfserr; if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE | + FATTR4_WORD0_FSID | FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) || (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE | + FATTR4_WORD1_MOUNTED_ON_FILEID | FATTR4_WORD1_SPACE_TOTAL))) { err = vfs_statfs(&path, &statfs); if (err) @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, case FSIDSOURCE_UUID: p = xdr_encode_opaque_fixed(p, exp->ex_uuid, EX_UUID_LEN); + if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] || + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) { + /* looks like a btrfs subvol */ + p[-2] ^= statfs.f_fsid.val[0]; + p[-1] ^= statfs.f_fsid.val[1]; + } break; } } @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, goto out_nfserr; ino = parent_stat.ino; } + if (fsid_source(fhp) == FSIDSOURCE_UUID && + (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] || + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1])) + /* btrfs subvol pseudo mount point */ + ino = BTRFS_FIRST_FREE_OBJECTID-1; + p = xdr_encode_hyper(p, ino); } #ifdef CONFIG_NFSD_PNFS diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h index b21b76e6b9a8..82b76b0b7bec 100644 --- a/fs/nfsd/vfs.h +++ b/fs/nfsd/vfs.h @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat) AT_STATX_SYNC_AS_STAT)); } +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs) +{ + struct path p = {.mnt = fh->fh_export->ex_path.mnt, + .dentry = fh->fh_dentry}; + return nfserrno(vfs_statfs(&p, statfs)); +} + static inline int nfsd_create_is_exclusive(int createmode) { return createmode == NFS3_CREATE_EXCLUSIVE