Hi, This patch works very well. Thanks a lot. - crossmnt of btrfs subvol works as expected. - nfs/umount subvol works well. - pseudo mount point inode(255) is good. I test it in 5.10.45 with a few minor rebase. ( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch, just fs/nfsd/nfs3xdr.c rebase) But when I tested it with another btrfs system without subvol but with more data, 'find /nfs/test' caused a OOPS . and this OOPS will not happen just without this patch. The data in this filesystem is created/left by xfstest(FSTYP=nfs, TEST_DEV). #nfs4 option: default mount.nfs4, nfs-utils-2.3.3 # access btrfs directly $ find /mnt/test | wc -l 6612 # access btrfs through nfs $ find /nfs/test | wc -l [ 466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004 [ 466.172123] #PF: supervisor read access in kernel mode [ 466.177857] #PF: error_code(0x0000) - not-present page [ 466.183601] PGD 0 P4D 0 [ 466.186443] Oops: 0000 [#1] SMP NOPTI [ 466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1 [ 466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019 [ 466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd] [ 466.212067] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83 [ 466.233061] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246 [ 466.238894] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00 [ 466.246872] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 466.254848] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001 [ 466.262825] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000 [ 466.270802] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018 [ 466.278779] FS: 0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000 [ 466.287823] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 466.294246] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0 [ 466.302222] Call Trace: [ 466.304970] nfsd4_encode_fattr+0x15ac/0x1940 [nfsd] [ 466.310557] ? btrfs_verify_level_key+0xad/0xf0 [btrfs] [ 466.316413] ? btrfs_search_slot+0x8e3/0x900 [btrfs] [ 466.321973] nfsd4_encode_dirent+0x160/0x3b0 [nfsd] [ 466.327434] nfsd_readdir+0x199/0x240 [nfsd] [ 466.332215] ? nfsd4_encode_getattr+0x30/0x30 [nfsd] [ 466.337771] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd] [ 466.343714] ? security_prepare_creds+0x6f/0xa0 [ 466.348788] nfsd4_encode_readdir+0xd9/0x1c0 [nfsd] [ 466.354250] nfsd4_encode_operation+0x9b/0x1b0 [nfsd] [ 466.360430] nfsd4_proc_compound+0x4e3/0x710 [nfsd] [ 466.366352] nfsd_dispatch+0xd4/0x180 [nfsd] [ 466.371620] svc_process_common+0x392/0x6c0 [sunrpc] [ 466.377650] ? svc_recv+0x3c4/0x8a0 [sunrpc] [ 466.382883] ? nfsd_svc+0x300/0x300 [nfsd] [ 466.387908] ? nfsd_destroy+0x60/0x60 [nfsd] [ 466.393126] svc_process+0xb7/0xf0 [sunrpc] [ 466.398234] nfsd+0xe8/0x140 [nfsd] [ 466.402555] kthread+0x116/0x130 [ 466.406579] ? kthread_park+0x80/0x80 [ 466.411091] ret_from_fork+0x1f/0x30 [ 466.415499] Modules linked in: acpi_ipmi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill intel_rapl_m sr intel_rapl_common iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas ipmi_ssif sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_si rapl intel_cstate mei_me ipmi_devintf intel_uncore j oydev mei ipmi_msghandler lpc_ich acpi_power_meter nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt nfsd rdma _rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl lockd grace nfs_ssc ip_tables xfs mgag200 drm_kms_helper crct10dif_pclmul crc32_pclmul btrfs cec crc32c_intel xor bnx2x raid6_pq drm igb mpt3sas ghash_ clmulni_intel pcspkr nvme megaraid_sas mdio nvme_core dca raid_class i2c_algo_bit scsi_transport_sas wmi dm_mu ltipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc i2c_dev [ 466.499551] CR2: 0000000000000004 [ 466.503759] ---[ end trace 91eb52bf0cb65801 ]--- [ 466.511948] RIP: 0010:fsid_source+0x7/0x50 [nfsd] [ 466.517714] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83 [ 466.539753] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246 [ 466.546122] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00 [ 466.554625] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000 [ 466.563096] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001 [ 466.571572] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000 [ 466.580024] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018 [ 466.588487] FS: 0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000 [ 466.598032] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 466.604973] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0 [ 466.613467] Kernel panic - not syncing: Fatal exception [ 466.807651] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffff fffbfffffff) [ 466.823190] ---[ end Kernel panic - not syncing: Fatal exception ]--- Best Regards Wang Yugui (wangyugui@xxxxxxxxxxxx) 2021/06/23 > On Tue, 22 Jun 2021, Wang Yugui wrote: > > > > > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down(). > > > > btrfs subvol crossmnt begin to work, although buggy. > > > > some subvol is crossmnt-ed, some subvol is yet not, and some dir is > > wrongly crossmnt-ed > > > > 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin > > to happen. > > > > This is the current patch based on 5.10.44. > > At least nfsd_follow_up() is buggy. > > > > I don't think the approach you are taking makes sense. Let me explain > why. > > The problem is that applications on the NFS client can see different > files or directories on the same (apparent) filesystem with the same > inode number. Most application won't care and NFS itself doesn't get > confused by the duplicate inode numbers, but 'find' and similar programs > (probably 'tar' for example) do get upset. > > This happens because BTRFS reuses inode numbers in subvols which it > presents to the kernel as all part of the one filesystem (or at least, > all part of the one mount point). NFSD only sees one filesystem, and so > reports the same filesystem-id (fsid) for all objects. The NFS client > then sees that the fsid is the same and tells applications that the > objects are all in the one filesystem. > > To fix this, we need to make sure that nfsd reports a different fsid for > objects in different subvols. There are two obvious ways to do this. > > One is to teach nfsd to recognize btrfs subvolumes exactly like separate > filesystems (as nfsd already ensure each filesystem gets its own fsid). > This is the approach of my first suggestion. It requires changing > nfsd_mountpoint() and follow_up() and any other code that is aware of > different filesytems. As I mentioned, it also requires changing mountd > to be able to extract a list of subvols from btrfs because they don't > appear in /proc/mounts. > > As you might know an NFS filehandle has 3 parts: a header, a filesystem > identifier, and an inode identifier. This approach would involve giving > different subvols different filesystem identifiers in the filehandle. > This, it turns out is a very big change - bigger than I at first > imagined. > > The second obvious approach is to leave the filehandles unchanged and to > continue to treat an entire btrfs filesystem as a single filesystem > EXCEPT when reporting the fsid to the NFS client. All we *really* need > to do is make sure the client sees a different fsid when it enters a > part of the filesystem which re-uses inode numbers. This is what my > latest patch did. > > Your patch seems to combine ideas from both approaches. It includes my > code to replace the fsid, but also intercepts follow_up etc. This > cannot be useful. > > As I noted when I posted it, there is a problem with my patch. I now > understand that problem. > > When NFS sees that fsid change it needs to create 2 inodes for that > directory. One inode will be in the parent filesystem and will be > marked as an auto-mount point so that any lookup below that directory > will trigger an internal mount. The other inode is the root of the > child filesystem. It gets mounted on the first inode. > > With normal filesystem mounts, there really is an inode in the parent > filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID > attribute. This fileid will be different from all other inode numbers > in the parent filesystem. > > With BTRFS there is no inode in the parent volume (as far as I know) so > there is nothing useful to return for MOUNTED_ON_FILEID. This results > in NFS using the same inode number for the inode in the parent > filesystem as the inode in the child filesystem. For btrfs, this will > be 256. As there is already an inode in the parent filesystem with inum > 256, 'find' complains. > > The following patch addresses this by adding code to nfsd when it > determines MOUINTD_ON_FILEID to choose an number that should be unused > in btrfs. With this change, 'find' seems to work correctly with NFSv4 > mounts of btrfs. > > This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID > attribute - strictly speaking, the NFSv3 protocol doesn't support > crossing mount points, though the Linux implementation does allow it. > > So this patch works and, I think, is the best we can do in terms of > functionality. I don't like the details of the implementation though. > It requires NFSD to know too much about BTRFS internals. > > I think I would like btrfs to make it clear where a subvol started, > maybe by setting DCACHE_MOUNTED on the dentry. This flag is only a > hint, not a promise of anything, so other code should get confused. > This would have nfsd calling vfs_statfs quite so often .... maybe that > isn't such a big deal. > > More importantly, there needs to be some way for NFSD to find an inode > number to report for the MOUNTED_ON_FILEID. This needs to be a number > not used elsewhere in the filesystem. It might be safe to use the > same fileid for all subvols (as my patch currently does), but we would > need to confirm that 'find' and 'tar' don't complain about that or > mishandle it. If it is safe to use the same fileid, then a new field in > the superblock to store it might work. If a different fileid is needed, > the we might need a new field in 'struct kstatfs', so vfs_statfs can > report it. > > Anyway, here is my current patch. It includes support for NFSv3 as well > as NFSv4. > > NeilBrown > > diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c > index 9421dae22737..790a3357525d 100644 > --- a/fs/nfsd/export.c > +++ b/fs/nfsd/export.c > @@ -15,6 +15,7 @@ > #include <linux/slab.h> > #include <linux/namei.h> > #include <linux/module.h> > +#include <linux/statfs.h> > #include <linux/exportfs.h> > #include <linux/sunrpc/svc_xprt.h> > > @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen) > int err; > struct auth_domain *dom = NULL; > struct svc_export exp = {}, *expp; > + struct kstatfs statfs; > int an_int; > > if (mesg[mlen-1] != '\n') > @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen) > err = kern_path(buf, 0, &exp.ex_path); > if (err) > goto out1; > + err = vfs_statfs(&exp.ex_path, &statfs); > + if (err) > + goto out3; > + exp.ex_fsid64 = statfs.f_fsid; > > exp.ex_client = dom; > exp.cd = cd; > @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem) > new->ex_anon_uid = item->ex_anon_uid; > new->ex_anon_gid = item->ex_anon_gid; > new->ex_fsid = item->ex_fsid; > + new->ex_fsid64 = item->ex_fsid64; > new->ex_devid_map = item->ex_devid_map; > item->ex_devid_map = NULL; > new->ex_uuid = item->ex_uuid; > diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h > index ee0e3aba4a6e..d3eb9a599918 100644 > --- a/fs/nfsd/export.h > +++ b/fs/nfsd/export.h > @@ -68,6 +68,7 @@ struct svc_export { > kuid_t ex_anon_uid; > kgid_t ex_anon_gid; > int ex_fsid; > + __kernel_fsid_t ex_fsid64; > unsigned char * ex_uuid; /* 16 byte fsid */ > struct nfsd4_fs_locations ex_fslocs; > uint32_t ex_nflavors; > diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c > index 0a5ebc52e6a9..f11ba3434fd6 100644 > --- a/fs/nfsd/nfs3xdr.c > +++ b/fs/nfsd/nfs3xdr.c > @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr, > case FSIDSOURCE_FSID: > fsid = (u64)fhp->fh_export->ex_fsid; > break; > - case FSIDSOURCE_UUID: > + case FSIDSOURCE_UUID: { > + struct kstatfs statfs; > + > fsid = ((u64 *)fhp->fh_export->ex_uuid)[0]; > fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1]; > + if (fh_getstafs(fhp, &statfs) == 0 && > + (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] || > + statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1])) > + /* looks like a btrfs subvol */ > + fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1]; > break; > + } > default: > fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev); > } > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index 7abeccb975b2..5f614d1b362e 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -42,6 +42,7 @@ > #include <linux/sunrpc/svcauth_gss.h> > #include <linux/sunrpc/addr.h> > #include <linux/xattr.h> > +#include <linux/btrfs_tree.h> > #include <uapi/linux/xattr.h> > > #include "idmap.h" > @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > if (err) > goto out_nfserr; > if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE | > + FATTR4_WORD0_FSID | > FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) || > (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE | > + FATTR4_WORD1_MOUNTED_ON_FILEID | > FATTR4_WORD1_SPACE_TOTAL))) { > err = vfs_statfs(&path, &statfs); > if (err) > @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > case FSIDSOURCE_UUID: > p = xdr_encode_opaque_fixed(p, exp->ex_uuid, > EX_UUID_LEN); > + if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] || > + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) { > + /* looks like a btrfs subvol */ > + p[-2] ^= statfs.f_fsid.val[0]; > + p[-1] ^= statfs.f_fsid.val[1]; > + } > break; > } > } > @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp, > goto out_nfserr; > ino = parent_stat.ino; > } > + if (fsid_source(fhp) == FSIDSOURCE_UUID && > + (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] || > + statfs.f_fsid.val[1] != exp->ex_fsid64.val[1])) > + /* btrfs subvol pseudo mount point */ > + ino = BTRFS_FIRST_FREE_OBJECTID-1; > + > p = xdr_encode_hyper(p, ino); > } > #ifdef CONFIG_NFSD_PNFS > diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h > index b21b76e6b9a8..82b76b0b7bec 100644 > --- a/fs/nfsd/vfs.h > +++ b/fs/nfsd/vfs.h > @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat) > AT_STATX_SYNC_AS_STAT)); > } > > +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs) > +{ > + struct path p = {.mnt = fh->fh_export->ex_path.mnt, > + .dentry = fh->fh_dentry}; > + return nfserrno(vfs_statfs(&p, statfs)); > +} > + > static inline int nfsd_create_is_exclusive(int createmode) > { > return createmode == NFS3_CREATE_EXCLUSIVE
Attachment:
0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch
Description: Binary data