Re: any idea about auto export multiple btrfs snapshots?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

This patch works very well. Thanks a lot.
-  crossmnt of btrfs subvol works as expected.
-  nfs/umount subvol works well.
-  pseudo mount point inode(255) is good.

I test it in 5.10.45 with a few minor rebase.
( see 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch,
just fs/nfsd/nfs3xdr.c rebase)

But when I tested it with another btrfs system without subvol but with
more data, 'find /nfs/test' caused a OOPS .  and this OOPS will not
happen just without this patch.

The data in this filesystem is created/left by xfstest(FSTYP=nfs,
TEST_DEV).

#nfs4 option: default mount.nfs4, nfs-utils-2.3.3

# access btrfs directly
$ find /mnt/test | wc -l
6612

# access btrfs through nfs
$ find /nfs/test | wc -l

[  466.164329] BUG: kernel NULL pointer dereference, address: 0000000000000004
[  466.172123] #PF: supervisor read access in kernel mode
[  466.177857] #PF: error_code(0x0000) - not-present page
[  466.183601] PGD 0 P4D 0
[  466.186443] Oops: 0000 [#1] SMP NOPTI
[  466.190536] CPU: 27 PID: 1819 Comm: nfsd Not tainted 5.10.45-7.el7.x86_64 #1
[  466.198418] Hardware name: Dell Inc. PowerEdge T620/02CD1V, BIOS 2.9.0 12/06/2019
[  466.206806] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[  466.212067] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[  466.233061] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[  466.238894] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[  466.246872] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[  466.254848] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[  466.262825] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[  466.270802] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[  466.278779] FS:  0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[  466.287823] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.294246] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[  466.302222] Call Trace:
[  466.304970]  nfsd4_encode_fattr+0x15ac/0x1940 [nfsd]
[  466.310557]  ? btrfs_verify_level_key+0xad/0xf0 [btrfs]
[  466.316413]  ? btrfs_search_slot+0x8e3/0x900 [btrfs]
[  466.321973]  nfsd4_encode_dirent+0x160/0x3b0 [nfsd]
[  466.327434]  nfsd_readdir+0x199/0x240 [nfsd]
[  466.332215]  ? nfsd4_encode_getattr+0x30/0x30 [nfsd]
[  466.337771]  ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
[  466.343714]  ? security_prepare_creds+0x6f/0xa0
[  466.348788]  nfsd4_encode_readdir+0xd9/0x1c0 [nfsd]
[  466.354250]  nfsd4_encode_operation+0x9b/0x1b0 [nfsd]
[  466.360430]  nfsd4_proc_compound+0x4e3/0x710 [nfsd]
[  466.366352]  nfsd_dispatch+0xd4/0x180 [nfsd]
[  466.371620]  svc_process_common+0x392/0x6c0 [sunrpc]
[  466.377650]  ? svc_recv+0x3c4/0x8a0 [sunrpc]
[  466.382883]  ? nfsd_svc+0x300/0x300 [nfsd]
[  466.387908]  ? nfsd_destroy+0x60/0x60 [nfsd]
[  466.393126]  svc_process+0xb7/0xf0 [sunrpc]
[  466.398234]  nfsd+0xe8/0x140 [nfsd]
[  466.402555]  kthread+0x116/0x130
[  466.406579]  ? kthread_park+0x80/0x80
[  466.411091]  ret_from_fork+0x1f/0x30
[  466.415499] Modules linked in: acpi_ipmi rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill intel_rapl_m
sr intel_rapl_common iTCO_wdt intel_pmc_bxt iTCO_vendor_support dcdbas ipmi_ssif sb_edac x86_pkg_temp_thermal
intel_powerclamp coretemp kvm_intel kvm irqbypass ipmi_si rapl intel_cstate mei_me ipmi_devintf intel_uncore j
oydev mei ipmi_msghandler lpc_ich acpi_power_meter nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt nfsd rdma
_rxe ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl lockd grace nfs_ssc ip_tables xfs mgag200
 drm_kms_helper crct10dif_pclmul crc32_pclmul btrfs cec crc32c_intel xor bnx2x raid6_pq drm igb mpt3sas ghash_
clmulni_intel pcspkr nvme megaraid_sas mdio nvme_core dca raid_class i2c_algo_bit scsi_transport_sas wmi dm_mu
ltipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua sunrpc i2c_dev
[  466.499551] CR2: 0000000000000004
[  466.503759] ---[ end trace 91eb52bf0cb65801 ]---
[  466.511948] RIP: 0010:fsid_source+0x7/0x50 [nfsd]
[  466.517714] Code: e8 3e f9 ff ff 48 c7 c7 40 5a 90 c0 48 89 c6 e8 18 5a 1f d3 44 8b 14 24 e9 a2 f9 ff ff e9
 f7 3e 03 00 90 0f 1f 44 00 00 31 c0 <80> 7f 04 01 75 2d 0f b6 47 06 48 8b 97 90 00 00 00 84 c0 74 1f 83
[  466.539753] RSP: 0018:ffff9cdd0d3479d0 EFLAGS: 00010246
[  466.546122] RAX: 0000000000000000 RBX: 0000000000010abc RCX: ffff8f50f3049b00
[  466.554625] RDX: 0000000000000008 RSI: 0000000000000000 RDI: 0000000000000000
[  466.563096] RBP: ffff9cdd0d347c68 R08: 0000000aaeb00000 R09: 0000000000000001
[  466.571572] R10: 0000000000010000 R11: 0000000000110000 R12: ffff8f30510f8000
[  466.580024] R13: ffff8f4fdabb2090 R14: ffff8f30c0b95600 R15: 0000000000000018
[  466.588487] FS:  0000000000000000(0000) GS:ffff8f5f7fb40000(0000) knlGS:0000000000000000
[  466.598032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  466.604973] CR2: 0000000000000004 CR3: 00000014bfa10003 CR4: 00000000001706e0
[  466.613467] Kernel panic - not syncing: Fatal exception
[  466.807651] Kernel Offset: 0x12000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xfffff
fffbfffffff)
[  466.823190] ---[ end Kernel panic - not syncing: Fatal exception ]---


Best Regards
Wang Yugui (wangyugui@xxxxxxxxxxxx)
2021/06/23

> On Tue, 22 Jun 2021, Wang Yugui wrote:
> > > 
> > > btrfs subvol should be treated as virtual 'mount point' for nfsd in follow_down().
> > 
> > btrfs subvol crossmnt begin to work, although buggy.
> > 
> > some subvol is crossmnt-ed, some subvol is yet not, and some dir is
> > wrongly crossmnt-ed
> > 
> > 'stat /nfs/test /nfs/test/sub1' will cause btrfs subvol crossmnt begin
> > to happen.
> > 
> > This is the current patch based on 5.10.44. 
> > At least nfsd_follow_up() is buggy.
> > 
> 
> I don't think the approach you are taking makes sense.  Let me explain
> why.
> 
> The problem is that applications on the NFS client can see different
> files or directories on the same (apparent) filesystem with the same
> inode number.  Most application won't care and NFS itself doesn't get
> confused by the duplicate inode numbers, but 'find' and similar programs
> (probably 'tar' for example) do get upset.
> 
> This happens because BTRFS reuses inode numbers in subvols which it
> presents to the kernel as all part of the one filesystem (or at least,
> all part of the one mount point).  NFSD only sees one filesystem, and so
> reports the same filesystem-id (fsid) for all objects.  The NFS client
> then sees that the fsid is the same and tells applications that the
> objects are all in the one filesystem.
> 
> To fix this, we need to make sure that nfsd reports a different fsid for
> objects in different subvols.  There are two obvious ways to do this.
> 
> One is to teach nfsd to recognize btrfs subvolumes exactly like separate
> filesystems (as nfsd already ensure each filesystem gets its own fsid).
> This is the approach of my first suggestion.  It requires changing
> nfsd_mountpoint() and follow_up() and any other code that is aware of
> different filesytems.  As I mentioned, it also requires changing mountd
> to be able to extract a list of subvols from btrfs because they don't
> appear in /proc/mounts.  
> 
> As you might know an NFS filehandle has 3 parts: a header, a filesystem
> identifier, and an inode identifier.  This approach would involve giving
> different subvols different filesystem identifiers in the filehandle.
> This, it turns out is a very big change - bigger than I at first
> imagined.
> 
> The second obvious approach is to leave the filehandles unchanged and to
> continue to treat an entire btrfs filesystem as a single filesystem
> EXCEPT when reporting the fsid to the NFS client.  All we *really* need
> to do is make sure the client sees a different fsid when it enters a
> part of the filesystem which re-uses inode numbers.  This is what my
> latest patch did.
> 
> Your patch seems to combine ideas from both approaches.  It includes my
> code to replace the fsid, but also intercepts follow_up etc.  This
> cannot be useful.
> 
> As I noted when I posted it, there is a problem with my patch.  I now
> understand that problem.
> 
> When NFS sees that fsid change it needs to create 2 inodes for that
> directory.  One inode will be in the parent filesystem and will be
> marked as an auto-mount point so that any lookup below that directory
> will trigger an internal mount.  The other inode is the root of the
> child filesystem.  It gets mounted on the first inode.
> 
> With normal filesystem mounts, there really is an inode in the parent
> filesystem and NFS can find it (with NFSv4) using the MOUNTED_ON_FILEID
> attribute.  This fileid will be different from all other inode numbers
> in the parent filesystem.
> 
> With BTRFS there is no inode in the parent volume (as far as I know) so
> there is nothing useful to return for MOUNTED_ON_FILEID.  This results
> in NFS using the same inode number for the inode in the parent
> filesystem as the inode in the child filesystem.  For btrfs, this will
> be 256.  As there is already an inode in the parent filesystem with inum
> 256, 'find' complains.
> 
> The following patch addresses this by adding code to nfsd when it
> determines MOUINTD_ON_FILEID to choose an number that should be unused
> in btrfs.  With this change, 'find' seems to work correctly with NFSv4
> mounts of btrfs.
> 
> This doesn't work with NFSv3 as NFSv3 doesn't have the MOUNTED_ON_FILEID
> attribute - strictly speaking, the NFSv3 protocol doesn't support
> crossing mount points, though the Linux implementation does allow it.
> 
> So this patch works and, I think, is the best we can do in terms of
> functionality.  I don't like the details of the implementation though.
> It requires NFSD to know too much about BTRFS internals.
> 
> I think I would like btrfs to make it clear where a subvol started,
> maybe by setting DCACHE_MOUNTED on the dentry.  This flag is only a
> hint, not a promise of anything, so other code should get confused.
> This would have nfsd calling vfs_statfs quite so often ....  maybe that
> isn't such a big deal.
> 
> More importantly, there needs to be some way for NFSD to find an inode
> number to report for the MOUNTED_ON_FILEID.  This needs to be a number
> not used elsewhere in the filesystem.  It might be safe to use the
> same fileid for all subvols (as my patch currently does), but we would
> need to confirm that 'find' and 'tar' don't complain about that or
> mishandle it.  If it is safe to use the same fileid, then a new field in
> the superblock to store it might work.  If a different fileid is needed,
> the we might need a new field in 'struct kstatfs', so vfs_statfs can
> report it.
> 
> Anyway, here is my current patch.  It includes support for NFSv3 as well
> as NFSv4.
> 
> NeilBrown
> 
> diff --git a/fs/nfsd/export.c b/fs/nfsd/export.c
> index 9421dae22737..790a3357525d 100644
> --- a/fs/nfsd/export.c
> +++ b/fs/nfsd/export.c
> @@ -15,6 +15,7 @@
>  #include <linux/slab.h>
>  #include <linux/namei.h>
>  #include <linux/module.h>
> +#include <linux/statfs.h>
>  #include <linux/exportfs.h>
>  #include <linux/sunrpc/svc_xprt.h>
>  
> @@ -575,6 +576,7 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	int err;
>  	struct auth_domain *dom = NULL;
>  	struct svc_export exp = {}, *expp;
> +	struct kstatfs statfs;
>  	int an_int;
>  
>  	if (mesg[mlen-1] != '\n')
> @@ -604,6 +606,10 @@ static int svc_export_parse(struct cache_detail *cd, char *mesg, int mlen)
>  	err = kern_path(buf, 0, &exp.ex_path);
>  	if (err)
>  		goto out1;
> +	err = vfs_statfs(&exp.ex_path, &statfs);
> +	if (err)
> +		goto out3;
> +	exp.ex_fsid64 = statfs.f_fsid;
>  
>  	exp.ex_client = dom;
>  	exp.cd = cd;
> @@ -809,6 +815,7 @@ static void export_update(struct cache_head *cnew, struct cache_head *citem)
>  	new->ex_anon_uid = item->ex_anon_uid;
>  	new->ex_anon_gid = item->ex_anon_gid;
>  	new->ex_fsid = item->ex_fsid;
> +	new->ex_fsid64 = item->ex_fsid64;
>  	new->ex_devid_map = item->ex_devid_map;
>  	item->ex_devid_map = NULL;
>  	new->ex_uuid = item->ex_uuid;
> diff --git a/fs/nfsd/export.h b/fs/nfsd/export.h
> index ee0e3aba4a6e..d3eb9a599918 100644
> --- a/fs/nfsd/export.h
> +++ b/fs/nfsd/export.h
> @@ -68,6 +68,7 @@ struct svc_export {
>  	kuid_t			ex_anon_uid;
>  	kgid_t			ex_anon_gid;
>  	int			ex_fsid;
> +	__kernel_fsid_t		ex_fsid64;
>  	unsigned char *		ex_uuid; /* 16 byte fsid */
>  	struct nfsd4_fs_locations ex_fslocs;
>  	uint32_t		ex_nflavors;
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 0a5ebc52e6a9..f11ba3434fd6 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -367,10 +367,18 @@ svcxdr_encode_fattr3(struct svc_rqst *rqstp, struct xdr_stream *xdr,
>  	case FSIDSOURCE_FSID:
>  		fsid = (u64)fhp->fh_export->ex_fsid;
>  		break;
> -	case FSIDSOURCE_UUID:
> +	case FSIDSOURCE_UUID: {
> +		struct kstatfs statfs;
> +
>  		fsid = ((u64 *)fhp->fh_export->ex_uuid)[0];
>  		fsid ^= ((u64 *)fhp->fh_export->ex_uuid)[1];
> +		if (fh_getstafs(fhp, &statfs) == 0 &&
> +		    (statfs.f_fsid.val[0] != fhp->fh_export->ex_fsid64.val[0] ||
> +		     statfs.f_fsid.val[1] != fhp->fh_export->ex_fsid64.val[1]))
> +			/* looks like a btrfs subvol */
> +			fsid = statfs.f_fsid.val[0] ^ statfs.f_fsid.val[1];
>  		break;
> +		}
>  	default:
>  		fsid = (u64)huge_encode_dev(fhp->fh_dentry->d_sb->s_dev);
>  	}
> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
> index 7abeccb975b2..5f614d1b362e 100644
> --- a/fs/nfsd/nfs4xdr.c
> +++ b/fs/nfsd/nfs4xdr.c
> @@ -42,6 +42,7 @@
>  #include <linux/sunrpc/svcauth_gss.h>
>  #include <linux/sunrpc/addr.h>
>  #include <linux/xattr.h>
> +#include <linux/btrfs_tree.h>
>  #include <uapi/linux/xattr.h>
>  
>  #include "idmap.h"
> @@ -2869,8 +2870,10 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  	if (err)
>  		goto out_nfserr;
>  	if ((bmval0 & (FATTR4_WORD0_FILES_AVAIL | FATTR4_WORD0_FILES_FREE |
> +		       FATTR4_WORD0_FSID |
>  			FATTR4_WORD0_FILES_TOTAL | FATTR4_WORD0_MAXNAME)) ||
>  	    (bmval1 & (FATTR4_WORD1_SPACE_AVAIL | FATTR4_WORD1_SPACE_FREE |
> +		       FATTR4_WORD1_MOUNTED_ON_FILEID |
>  		       FATTR4_WORD1_SPACE_TOTAL))) {
>  		err = vfs_statfs(&path, &statfs);
>  		if (err)
> @@ -3024,6 +3027,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  		case FSIDSOURCE_UUID:
>  			p = xdr_encode_opaque_fixed(p, exp->ex_uuid,
>  								EX_UUID_LEN);
> +			if (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +			    statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]) {
> +				/* looks like a btrfs subvol */
> +				p[-2] ^= statfs.f_fsid.val[0];
> +				p[-1] ^= statfs.f_fsid.val[1];
> +			}
>  			break;
>  		}
>  	}
> @@ -3286,6 +3295,12 @@ nfsd4_encode_fattr(struct xdr_stream *xdr, struct svc_fh *fhp,
>  				goto out_nfserr;
>  			ino = parent_stat.ino;
>  		}
> +		if (fsid_source(fhp) == FSIDSOURCE_UUID &&
> +		    (statfs.f_fsid.val[0] != exp->ex_fsid64.val[0] ||
> +		     statfs.f_fsid.val[1] != exp->ex_fsid64.val[1]))
> +			    /* btrfs subvol pseudo mount point */
> +			    ino = BTRFS_FIRST_FREE_OBJECTID-1;
> +
>  		p = xdr_encode_hyper(p, ino);
>  	}
>  #ifdef CONFIG_NFSD_PNFS
> diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
> index b21b76e6b9a8..82b76b0b7bec 100644
> --- a/fs/nfsd/vfs.h
> +++ b/fs/nfsd/vfs.h
> @@ -160,6 +160,13 @@ static inline __be32 fh_getattr(const struct svc_fh *fh, struct kstat *stat)
>  				    AT_STATX_SYNC_AS_STAT));
>  }
>  
> +static inline __be32 fh_getstafs(const struct svc_fh *fh, struct kstatfs *statfs)
> +{
> +	struct path p = {.mnt = fh->fh_export->ex_path.mnt,
> +			 .dentry = fh->fh_dentry};
> +	return nfserrno(vfs_statfs(&p, statfs));
> +}
> +
>  static inline int nfsd_create_is_exclusive(int createmode)
>  {
>  	return createmode == NFS3_CREATE_EXCLUSIVE

Attachment: 0001-any-idea-about-auto-export-multiple-btrfs-snapshots.patch
Description: Binary data


[Index of Archives]     [Linux Filesystem Development]     [Linux USB Development]     [Linux Media Development]     [Video for Linux]     [Linux NILFS]     [Linux Audio Users]     [Yosemite Info]     [Linux SCSI]

  Powered by Linux