Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Mon, 19 Jul 2021 11:49:07 -0400

On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
> > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > >> Because there's no alternative.  We need a way to tell userspace they've
> > >> wandered into a different inode namespace.  There's no argument that what
> > >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> > >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> > >> simple and straightforward.  I'm open to alternatives, but there have been 0
> > >> workable alternatives proposed in the last decade of complaining about it.
> > > 
> > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > that it is properly reported.  Suggested many times and ignored all
> > > the time because it requires a bit of work.
> > > 
> > 
> > You keep telling me this but forgetting that I did all this work when you 
> > originally suggested it.  The problem I ran into was the automount stuff 
> > requires that we have a completely different superblock for every vfsmount. 
> > This is fine for things like nfs or samba where the automount literally points 
> > to a completely different mount, but doesn't work for btrfs where it's on the 
> > same file system.  If you have 1000 subvolumes and run sync() you're going to 
> > write the superblock 1000 times for the same file system.  You are going to 
> > reclaim inodes on the same file system 1000 times.  You are going to reclaim 
> > dcache on the same filesystem 1000 times.  You are also going to pin 1000 
> > dentries/inodes into memory whenever you wander into these things because the 
> > super is going to hold them open.
> > 
> > This is not a workable solution.  It's not a matter of simply tying into 
> > existing infrastructure, we'd have to completely rework how the VFS deals with 
> > this stuff in order to be reasonable.  And when I brought this up to Al he told 
> > me I was insane and we absolutely had to have a different SB for every vfsmount, 
> > which means we can't use vfsmount for this, which means we don't have any other 
> > options.  Thanks,
> 
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS".  NFS still shares a
> lot between the multiple superblocks - certainly it shares the same
> connection to the server.
> 
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems.

Yes.  I wish it was....  But we'd need some way to look up a
not-currently-mounted filesystem by filehandle:

> It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches.  So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
> 
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in.  Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....

So nfsd would use the "major" ID to find the parent export, and then
btrfs would use the "minor" ID to identify the subvolume?
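
To make the question concrete, here is a rough userspace sketch of the
two-step lookup I have in mind.  It is only a toy model, not nfsd or btrfs
code: the Mount class, MOUNT_TABLE, find_mount() and resolve() are all
hypothetical names, and the IDs and paths are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Mount:
    """Toy stand-in for a mounted, exported filesystem (not a kernel struct)."""
    path: str
    major: int  # hypothetical: the part of the fsid that names the export
    # hypothetical: minor ID -> subvolume name, owned by the filesystem
    subvols: Dict[int, str] = field(default_factory=dict)


# Made-up sample data: two exports, one of which has subvolumes.
MOUNT_TABLE: List[Mount] = [
    Mount("/export/data", major=1, subvols={0: "toplevel", 256: "snap-a"}),
    Mount("/export/home", major=2, subvols={0: "toplevel"}),
]


def find_mount(table: List[Mount], major: int) -> Optional[Mount]:
    """Step 1: walk the mount table for the export matching 'major'."""
    for mount in table:
        if mount.major == major:
            return mount
    return None


def resolve(table: List[Mount], major: int, minor: int) -> Optional[Tuple[str, str]]:
    """Step 2: let the filesystem map 'minor' to one of its subvolumes."""
    mount = find_mount(table, major)
    if mount is None:
        return None  # no such export is currently mounted
    name = mount.subvols.get(minor)
    if name is None:
        return None  # export found, but it has no such subvolume
    return (mount.path, name)
```

The point of the split is that only the parent export has to appear in the
mount table; the subvolume is found by the filesystem itself and never needs
a mount of its own.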

--b.

> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me.  The "st_dev" is just part of the name, and already a
> fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
> same st_dev.  I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.
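
For reference, this is roughly how userspace treats st_dev as "part of the
name": a file is identified by the (st_dev, st_ino) pair rather than by
st_ino alone.  A minimal sketch, assuming nothing beyond the standard stat
fields; file_key(), same_file() and demo() are names invented here.

```python
import os
import tempfile


def file_key(path):
    """Return the (st_dev, st_ino) pair that userspace uses as a file's identity."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def same_file(a, b):
    """True if both paths resolve to the same (st_dev, st_ino) pair."""
    return file_key(a) == file_key(b)


def demo():
    """Create two scratch files and compare their identities."""
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        for p in (a, b):
            with open(p, "w"):
                pass  # just create an empty file
        # A path always matches itself; two distinct files must not match.
        return same_file(a, a), same_file(a, b)
```

If two subvolumes reuse inode numbers, only a differing st_dev keeps such a
comparison from wrongly reporting two different files as the same one.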