Re: [PATCH/RFC] NFSD: handle BTRFS subvolumes better.

On 7/15/21 6:37 PM, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
>> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
>>> On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
>>>> Because there's no alternative.  We need a way to tell userspace they've
>>>> wandered into a different inode namespace.  There's no argument that what
>>>> we're doing is ugly, but there's never been a clear "do X instead".  Just a
>>>> lot of whinging that btrfs is broken.  This makes userspace happy and is
>>>> simple and straightforward.  I'm open to alternatives, but there have been 0
>>>> workable alternatives proposed in the last decade of complaining about it.
>>>
>>> Make sure we cross a vfsmount when crossing the "st_dev" domain so
>>> that it is properly reported.   Suggested many times and ignored all
>>> the time because it requires a bit of work.
>>
>> You keep telling me this but forgetting that I did all this work when you
>> originally suggested it.  The problem I ran into was that the automount stuff
>> requires that we have a completely different superblock for every vfsmount.
>> This is fine for things like nfs or samba where the automount literally points
>> to a completely different mount, but doesn't work for btrfs where it's on the
>> same file system.  If you have 1000 subvolumes and run sync() you're going to
>> write the superblock 1000 times for the same file system.  You are going to
>> reclaim inodes on the same file system 1000 times.  You are going to reclaim
>> dcache on the same filesystem 1000 times.  You are also going to pin 1000
>> dentries/inodes into memory whenever you wander into these things because the
>> super is going to hold them open.
>>
>> This is not a workable solution.  It's not a matter of simply tying into
>> existing infrastructure, we'd have to completely rework how the VFS deals with
>> this stuff in order to be reasonable.  And when I brought this up to Al he told
>> me I was insane and we absolutely had to have a different SB for every vfsmount,
>> which means we can't use vfsmount for this, which means we don't have any other
>> options.  Thanks,
>
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS".  NFS still shares a
> lot between the multiple superblocks - certainly it shares the same
> connection to the server.
>
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems.  It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches.  So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
>
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in.  Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....
>
> The idea of a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me.  The "st_dev" is just part of the name, and already a
> fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
> same st_dev.  I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.


OK, so setting aside btrfs for the moment, how does NFS deal with exporting a directory that has multiple other file systems mounted under that tree? I assume the same sort of problem doesn't occur there, but why is that? Is it because each one is a different vfsmount/sb, or is there some other magic making this work? Thanks,

Josef
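The thread turns on what userspace can infer from st_dev: btrfs reports a separate st_dev for each subvolume so that tools notice they have entered a new inode namespace, while bind mounts show that an equal st_dev does not imply the same mount. As a rough userspace illustration (not nfsd or btrfs code), the sketch below compares stat() results in roughly the way tools such as `find -xdev` detect a filesystem boundary. The helper names same_st_dev, same_file and crossed_st_dev_domain are hypothetical, and the 4096-byte path buffer is an arbitrary assumption.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Compare the st_dev of two paths, roughly how "stay on one filesystem"
 * tools notice a boundary. Returns 1 if equal, 0 if different, -1 if
 * stat() fails on either path. */
int same_st_dev(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev;
}

/* Two names denote the same file only when BOTH st_dev and st_ino match:
 * inode numbers are only unique within a single st_dev "domain". */
int same_file(const char *a, const char *b)
{
    struct stat sa, sb;

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Hypothetical helper: does `dir` sit in a different st_dev domain than
 * its parent (e.g. a mount point or a btrfs subvolume root)?
 * Returns 1 if it crossed, 0 if not, -1 on error (including a path too
 * long to append "/.." to). */
int crossed_st_dev_domain(const char *dir)
{
    char parent[4096]; /* assumed buffer size, not PATH_MAX-exact */
    int n = snprintf(parent, sizeof parent, "%s/..", dir);
    int r;

    if (n < 0 || (size_t)n >= sizeof parent)
        return -1;
    r = same_st_dev(dir, parent);
    return r < 0 ? -1 : !r;
}
```

On btrfs, calling crossed_st_dev_domain() on a subvolume root would be expected to return 1, just as at an ordinary mount point. A result of 0 says nothing about whether the two paths share a vfsmount, since bind mounts can expose the same st_dev through different mounts.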



