On Fri, Jul 16, 2021 at 08:37:07AM +1000, NeilBrown wrote:
> On Fri, 16 Jul 2021, Josef Bacik wrote:
> > On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> > >> Because there's no alternative.  We need a way to tell userspace they've
> > >> wandered into a different inode namespace.  There's no argument that what
> > >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> > >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> > >> simple and straightforward.  I'm open to alternatives, but there have been 0
> > >> workable alternatives proposed in the last decade of complaining about it.
> > >
> > > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > > that it is properly reported.  Suggested many times and ignored all
> > > the time because it requires a bit of work.
> > >
> >
> > You keep telling me this but forgetting that I did all this work when you
> > originally suggested it.  The problem I ran into was the automount stuff
> > requires that we have a completely different superblock for every vfsmount.
> > This is fine for things like nfs or samba where the automount literally points
> > to a completely different mount, but doesn't work for btrfs where it's on the
> > same file system.  If you have 1000 subvolumes and run sync() you're going to
> > write the superblock 1000 times for the same file system.  You are going to
> > reclaim inodes on the same file system 1000 times.  You are going to reclaim
> > dcache on the same filesystem 1000 times.  You are also going to pin 1000
> > dentries/inodes into memory whenever you wander into these things because the
> > super is going to hold them open.
> >
> > This is not a workable solution.  It's not a matter of simply tying into
> > existing infrastructure, we'd have to completely rework how the VFS deals with
> > this stuff in order to be reasonable.  And when I brought this up to Al he told
> > me I was insane and we absolutely had to have a different SB for every vfsmount,
> > which means we can't use vfsmount for this, which means we don't have any other
> > options.  Thanks,
>
> When I was first looking at this, I thought that separate vfsmnts
> and auto-mounting was the way to go "just like NFS".  NFS still shares a
> lot between the multiple superblocks - certainly it shares the same
> connection to the server.
>
> But I dropped the idea when Bruce pointed out that nfsd is not set up to
> export auto-mounted filesystems.

Yes.  I wish it was....  But we'd need some way to look up a
not-currently-mounted filesystem by filehandle:

> It needs to be able to find a
> filesystem given a UUID (extracted from a filehandle), and it does this
> by walking through the mount table to find one that matches.  So unless
> all btrfs subvols were mounted all the time (which I wouldn't propose),
> it would need major work to fix.
>
> NFSv4 describes the fsid as having a "major" and "minor" component.
> We've never treated these as having an important meaning - just extra
> bits to encode uniqueness in.  Maybe we should have used "major" for the
> vfsmnt, and kept "minor" for the subvol.....

So nfsd would use the "major" ID to find the parent export, and then
btrfs would use the "minor" ID to identify the subvolume?

--b.

> The idea for a single vfsmnt exposing multiple inode-name-spaces does
> appeal to me.  The "st_dev" is just part of the name, and already a
> fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
> same st_dev.  I see no intrinsic reason that a single mount should not
> have multiple fsids, provided that a coherent picture is provided to
> userspace which doesn't contain too many surprises.
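
For readers following along, here is a rough userspace sketch of what "st_dev is just part of the name" means in practice. It assumes the usual POSIX convention that the pair (st_dev, st_ino) identifies an inode, and it prunes a directory walk whenever st_dev changes, in the spirit of `find -xdev` or `du -x`. The helper names (inode_key, walk_one_namespace, demo) are made up for illustration and are not taken from any tool or from the kernel.

```python
import os
import tempfile


def inode_key(path):
    """Return the (st_dev, st_ino) pair userspace treats as an inode's identity."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def walk_one_namespace(root):
    """Yield paths under root, not descending into directories whose
    st_dev differs from root's (hypothetical helper, illustration only)."""
    root_dev = os.lstat(root).st_dev
    stack = [root]
    while stack:
        current = stack.pop()
        yield current
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    st = entry.stat(follow_symlinks=False)
                    # A different st_dev is the only hint userspace gets
                    # that it has wandered into another inode namespace.
                    if st.st_dev == root_dev:
                        stack.append(entry.path)
                else:
                    yield entry.path


def demo():
    """Build a small tree with a hard link and walk it."""
    with tempfile.TemporaryDirectory() as top:
        sub = os.path.join(top, "sub")
        os.mkdir(sub)
        a = os.path.join(sub, "a")
        with open(a, "w") as f:
            f.write("x")
        b = os.path.join(top, "b")
        os.link(a, b)  # second name for the same inode
        same = inode_key(a) == inode_key(b)
        found = sorted(os.path.relpath(p, top) for p in walk_one_namespace(top))
        return same, found


if __name__ == "__main__":
    print(demo())
```

If inode numbers are only unique per subvolume, a walker like this can only tell two files in different subvolumes apart when their st_dev also differs, which is why the question of how that boundary is reported to userspace matters.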