On Fri, 16 Jul 2021, Josef Bacik wrote:
> On 7/15/21 1:24 PM, Christoph Hellwig wrote:
> > On Thu, Jul 15, 2021 at 01:11:29PM -0400, Josef Bacik wrote:
> >> Because there's no alternative.  We need a way to tell userspace they've
> >> wandered into a different inode namespace.  There's no argument that what
> >> we're doing is ugly, but there's never been a clear "do X instead".  Just a
> >> lot of whinging that btrfs is broken.  This makes userspace happy and is
> >> simple and straightforward.  I'm open to alternatives, but there have been 0
> >> workable alternatives proposed in the last decade of complaining about it.
> >
> > Make sure we cross a vfsmount when crossing the "st_dev" domain so
> > that it is properly reported.  Suggested many times and ignored all
> > the time because it requires a bit of work.
> >
> You keep telling me this but forgetting that I did all this work when you
> originally suggested it.  The problem I ran into was the automount stuff
> requires that we have a completely different superblock for every vfsmount.
> This is fine for things like nfs or samba where the automount literally points
> to a completely different mount, but doesn't work for btrfs where it's on the
> same file system.  If you have 1000 subvolumes and run sync() you're going to
> write the superblock 1000 times for the same file system.  You are going to
> reclaim inodes on the same file system 1000 times.  You are going to reclaim
> dcache on the same filesystem 1000 times.  You are also going to pin 1000
> dentries/inodes into memory whenever you wander into these things because the
> super is going to hold them open.
>
> This is not a workable solution.  It's not a matter of simply tying into
> existing infrastructure, we'd have to completely rework how the VFS deals with
> this stuff in order to be reasonable.  And when I brought this up to Al he told
> me I was insane and we absolutely had to have a different SB for every vfsmount,
> which means we can't use vfsmount for this, which means we don't have any other
> options.  Thanks,

When I was first looking at this, I thought that separate vfsmnts
and auto-mounting was the way to go "just like NFS".  NFS still shares a
lot between the multiple superblocks - certainly it shares the same
connection to the server.

But I dropped the idea when Bruce pointed out that nfsd is not set up to
export auto-mounted filesystems.  It needs to be able to find a
filesystem given a UUID (extracted from a filehandle), and it does this
by walking through the mount table to find one that matches.  So unless
all btrfs subvols were mounted all the time (which I wouldn't propose),
it would need major work to fix.

NFSv4 describes the fsid as having a "major" and "minor" component.
We've never treated these as having an important meaning - just extra
bits to encode uniqueness in.  Maybe we should have used "major" for the
vfsmnt, and kept "minor" for the subvol.....

The idea for a single vfsmnt exposing multiple inode-name-spaces does
appeal to me.  The "st_dev" is just part of the name, and already a
fairly blurry part.  Thanks to bind mounts, multiple mounts can have the
same st_dev.  I see no intrinsic reason that a single mount should not
have multiple fsids, provided that a coherent picture is provided to
userspace which doesn't contain too many surprises.

NeilBrown
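
To make the "st_dev is just part of the name" point concrete, here is a
rough userspace sketch in Python.  It assumes only the standard os.lstat()
interface; the helper names inode_key() and crosses_st_dev() are made up
for illustration and are not from any existing tool.  Programs such as
"find -xdev" and "du -x" compare st_dev in much this way to decide whether
they have left a filesystem, and backup tools generally treat the
(st_dev, st_ino) pair as a file's identity.

```python
import os
import tempfile


def inode_key(path):
    """Return the (st_dev, st_ino) pair userspace uses as a file's identity."""
    st = os.lstat(path)
    return (st.st_dev, st.st_ino)


def crosses_st_dev(first, second):
    """Return True if the two paths report different st_dev values.

    Hypothetical helper: this is the only signal a stat()-based tool has
    that it has wandered into a different inode namespace.
    """
    return os.lstat(first).st_dev != os.lstat(second).st_dev


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a")
        b = os.path.join(d, "b")
        for p in (a, b):
            open(p, "w").close()
        # Two files created side by side: distinct identities, and normally
        # the same st_dev.
        print(inode_key(a), inode_key(b), crosses_st_dev(a, b))
```

Nothing here knows about vfsmounts or superblocks: if inode numbers can
repeat across subvolumes, a distinct st_dev per subvolume is what keeps
(st_dev, st_ino) unique for code like this.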