On Thu, Nov 2, 2023 at 7:13 AM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 01, 2023 at 10:52:18AM +0100, Christian Brauner wrote:
> > On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> > >
> > >
> > > On 2023/11/1 18:46, Christian Brauner wrote:
> > > > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > > > So this is effectively a request for:
> > > > > >
> > > > > > btrfs subvolume create /mnt/subvol1
> > > > > >
> > > > > > to create vfsmounts? IOW,
> > > > > >
> > > > > > mkfs.btrfs /dev/sda
> > > > > > mount /dev/sda /mnt
> > > > > > btrfs subvolume create /mnt/subvol1
> > > > > > btrfs subvolume create /mnt/subvol2
> > > > > >
> > > > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > > > afterwards?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > > >
> > > > > Why aren't they?
> > > > >
> > > > > > And so you'd be showing potentially hundreds of mounts in
> > > > > > /proc/<pid>/mountinfo that you can't unmount?
> > > > >
> > > > > Why would you not allow them to be unmounted?
> > > > >
> > > > > > And even if you treat them as mounted what would unmounting mean?
> > > > >
> > > > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > > > of the file system / subvolume crossing (the location.type !=
> > > > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > > >
> > > > So today, when we do:
> > > >
> > > > mkfs.btrfs -f /dev/sda
> > > > mount -t btrfs /dev/sda /mnt
> > > > btrfs subvolume create /mnt/subvol1
> > > > btrfs subvolume create /mnt/subvol2
> > > >
> > > > Then all subvolumes are always visible under /mnt.
> > > > IOW, you can't hide them other than by overmounting or destroying them.
> > > >
> > > > If we make subvolumes vfsmounts then we very likely alter this behavior
> > > > and I see two obvious options:
> > > >
> > > > (1) They are fake vfsmounts that can't be unmounted:
> > > >
> > > > umount /mnt/subvol1 # returns -EINVAL
> > > >
> > > > This retains the invariant that every subvolume is always visible
> > > > from the filesystem's root, i.e., /mnt will include /mnt/subvol{1,2}
> > >
> > > I'd like to go with this option. But I still have a question.
> > >
> > > How do we properly unmount a btrfs?
> > > Do we have some other way to record which subvolume is really mounted
> > > and which is just a placeholder?
> >
> > So the downside of this really is that this would be custom btrfs
> > semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
> > only happens in weird corner cases today:
> >
> > * mounts inherited during unprivileged mount namespace creation
> > * locked mounts
> >
> > Both of which are pretty inelegant and effectively only exist because of
> > user namespaces. So if we can avoid proliferating such semantics it
> > would be preferable.
> >
> > I think it would also be rather confusing for userspace to be presented
> > with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
> > anything with.
> >
> > > > (2) They are proper vfsmounts:
> > > >
> > > > umount /mnt/subvol1 # succeeds
> > > >
> > > > This retains standard semantics for userspace about anything that
> > > > shows up in /proc/<pid>/mountinfo, but means that after
> > > > umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> > > > the filesystem root /mnt anymore.
> > > >
> > > > Both options can be made to work from a purely technical perspective;
> > > > I'm asking which one it has to be because it isn't clear just from the
> > > > snippets in this thread.
> > > >
> > > > One should also point out that if each subvolume is a vfsmount, then say
> > > > a btrfs filesystem with 1000 subvolumes which is mounted from the root:
> > > >
> > > > mount -t btrfs /dev/sda /mnt
> > > >
> > > > could be exploded into 1000 individual mounts, which many users might not want.
> > >
> > > Can we make it dynamic? AKA, btrfs_insert_fs_root() is the perfect
> > > timing here.
> >
> > Probably, it would be an automount. Though I would have to recheck that
> > code to see how exactly that would work, but roughly: when you add the
> > inode for the subvolume you raise S_AUTOMOUNT on it and then you add
> > .d_automount for btrfs.
>
> Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks
> it does.
>
> However I ran into some weirdness where I need to support the new mount API, so
> that's what I've been doing since I wandered away from this thread. I should
> have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing on top
> of that.
>
> But I have the same questions as you, Christian. I'm not entirely sure how this
> is supposed to be better. Even if they show up in /proc/mounts, it's not going
> to do anything useful for the applications that don't check /proc/mounts to see
> if they've wandered into a new mount. I also don't quite understand how NFS
> suddenly knows it's wandered into a new mount with a vfsmount.
>

IIUC, the NFS server communicates "same filesystem" vs. "different filesystem"
to the client by means of an FSID. This is not the same as f_fsid from
statfs(); it's the FSID communicated to the nfsd server via /etc/exports, or
guessed by nfsd for a blockdev fs. IIRC, the NFS FSID is 128bit (?).

If we implemented the s_op->get_fsid() method that Jan has suggested [1],
we could make it mean: "nfsd, please use this as the NFS FSID instead of
assuming that all inodes on this sb should be presented to the client as a
uniform FSID".
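[For readers following along: Christian's rough description maps onto existing
VFS hooks. A purely hypothetical sketch, not taken from any posted patch (where
the flag is raised and the callback body are assumptions); NFS client submounts
and autofs use the same mechanism today:]

```c
/* Hypothetical sketch only -- not from a posted patch. */

/* 1) When btrfs instantiates the root inode of a subvolume, mark it as an
 *    automount point so the VFS triggers ->d_automount when a lookup
 *    crosses into it: */
	inode->i_flags |= S_AUTOMOUNT;

/* 2) Provide the automount callback in btrfs' dentry_operations.  It must
 *    return a new vfsmount for the subvolume rooted at path->dentry: */
static struct vfsmount *btrfs_d_automount(struct path *path)
{
	/* Construct and return a mount of the subvolume that shares the
	 * existing superblock -- e.g. via a filesystem context; the
	 * details are exactly what Josef is prototyping. */
	...
}

static const struct dentry_operations btrfs_dentry_operations = {
	.d_automount	= btrfs_d_automount,
};
```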
[1] https://lore.kernel.org/linux-fsdevel/20231025135048.36153-2-amir73il@xxxxxxxxx/

> At this point I'm tired of it being brought up in every conversation where we
> try to expose more information to the users. So I'll write the patches and as
> long as they don't break anything we can merge it, but I don't think it'll make
> a single bit of difference.
>
> We'll be converted to the new mount API tho, so I suppose that's something.

Hooray!

I had just noticed that since Monday, we have a new fs on the block with
multiple subvols on the same sb - bcachefs.

I took a look at bch2_statfs(), bch2_encode_fh() and bch2_getattr() and
I can't help wondering, what's different from btrfs?

Kent,

Is inode->v.i_ino unique for all inodes in a bcachefs sb, and is
inode->ei_inode.bi_inum unique within a subvol?
If so, how is a cloned inode number allocated?

BTW1: bch2_statfs() open codes uuid_to_fsid().

BTW2: Any reason why bcachefs uses bch_sb.uuid and not sb.s_uuid?
If you publish bch_sb.uuid to vfs via sb.s_uuid, you can use bcachefs as a
layer in overlayfs for advanced features like NFS export of overlayfs -
without s_uuid, those features will not be available for overlayfs over
bcachefs.

Thanks,
Amir.