Re: [PATCH 0/3] fanotify support for btrfs sub-volumes

Christian Brauner <brauner@xxxxxxxxxx> · Wed, 1 Nov 2023 10:52:18 +0100

On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> 
> 
> On 2023/11/1 18:46, Christian Brauner wrote:
> > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > So this is effectively a request for:
> > > > 
> > > > btrfs subvolume create /mnt/subvol1
> > > > 
> > > > to create vfsmounts? IOW,
> > > > 
> > > > mkfs.btrfs /dev/sda
> > > > mount /dev/sda /mnt
> > > > btrfs subvolume create /mnt/subvol1
> > > > btrfs subvolume create /mnt/subvol2
> > > > 
> > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > afterwards?
> > > 
> > > Yes.
> > > 
> > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > 
> > > Why aren't they?
> > > 
> > > > And so you'd be showing potentially hundreds of mounts in
> > > > /proc/<pid>/mountinfo that you can't unmount?
> > > 
> > > Why would you not allow them to be unmounted?
> > > 
> > > > And even if you treat them as mounted what would unmounting mean?
> > > 
> > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > of the file system / subvolume crossing (the location.type !=
> > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > 
> > So today, when we do:
> > 
> > mkfs.btrfs -f /dev/sda
> > mount -t btrfs /dev/sda /mnt
> > btrfs subvolume create /mnt/subvol1
> > btrfs subvolume create /mnt/subvol2
> > 
> > Then all subvolumes are always visible under /mnt.
> > IOW, you can't hide them other than by overmounting or destroying them.
> > 
> > If we make subvolumes vfsmounts then we very likely alter this behavior
> > and I see two obvious options:
> > 
> > (1) They are fake vfsmounts that can't be unmounted:
> > 
> >      umount /mnt/subvol1 # returns -EINVAL
> > 
> >      This retains the invariant that every subvolume is always visible
> >      from the filesystems root, i.e., /mnt will include /mnt/subvol{1,}
> 
> I'd like to go this option. But I still have a question.
> 
> How do we properly unmount a btrfs?
> Do we have some other way to record which subvolume is really mounted
> and which is just those place holder?

So the downside of this really is that this would be custom btrfs
semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
only happens in weird corner cases today:

* mounts inherited during unprivileged mount namespace creation
* locked mounts

Both of which are pretty inelegant and effectively only exist because of
user namespaces. So if we can avoid proliferating such semantics it
would be preferable.

I think it would also be rather confusing for userspace to be presented
with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
anything with.

> > (2) They are proper vfsmounts:
> > 
> >      umount /mnt/subvol1 # succeeds
> > 
> >      This retains standard semantics for userspace about anything that
> >      shows up in /proc/<pid>/mountinfo but means that after
> >      umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> >      the filesystem root /mnt anymore.
> > 
> > Both options can be made to work from a purely technical perspective,
> > I'm asking which one it has to be because it isn't clear just from the
> > snippets in this thread.
> > 
> > One should also point out that if each subvolume is a vfsmount, then say
> > a btrfs filesystems with 1000 subvolumes which is mounted from the root:
> > 
> > mount -t btrfs /dev/sda /mnt
> > 
> > could be exploded into 1000 individual mounts. Which many users might not want.
> 
> Can we make it dynamic? AKA, the btrfs_insert_fs_root() is the perfect
> timing here.

Probably, it would be an automount. Though I would have to recheck that
code to see how exactly that would work but roughly, when you add the
inode for the subvolume you raise S_AUTOMOUNT on it and then you add
.d_automount for btrfs.