Re: [PATCH 0/3] fanotify support for btrfs sub-volumes

Josef Bacik <josef@xxxxxxxxxxxxxx> · Thu, 2 Nov 2023 01:13:49 -0400

On Wed, Nov 01, 2023 at 10:52:18AM +0100, Christian Brauner wrote:
> On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> > 
> > 
> > On 2023/11/1 18:46, Christian Brauner wrote:
> > > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > > So this is effectively a request for:
> > > > > 
> > > > > btrfs subvolume create /mnt/subvol1
> > > > > 
> > > > > to create vfsmounts? IOW,
> > > > > 
> > > > > mkfs.btrfs /dev/sda
> > > > > mount /dev/sda /mnt
> > > > > btrfs subvolume create /mnt/subvol1
> > > > > btrfs subvolume create /mnt/subvol2
> > > > > 
> > > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > > afterwards?
> > > > 
> > > > Yes.
> > > > 
> > > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > > 
> > > > Why aren't they?
> > > > 
> > > > > And so you'd be showing potentially hundreds of mounts in
> > > > > /proc/<pid>/mountinfo that you can't unmount?
> > > > 
> > > > Why would you not allow them to be unmounted?
> > > > 
> > > > > And even if you treat them as mounted what would unmounting mean?
> > > > 
> > > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > > of the file system / subvolume crossing (the location.type !=
> > > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > > 
> > > So today, when we do:
> > > 
> > > mkfs.btrfs -f /dev/sda
> > > mount -t btrfs /dev/sda /mnt
> > > btrfs subvolume create /mnt/subvol1
> > > btrfs subvolume create /mnt/subvol2
> > > 
> > > Then all subvolumes are always visible under /mnt.
> > > IOW, you can't hide them other than by overmounting or destroying them.
> > > 
> > > If we make subvolumes vfsmounts then we very likely alter this behavior
> > > and I see two obvious options:
> > > 
> > > (1) They are fake vfsmounts that can't be unmounted:
> > > 
> > >      umount /mnt/subvol1 # returns -EINVAL
> > > 
> > >      This retains the invariant that every subvolume is always visible
> > >      from the filesystems root, i.e., /mnt will include /mnt/subvol{1,}
> > 
> > I'd like to go this option. But I still have a question.
> > 
> > How do we properly unmount a btrfs?
> > Do we have some other way to record which subvolume is really mounted
> > and which is just those place holder?
> 
> So the downside of this really is that this would be custom btrfs
> semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
> only happens in weird corner cases today:
> 
> * mounts inherited during unprivileged mount namespace creation
> * locked mounts
> 
> Both of which are pretty inelegant and effectively only exist because of
> user namespaces. So if we can avoid proliferating such semantics it
> would be preferable.
> 
> I think it would also be rather confusing for userspace to be presented
> with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
> anything with.
> 
> > > (2) They are proper vfsmounts:
> > > 
> > >      umount /mnt/subvol1 # succeeds
> > > 
> > >      This retains standard semantics for userspace about anything that
> > >      shows up in /proc/<pid>/mountinfo but means that after
> > >      umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> > >      the filesystem root /mnt anymore.
> > > 
> > > Both options can be made to work from a purely technical perspective,
> > > I'm asking which one it has to be because it isn't clear just from the
> > > snippets in this thread.
> > > 
> > > One should also point out that if each subvolume is a vfsmount, then say
> > > a btrfs filesystems with 1000 subvolumes which is mounted from the root:
> > > 
> > > mount -t btrfs /dev/sda /mnt
> > > 
> > > could be exploded into 1000 individual mounts. Which many users might not want.
> > 
> > Can we make it dynamic? AKA, the btrfs_insert_fs_root() is the perfect
> > timing here.
> 
> Probably, it would be an automount. Though I would have to recheck that
> code to see how exactly that would work but roughly, when you add the
> inode for the subvolume you raise S_AUTOMOUNT on it and then you add
> .d_automount for btrfs.

Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks
it does.

However I ran into some weirdness where I need to support the new mount API, so
that's what I've been doing since I wandered away from this thread.  I should
have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing ontop
of that.

But I have the same questions as you Christian, I'm not entirely sure how this
is supposed to be better.  Even if they show up in /proc/mounts, it's not going
to do anything useful for the applications that don't check /proc/mounts to see
if they've wandered into a new mount.  I also don't quite understand how NFS
suddenly knows it's wandered into a new mount with a vfsmount.

At this point I'm tired of it being brought up in every conversation where we
try to expose more information to the users.  So I'll write the patches and as
long as they don't break anything we can merge it, but I don't think it'll make
a single bit of difference.

We'll be converted to the new mount API tho, so I suppose that's something.
Thanks,

Josef