On Thu, Nov 2, 2023 at 7:13 AM Josef Bacik <josef@xxxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 01, 2023 at 10:52:18AM +0100, Christian Brauner wrote:
> > On Wed, Nov 01, 2023 at 07:11:53PM +1030, Qu Wenruo wrote:
> > >
> > >
> > > On 2023/11/1 18:46, Christian Brauner wrote:
> > > > On Tue, Oct 31, 2023 at 10:06:17AM -0700, Christoph Hellwig wrote:
> > > > > On Tue, Oct 31, 2023 at 01:50:46PM +0100, Christian Brauner wrote:
> > > > > > So this is effectively a request for:
> > > > > >
> > > > > > btrfs subvolume create /mnt/subvol1
> > > > > >
> > > > > > to create vfsmounts? IOW,
> > > > > >
> > > > > > mkfs.btrfs /dev/sda
> > > > > > mount /dev/sda /mnt
> > > > > > btrfs subvolume create /mnt/subvol1
> > > > > > btrfs subvolume create /mnt/subvol2
> > > > > >
> > > > > > would create two new vfsmounts that are exposed in /proc/<pid>/mountinfo
> > > > > > afterwards?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > That might be odd. Because these vfsmounts aren't really mounted, no?
> > > > >
> > > > > Why aren't they?
> > > > >
> > > > > > And so you'd be showing potentially hundreds of mounts in
> > > > > > /proc/<pid>/mountinfo that you can't unmount?
> > > > >
> > > > > Why would you not allow them to be unmounted?
> > > > >
> > > > > > And even if you treat them as mounted what would unmounting mean?
> > > > >
> > > > > The code in btrfs_lookup_dentry that does a hand crafted version
> > > > > of the file system / subvolume crossing (the location.type !=
> > > > > BTRFS_INODE_ITEM_KEY one) would not be executed.
> > > >
> > > > So today, when we do:
> > > >
> > > > mkfs.btrfs -f /dev/sda
> > > > mount -t btrfs /dev/sda /mnt
> > > > btrfs subvolume create /mnt/subvol1
> > > > btrfs subvolume create /mnt/subvol2
> > > >
> > > > Then all subvolumes are always visible under /mnt.
> > > > IOW, you can't hide them other than by overmounting or destroying them.
> > > >
> > > > If we make subvolumes vfsmounts then we very likely alter this behavior
> > > > and I see two obvious options:
> > > >
> > > > (1) They are fake vfsmounts that can't be unmounted:
> > > >
> > > > umount /mnt/subvol1 # returns -EINVAL
> > > >
> > > > This retains the invariant that every subvolume is always visible
> > > > from the filesystem's root, i.e., /mnt will include /mnt/subvol{1,2}
> > >
> > > I'd like to go with this option. But I still have a question.
> > >
> > > How do we properly unmount a btrfs?
> > > Do we have some other way to record which subvolume is really mounted
> > > and which is just a placeholder?
> >
> > So the downside of this really is that this would be custom btrfs
> > semantics. Having mounts in /proc/<pid>/mountinfo that you can't unmount
> > only happens in weird corner cases today:
> >
> > * mounts inherited during unprivileged mount namespace creation
> > * locked mounts
> >
> > Both of which are pretty inelegant and effectively only exist because of
> > user namespaces. So if we can avoid proliferating such semantics it
> > would be preferable.
> >
> > I think it would also be rather confusing for userspace to be presented
> > with a bunch of mounts in /proc/<pid>/mountinfo that it can't do
> > anything with.
> >
> > > > (2) They are proper vfsmounts:
> > > >
> > > > umount /mnt/subvol1 # succeeds
> > > >
> > > > This retains standard semantics for userspace about anything that
> > > > shows up in /proc/<pid>/mountinfo, but means that after
> > > > umount /mnt/subvol1 succeeds, /mnt/subvol1 won't be accessible from
> > > > the filesystem root /mnt anymore.
> > > >
> > > > Both options can be made to work from a purely technical perspective;
> > > > I'm asking which one it has to be because it isn't clear just from the
> > > > snippets in this thread.
> > > >
> > > > One should also point out that if each subvolume is a vfsmount, then say
> > > > a btrfs filesystem with 1000 subvolumes which is mounted from the root:
> > > >
> > > > mount -t btrfs /dev/sda /mnt
> > > >
> > > > could be exploded into 1000 individual mounts, which many users might not want.
> > >
> > > Can we make it dynamic? AKA, btrfs_insert_fs_root() is the perfect
> > > timing here.
> >
> > Probably, it would be an automount. Though I would have to recheck that
> > code to see how exactly that would work, but roughly: when you add the
> > inode for the subvolume you raise S_AUTOMOUNT on it and then you add
> > .d_automount for btrfs.
>
> Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks
> it does.
>
> However I ran into some weirdness where I need to support the new mount API, so
> that's what I've been doing since I wandered away from this thread. I should
> have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing on top
> of that.
>
> But I have the same questions as you, Christian. I'm not entirely sure how this
> is supposed to be better. Even if they show up in /proc/mounts, it's not going
> to do anything useful for the applications that don't check /proc/mounts to see
> if they've wandered into a new mount. I also don't quite understand how NFS
> suddenly knows it's wandered into a new mount with a vfsmount.
>

IIUC, the NFS server communicates "same filesystem" vs. "different filesystem"
to the client by means of an FSID. This is not the same as f_fsid from
statfs(); it's the FSID communicated to the nfsd server via /etc/exports, or
guessed by nfsd for a blockdev fs. IIRC, the NFS FSID is 128bit (?).

If we implemented the s_op->get_fsid() method that Jan has suggested [1],
we could make it mean: "nfsd, please use this as the NFS FSID instead of
assuming that all inodes on this sb should be presented to the client as a
uniform FSID".
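[For readers following along: Christian's rough description maps onto existing
VFS hooks. A purely hypothetical sketch, not taken from any posted patch (where
the flag is raised and the callback body are assumptions); NFS client submounts
and autofs use the same mechanism today:]

```c
/* Hypothetical sketch only -- not from a posted patch. */

/* 1) When btrfs instantiates the root inode of a subvolume, mark it as an
 *    automount point so the VFS triggers ->d_automount when a lookup
 *    crosses into it: */
	inode->i_flags |= S_AUTOMOUNT;

/* 2) Provide the automount callback in btrfs' dentry_operations.  It must
 *    return a new vfsmount for the subvolume rooted at path->dentry: */
static struct vfsmount *btrfs_d_automount(struct path *path)
{
	/* Construct and return a mount of the subvolume that shares the
	 * existing superblock -- e.g. via a filesystem context; the
	 * details are exactly what Josef is prototyping. */
	...
}

static const struct dentry_operations btrfs_dentry_operations = {
	.d_automount	= btrfs_d_automount,
};
```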
[1] https://lore.kernel.org/linux-fsdevel/20231025135048.36153-2-amir73il@xxxxxxxxx/

> At this point I'm tired of it being brought up in every conversation where we
> try to expose more information to the users. So I'll write the patches and as
> long as they don't break anything we can merge it, but I don't think it'll make
> a single bit of difference.
>
> We'll be converted to the new mount API tho, so I suppose that's something.

Hooray!

I had just noticed that since Monday, we have a new fs on the block with
multiple subvols on the same sb - bcachefs.

I took a look at bch2_statfs(), bch2_encode_fh() and bch2_getattr() and
I can't help wondering, what's different from btrfs?

Kent,

Is inode->v.i_ino unique for all inodes in a bcachefs sb, and is
inode->ei_inode.bi_inum unique within a subvol?
If so, how is a cloned inode number allocated?

BTW1: bch2_statfs() open codes uuid_to_fsid().

BTW2: Any reason why bcachefs uses bch_sb.uuid and not sb.s_uuid?
If you publish bch_sb.uuid to vfs via sb.s_uuid, you can use bcachefs as a
layer in overlayfs for advanced features like NFS export of overlayfs -
without s_uuid, those features will not be available for overlayfs over
bcachefs.

Thanks,
Amir.