Re: [PATCH 0/3] fanotify support for btrfs sub-volumes

Josef Bacik <josef@xxxxxxxxxxxxxx> · Fri, 27 Oct 2023 09:17:26 -0400

On Thu, Oct 26, 2023 at 10:46:01PM -0700, Christoph Hellwig wrote:
> I think you're missing the point.  A bunch of statx fields might be
> useful, but they are not solving the problem.  What you need is
> a separate vfsmount per subvolume so that userspace sees when it
> is crossing into it.  We probably can't force this onto existing
> users, so it needs a mount, or even better on-disk option but without
> that we're not getting anywhere.
> 

We have this same discussion every time, and every time you stop responding
after I point out the problems with it.

A per-subvolume vfsmount means that /proc/mounts and /proc/$PID/mountinfo become
insanely dumb.  I've got millions of machines in this fleet with thousands of
subvolumes.  One of our workloads fires up several containers per task and runs
multiple tasks per machine, so on the order of 10-20k subvolumes.

So now I've got thousands of entries in /proc/mounts, and literally every
system-related tool parses /proc/mounts every 4 nanoseconds, so now I'm
significantly contributing to global warming from the massive amount of CPU
that is burned parsing this stupid file.
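
To make the scaling concern concrete, here is a minimal sketch of what a
typical tool does with mountinfo-style text.  It is illustrative only: the
parse_mountinfo() and synthetic_mountinfo() helpers, the /containers/svN
paths, and the device numbers are all made up for the example and are not
taken from any real tool or machine.  The point is just that every parse is
linear in the number of mounts, so one vfsmount per subvolume multiplies the
work for every consumer of the file.

```python
def parse_mountinfo(text):
    """Parse mountinfo-style text into a list of dicts, one per mount."""
    mounts = []
    for line in text.splitlines():
        # Fields left of " - " are mount ID, parent ID, major:minor, root,
        # mount point, options, then optional fields.  Fields right of it
        # are filesystem type, mount source, and super options.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        head = left.split()
        tail = right.split()
        if len(head) < 6 or len(tail) < 2:
            continue
        mounts.append({
            "mount_id": int(head[0]),
            "parent_id": int(head[1]),
            "dev": head[2],
            "root": head[3],
            "mount_point": head[4],
            "fstype": tail[0],
            "source": tail[1],
        })
    return mounts


def synthetic_mountinfo(n_subvols, base_id=100):
    """Build fake mountinfo text with one entry per hypothetical subvolume."""
    lines = [
        "25 1 0:23 / / rw,relatime shared:1 - btrfs /dev/sda1 "
        "rw,subvolid=5,subvol=/"
    ]
    for i in range(n_subvols):
        lines.append(
            f"{base_id + i} 25 0:{1000 + i} /sv{i} /containers/sv{i} "
            f"rw,relatime shared:1 - btrfs /dev/sda1 "
            f"rw,subvolid={256 + i},subvol=/sv{i}"
        )
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Each call walks every line, so 20k subvolume mounts means 20k+ lines
    # split and scanned by every tool, on every poll.
    for n in (10, 1000, 20000):
        entries = parse_mountinfo(synthetic_mountinfo(n))
        print(n, "subvolumes ->", len(entries), "entries to parse")
```

With subvolumes as plain directories, the same file would carry a single
btrfs entry regardless of how many subvolumes exist underneath it.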

Additionally, now you're ending up with potentially sensitive information being
leaked through /proc/mounts that you didn't expect to be leaked before.  I've
got users complaining to me because "/home/john/twilight_fanfic" showed up in
their /proc/mounts.

And then there's the expiry thing.  Right now they're just directories, so
reclaim works like it does for anything else.  With automounts they have to
expire at some point, which makes them much heavier weight than we want to sign
up for.
Who knows what sort of scalability issues we'll run into.

There were some internal related things that went wrong with this when I tried
it a decade ago.  I'm sure I could fix that by changing vfsmount, so I don't see
that as a real blocker, but it's not as straightforward as just doing it.

I have to support this file system in the real world, with real-world stupidity
happening that I can't control.  I wholeheartedly agree that the statx fields
are not a direct fix; they're a compromise.  It's a way forward that lets the
users who care about the distinction get the information they need to make
better decisions about what to do when they run into btrfs's weirdness.  It
doesn't solve the st_dev problem today, or even for a couple of years, but it
gives us a way to eventually change the st_dev thing.  Thanks,

Josef