Re: [PATCH 0/3] fanotify support for btrfs sub-volumes

Christian Brauner <brauner@xxxxxxxxxx> · Mon, 6 Nov 2023 10:52:24 +0100

On Mon, Nov 06, 2023 at 10:03:55AM +0100, Jan Kara wrote:
> On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> > On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > > But at that point we really need to ask if it makes sense to use
> > > > vfsmounts per subvolume in the first place:
> > > > 
> > > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > > >     system up to deadlocks.
> > > > (3) We change btrfs semantics drastically to the point where they need a
> > > >     new mount, module, or Kconfig option.
> > > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > > >     because you need to create a mount for the subvolume.
> > > > 
> > > > So right now, I don't see how we can make this work even if the concept
> > > > doesn't seem necessarily wrong.
> > > 
> > > How else do you want to solve it?  Crossing a mount point is the
> > > only legitimate boundary for changing st_dev and having a new inode
> > > number space.  And we can't fix that retroactively.
> > 
> > I think the idea of using vfsmounts for this makes some sense if the
> > goal is to retroactively justify and accommodate the idea that a
> > subvolume is to be treated as equivalent to a separate device.
> > 
> > I question that premise though. I think marking them with separate
> > device numbers is bringing us nothing but pain at this point and this
> > solution is basically bending the vfs to make that work somehow.
> > 
> > And the worst thing is that I think that treating subvolumes like
> > vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> > 
> > Right now all that vfsmounts technically are is a topological
> > abstraction on top of filesystem objects such as files, directories,
> > sockets, even devices that are exposed as filesystems objects. None of
> > them get to muck with core properties of what a vfsmount is though.
> > 
> > Additionally, vfsmount are tied to a superblock and share the device
> > numbers with the superblock they belong to.
> > 
> > If we make subvolumes and vfsmounts equivalent we break both properties.
> > And I think that's wrong or at least really ugly.
> > 
> > And I already see that the suggested workaround for (2) will somehow end
> > up being stashing device numbers in struct mount or struct vfsmount so
> > we can show it in mountinfo and if that's the case I want to express a
> > proactive nak for that solution.
> > 
> > The way I see it is that a subvolume at the end is nothing but a
> > subclass of directories a special one but whatever.
> 
> As far as I understand the problem, subvolumes indeed seem closer to
> special directories than anything else. They slightly resemble what ext4 &
> xfs implement with project quotas (were each inode can have additional
> recursively inherited "project id"). What breaks this "special directory"
> kind of view for btrfs is that subvolumes have overlapping inode numbers.
> Since we don't seem to have a way of getting out of the current situation
> in a "seamless" way anyway, I wonder if implementing a btrfs feature to
> provide unique inode numbers across all subvolumes would not be the
> cleanest way out...
> 
> > I would feel much more comfortable if the two filesystems that expose
> > these objects give us something like STATX_SUBVOLUME that userspace can
> > raise in the request mask of statx().
> > 
> > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > return the _real_ device number of the superblock and stop exposing that
> > made up device number.
> > 
> > This can be accompanied by a vfs ioctl that is identical for both btrfs
> > and bcachefs and returns $whatever unique property to mark the inode
> > space of the subvolume.
> > 
> > And then we leave innocent vfs objects alone and we also avoid
> > bringing in all that heavy vfsmount machinery on top of subvolumes.
> 
> Well, but this requires application knowledge of a new type of object - a
> subvolume. So you'd have to teach all applications that try to identify
> whether two "filenames" point to the same object or not about this and that
> seems like a neverending story. Hence either we will live with fake devices

But that is what's happening today already, no? All tools need to figure
out that they are on a btrfs subvolume somehow whenever they want to do
something meaningful to it. systemd code is full of special btrfs
handling code.

I don't understand why we're bending and breaking ourselves to somehow
make a filesystem specific, special object fit into standard apis when
it clearly breaks standard apis?