Re: [PATCH 0/3] fanotify support for btrfs sub-volumes

Jan Kara <jack@xxxxxxx> · Mon, 6 Nov 2023 10:03:55 +0100

On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > But at that point we really need to ask if it makes sense to use
> > > vfsmounts per subvolume in the first place:
> > > 
> > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > >     system up to deadlocks.
> > > (3) We change btrfs semantics drastically to the point where they need a
> > >     new mount, module, or Kconfig option.
> > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > >     because you need to create a mount for the subvolume.
> > > 
> > > So right now, I don't see how we can make this work even if the concept
> > > doesn't seem necessarily wrong.
> > 
> > How else do you want to solve it?  Crossing a mount point is the
> > only legitimate boundary for changing st_dev and having a new inode
> > number space.  And we can't fix that retroactively.
> 
> I think the idea of using vfsmounts for this makes some sense if the
> goal is to retroactively justify and accommodate the idea that a
> subvolume is to be treated as equivalent to a separate device.
> 
> I question that premise though. I think marking them with separate
> device numbers is bringing us nothing but pain at this point and this
> solution is basically bending the vfs to make that work somehow.
> 
> And the worst thing is that I think that treating subvolumes like
> vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> 
> Right now all that vfsmounts technically are is a topological
> abstraction on top of filesystem objects such as files, directories,
> sockets, even devices that are exposed as filesystems objects. None of
> them get to muck with core properties of what a vfsmount is though.
> 
> Additionally, vfsmount are tied to a superblock and share the device
> numbers with the superblock they belong to.
> 
> If we make subvolumes and vfsmounts equivalent we break both properties.
> And I think that's wrong or at least really ugly.
> 
> And I already see that the suggested workaround for (2) will somehow end
> up being stashing device numbers in struct mount or struct vfsmount so
> we can show it in mountinfo and if that's the case I want to express a
> proactive nak for that solution.
> 
> The way I see it is that a subvolume at the end is nothing but a
> subclass of directories a special one but whatever.

As far as I understand the problem, subvolumes indeed seem closer to
special directories than anything else. They slightly resemble what ext4 &
xfs implement with project quotas (were each inode can have additional
recursively inherited "project id"). What breaks this "special directory"
kind of view for btrfs is that subvolumes have overlapping inode numbers.
Since we don't seem to have a way of getting out of the current situation
in a "seamless" way anyway, I wonder if implementing a btrfs feature to
provide unique inode numbers across all subvolumes would not be the
cleanest way out...

> I would feel much more comfortable if the two filesystems that expose
> these objects give us something like STATX_SUBVOLUME that userspace can
> raise in the request mask of statx().
> 
> If userspace requests STATX_SUBVOLUME in the request mask, the two
> filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> return the _real_ device number of the superblock and stop exposing that
> made up device number.
> 
> This can be accompanied by a vfs ioctl that is identical for both btrfs
> and bcachefs and returns $whatever unique property to mark the inode
> space of the subvolume.
> 
> And then we leave innocent vfs objects alone and we also avoid
> bringing in all that heavy vfsmount machinery on top of subvolumes.

Well, but this requires application knowledge of a new type of object - a
subvolume. So you'd have to teach all applications that try to identify
whether two "filenames" point to the same object or not about this and that
seems like a neverending story. Hence either we will live with fake devices
on btrfs forever or we need to find some other solution to "inode numbers
across subvolumes overlap" problem within "standard unix" APIs.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR