On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote: > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote: > > But at that point we really need to ask if it makes sense to use > > vfsmounts per subvolume in the first place: > > > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts. > > (2) By calling ->getattr() from show_mountinfo() we open the whole > > system up to deadlocks. > > (3) We change btrfs semantics drastically to the point where they need a > > new mount, module, or Kconfig option. > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight > > because you need to create a mount for the subvolume. > > > > So right now, I don't see how we can make this work even if the concept > > doesn't seem necessarily wrong. > > How else do you want to solve it? Crossing a mount point is the > only legitimate boundary for changing st_dev and having a new inode > number space. And we can't fix that retroactively. I think the idea of using vfsmounts for this makes some sense if the goal is to retroactively justify and accommodate the idea that a subvolume is to be treated as equivalent to a separate device. I question that premise though. I think marking them with separate device numbers is bringing us nothing but pain at this point and this solution is basically bending the vfs to make that work somehow. And the worst thing is that I think that treating subvolumes like vfsmounts will hurt vfsmounts more than it will hurt subvolumes. Right now all that vfsmounts technically are is a topological abstraction on top of filesystem objects such as files, directories, sockets, even devices that are exposed as filesystems objects. None of them get to muck with core properties of what a vfsmount is though. Additionally, vfsmount are tied to a superblock and share the device numbers with the superblock they belong to. If we make subvolumes and vfsmounts equivalent we break both properties. And I think that's wrong or at least really ugly. And I already see that the suggested workaround for (2) will somehow end up being stashing device numbers in struct mount or struct vfsmount so we can show it in mountinfo and if that's the case I want to express a proactive nak for that solution. The way I see it is that a subvolume at the end is nothing but a subclass of directories a special one but whatever. I would feel much more comfortable if the two filesystems that expose these objects give us something like STATX_SUBVOLUME that userspace can raise in the request mask of statx(). If userspace requests STATX_SUBVOLUME in the request mask, the two filesystems raise STATX_SUBVOLUME in the statx result mask and then also return the _real_ device number of the superblock and stop exposing that made up device number. This can be accompanied by a vfs ioctl that is identical for both btrfs and bcachefs and returns $whatever unique property to mark the inode space of the subvolume. And then we leave innocent vfs objects alone and we also avoid bringing in all that heavy vfsmount machinery on top of subvolumes.