On Mon, Nov 06, 2023 at 10:03:55AM +0100, Jan Kara wrote:
> On Fri 03-11-23 16:47:02, Christian Brauner wrote:
> > On Fri, Nov 03, 2023 at 07:28:42AM -0700, Christoph Hellwig wrote:
> > > On Thu, Nov 02, 2023 at 12:07:47PM +0100, Christian Brauner wrote:
> > > > But at that point we really need to ask if it makes sense to use
> > > > vfsmounts per subvolume in the first place:
> > > >
> > > > (1) We pollute /proc/<pid>/mountinfo with a lot of mounts.
> > > > (2) By calling ->getattr() from show_mountinfo() we open the whole
> > > >     system up to deadlocks.
> > > > (3) We change btrfs semantics drastically to the point where they need a
> > > >     new mount, module, or Kconfig option.
> > > > (4) We make (initial) lookup on btrfs subvolumes more heavyweight
> > > >     because you need to create a mount for the subvolume.
> > > >
> > > > So right now, I don't see how we can make this work even if the concept
> > > > doesn't seem necessarily wrong.
> > >
> > > How else do you want to solve it? Crossing a mount point is the
> > > only legitimate boundary for changing st_dev and having a new inode
> > > number space. And we can't fix that retroactively.
> >
> > I think the idea of using vfsmounts for this makes some sense if the
> > goal is to retroactively justify and accommodate the idea that a
> > subvolume is to be treated as equivalent to a separate device.
> >
> > I question that premise though. I think marking them with separate
> > device numbers is bringing us nothing but pain at this point and this
> > solution is basically bending the vfs to make that work somehow.
> >
> > And the worst thing is that I think that treating subvolumes like
> > vfsmounts will hurt vfsmounts more than it will hurt subvolumes.
> >
> > Right now all that vfsmounts technically are is a topological
> > abstraction on top of filesystem objects such as files, directories,
> > sockets, even devices that are exposed as filesystem objects. None of
> > them get to muck with core properties of what a vfsmount is though.
> >
> > Additionally, vfsmounts are tied to a superblock and share the device
> > numbers with the superblock they belong to.
> >
> > If we make subvolumes and vfsmounts equivalent we break both properties.
> > And I think that's wrong or at least really ugly.
> >
> > And I already see that the suggested workaround for (2) will somehow end
> > up being stashing device numbers in struct mount or struct vfsmount so
> > we can show it in mountinfo and if that's the case I want to express a
> > proactive nak for that solution.
> >
> > The way I see it is that a subvolume at the end is nothing but a
> > subclass of directories, a special one, but whatever.
>
> As far as I understand the problem, subvolumes indeed seem closer to
> special directories than anything else. They slightly resemble what ext4 &
> xfs implement with project quotas (where each inode can have additional
> recursively inherited "project id"). What breaks this "special directory"
> kind of view for btrfs is that subvolumes have overlapping inode numbers.
> Since we don't seem to have a way of getting out of the current situation
> in a "seamless" way anyway, I wonder if implementing a btrfs feature to
> provide unique inode numbers across all subvolumes would not be the
> cleanest way out...
>
> > I would feel much more comfortable if the two filesystems that expose
> > these objects give us something like STATX_SUBVOLUME that userspace can
> > raise in the request mask of statx().
> >
> > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > return the _real_ device number of the superblock and stop exposing that
> > made up device number.
> >
> > This can be accompanied by a vfs ioctl that is identical for both btrfs
> > and bcachefs and returns $whatever unique property to mark the inode
> > space of the subvolume.
> >
> > And then we leave innocent vfs objects alone and we also avoid
> > bringing in all that heavy vfsmount machinery on top of subvolumes.
>
> Well, but this requires application knowledge of a new type of object - a
> subvolume. So you'd have to teach all applications that try to identify
> whether two "filenames" point to the same object or not about this and that
> seems like a neverending story. Hence either we will live with fake devices

But that is what's happening today already, no? All tools need to figure
out that they are on a btrfs subvolume somehow whenever they want to do
something meaningful to it. systemd code is full of special btrfs
handling code. I don't understand why we're bending and breaking
ourselves to somehow make a filesystem-specific, special object fit into
standard APIs when it clearly breaks standard APIs?
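
For reference, the check Jan alludes to (whether two "filenames" point to
the same object) is conventionally done in userspace by comparing the
(st_dev, st_ino) pair returned by stat(). A minimal Python sketch of that
convention follows. The helper name same_object is made up for
illustration, and nothing here is btrfs-specific; it only shows why
overlapping inode numbers across subvolumes need a distinct st_dev, or
some other discriminator, for this kind of check to keep working.

```python
import os
import tempfile


def same_object(path_a, path_b):
    """Return True if both paths resolve to the same (st_dev, st_ino) pair.

    Hypothetical helper for illustration; it mirrors the usual userspace
    identity check and knows nothing about subvolumes.
    """
    a = os.stat(path_a)
    b = os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "file")
        hardlink = os.path.join(d, "link")
        other = os.path.join(d, "other")
        open(original, "w").close()
        open(other, "w").close()
        # A hard link is a second name for the same inode.
        os.link(original, hardlink)
        print(same_object(original, hardlink))
        print(same_object(original, other))
```

If two subvolumes shared one st_dev while still reusing inode numbers, a
check like this could report two unrelated files as identical, which is
the failure mode the fake device numbers currently paper over.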