On Mon, Nov 06, 2023 at 04:29:23AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 06, 2023 at 11:03:37AM +0100, Christian Brauner wrote:
> > But why do we care?
> > Current code already does need to know it is on a btrfs subvolume. They
> > all know that btrfs subvolumes are special.
>
> "they all know" is a bit vague.  How do you know "all" code knows?

Granted, an over-generalization but not in any way different from
claiming that currently no one needs to know about btrfs subvolumes or
that the proposed vfsmount solution will make it magically so that no
one needs to care anymore.

Tools will have to change either way is my point. And a lot of tools do
already handle subvolumes specially exactly because of the non-unique
inode situation. And if they don't they still can get confused by seeing
st_dev numbers they can't associate with a filesystem.

> > They will need to know that
> > btrfs subvolumes are special in the future even if they were vfsmounts.
> > They would likely end up with another kind of confusion because suddenly
> > vfsmounts have device numbers that aren't associated with the superblock
> > that vfsmount belongs to.
>
> Let's take a step back.  Posix says st_ino is unique for a given
> st_dev, and per posix a mount point is defined as any file that
> has a different st_dev from the parent.  So by the Posix definition
> btrfs subvolume roots are mount points, which is an obvious clash
> with the Linux definition based on vfsmounts.

3.229 Mount Point
Either the system root directory or a directory for which the st_dev
field of structure stat differs from that of its parent directory.

I think that's just an argument against mapping subvolumes to
vfsmounts. Because bind-mounts don't change the device number - and they
very much shouldn't.
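
To make the difference concrete, here is a minimal userspace sketch of the POSIX definition quoted above. The helper name is_posix_mount_point is made up for illustration, and it deliberately looks at nothing but st_dev. A bind mount of a directory from the same filesystem keeps its st_dev and so would not be reported by it, while a btrfs subvolume root would.

```python
import os

def is_posix_mount_point(path):
    """Hypothetical helper: POSIX 3.229 test, based on st_dev only."""
    path = os.path.realpath(path)
    parent = os.path.dirname(path)
    if parent == path:
        # The system root directory is a mount point by definition.
        return True
    if not os.path.isdir(path):
        # The definition only covers directories.
        return False
    # A directory whose st_dev differs from that of its parent.
    return os.stat(path).st_dev != os.stat(parent).st_dev
```

Note that this is not what the kernel means by a mount: it never consults the mount table, which is exactly the clash between the two definitions.
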
>
> > > > If userspace requests STATX_SUBVOLUME in the request mask, the two
> > > > filesystems raise STATX_SUBVOLUME in the statx result mask and then also
> > > > return the _real_ device number of the superblock and stop exposing that
> > > > made up device number.
> > >
> > > What is a "real" device number?
> >
> > The device number of the superblock of the btrfs filesystem and not some
> > made-up device number.
>
> The block device st_dev is just as made up.
>
> > I care about not making a btrfs specific problem the vfs's problem by
> > hoisting that whole problem space a level up by mapping subvolumes to
> > vfsmounts.
>
> While I'd love to fix it, and even more not have more of this
> crap sneak in (*cough* bcachefs, *cough*).  I'm ok with that stance.
> But that also means we can't let this creep into the vfs by other
> means, which is what started the thread.

The thing is I'm not even sure there's anything to fix. This discussion
started with btrfs maybe getting an alternative way to uniquify an inode
independent of st_dev. I'm not sure that is such a massive problem.

If we give both btrfs and bcachefs a single flag in statx() that allows
_interested_ userspace to query whether a file is located on a subvolume
that shouldn't be a problem (we have STATX_ATTR_* which identifies
additional properties that are restricted to few filesystems). And all
the specific gobbledygook can be implemented as an ioctl() - ideally
both btrfs and bcachefs agree on something - that the vfs doesn't have
to care about at all.

I genuinely don't care if they report a fake st_dev from stat(). I
genuinely _do_ care that we don't make vfsmounts privy to this.

Let alone that automounts are a giant pain. Not just do they iirc allow
to create shadow mounts, they also interact with namespace and container
creation.
If you spawn thousands of containers each with a private mount
namespace - which is the default - you now trigger automounts in
thousands of containers when triggering a lookup on btrfs. If you have
mount propagation turned on each automount may also propagate into god
knows how many other mount namespaces. That's just nasty.

IOW, making subvolumes vfsmounts will also have wider semantic
implications for using btrfs as a filesystem.
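
For reference, the non-unique inode situation mentioned earlier matters to tools because many of them identify a file by the pair (st_dev, st_ino), for example to detect hard links. A minimal sketch of that pattern follows; file_identity and find_hardlink_groups are hypothetical names, not taken from any particular tool. It assumes st_ino is unique per st_dev. If every subvolume reported the same st_dev while inode numbers stayed unique only within a subvolume, two unrelated files could end up with the same key.

```python
import os

def file_identity(path):
    """Hypothetical helper: the (st_dev, st_ino) key many tools rely on."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

def find_hardlink_groups(paths):
    """Group paths that share an identity key, i.e. look like hard links."""
    groups = {}
    for p in paths:
        groups.setdefault(file_identity(p), []).append(p)
    # Only keys seen more than once are treated as the same file.
    return [g for g in groups.values() if len(g) > 1]
```
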