> Btw I'm working on this, mostly to show Christoph it doesn't do what he thinks > it does. > > However I ran into some weirdness where I need to support the new mount API, so > that's what I've been doing since I wandered away from this thread. I should > have that done tomorrow, and then I was going to do the S_AUTOMOUNT thing ontop > of that. > > But I have the same questions as you Christian, I'm not entirely sure how this > is supposed to be better. Even if they show up in /proc/mounts, it's not going > to do anything useful for the applications that don't check /proc/mounts to see > if they've wandered into a new mount. I also don't quite understand how NFS > suddenly knows it's wandered into a new mount with a vfsmount. So the subvolumes-as-vfsmount solution was implemented already a few years ago. I looked at that patchset and the crucial point in the solution was patch [1]. show_mountinfo() is called under namespace_sem (see fs/namespace.c). That thing is crucial for all mount namespaces, mount propagation and so on. We can't cause IO under that unless we want to allow to trivially deadlock the whole system by tricking us into talking to an unresponsive NFS server or similar. And all vfs_getattr*() flavours can legitimately cause IO even with AT_STATX_DONT_SYNC. So exposing this via /proc/<pid>/mountinfo doesn't work. But that means even if you make it a separate vfsmount you need to epose the device information through another interface. But at that point we really need to ask if it makes sense to use vfsmounts per subvolume in the first place: (1) We pollute /proc/<pid>/mountinfo with a lot of mounts. (2) By calling ->getattr() from show_mountinfo() we open the whole system up to deadlocks. (3) We change btrfs semantics drastically to the point where they need a new mount, module, or Kconfig option. (4) We make (initial) lookup on btrfs subvolumes more heavyweight because you need to create a mount for the subvolume. So right now, I don't see how we can make this work even if the concept doesn't seem necessarily wrong. Even if we were to go through with this and make each subvolume a vfsmount but then don't expose the ->getattr() device numbers in /proc/<pid>/mountinfo but instead add a separate retrieval method via statx() we'd be creating even more confusion for userspace by showing different device numbers in /proc/<pid>/mountinfo than in statx(). [1]: Subject: [PATCH 01/11] VFS: show correct dev num in mountinfo Date: Wed, 28 Jul 2021 08:37:45 +1000 Message-ID: <162742546548.32498.10889023150565429936.stgit@noble.brown> /proc/$PID/mountinfo contains a field for the device number of the filesystem at each mount. This is taken from the superblock ->s_dev field, which is correct for every filesystem except btrfs. A btrfs filesystem can contain multiple subvols which each have a different device number. If (a directory within) one of these subvols is mounted, the device number reported in mountinfo will be different from the device number reported by stat(). This confuses some libraries and tools such as, historically, findmnt. Current findmnt seems to cope with the strangeness. So instead of using ->s_dev, call vfs_getattr_nosec() and use the ->dev provided. As there is no STATX flag to ask for the device number, we pass a request mask for zero, and also ask the filesystem to avoid syncing with any remote service. Signed-off-by: NeilBrown <neilb@xxxxxxx> --- fs/proc_namespace.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 392ef5162655..f342a0231e9e 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -138,10 +138,16 @@ static int show_mountinfo(struct seq_file *m, struct vfsmount *mnt) struct mount *r = real_mount(mnt); struct super_block *sb = mnt->mnt_sb; struct path mnt_path = { .dentry = mnt->mnt_root, .mnt = mnt }; + struct kstat stat; int err; + /* We only want ->dev, and there is no STATX flag for that, + * so ask for nothing and assume we get ->dev + */ + vfs_getattr_nosec(&mnt_path, &stat, 0, AT_STATX_DONT_SYNC); + seq_printf(m, "%i %i %u:%u ", r->mnt_id, r->mnt_parent->mnt_id, - MAJOR(sb->s_dev), MINOR(sb->s_dev)); + MAJOR(stat.dev), MINOR(stat.dev)); if (sb->s_op->show_path) { err = sb->s_op->show_path(m, mnt->mnt_root); if (err)