On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote: > Hello, > > If we run the following C code: > > unshare(CLONE_NEWNS); > int fd = open("/dev/loop0", O_RDONLY) > unshare(CLONE_NEWNS); > > Then after the second unshare, the mount hierarchy created by the first > unshare is fully dereferenced and gets torn down, leaving the file > pointed to by fd with a broken dentry. > > Specifically, subsequent calls to d_path on its path resolve to > "/loop0". I was able to confirm this with drgn, and it has caused an > unexpected failure in mkosi/systemd-repart attempting to mount a btrfs > filesystem through such an fd, since btrfs uses d_path to resolve the > source device file path fully. > > I confirmed that this is definitely due to the first unshare mount > namespace going away by: > 1. printks/bpftrace the copy_root path in the kernel > 2. rewriting my test program to fork after the first unshare to keep > that namespace referenced. In this case, the fd is not broken after the > second unshare. > > > My question is: > Is this expected behavior with respect to mount reference counts and > namespace teardown? > > If I mount a filesystem and have a running program with an open file > descriptor in that filesystem, I would expect unmounting that filesystem > to fail with EBUSY, so it stands to reason that the automatic unmount > that happens from tearing down the mount namespace of the first unshare > should respect similar semantics and either return EBUSY or at least > have the lazy umount behavior and not wreck the still referenced mount > objects. > > If this behavior seems like a bug to people better versed in the > expected behavior of namespaces, I would be happy to work on a fix. It's expected as Al already said. And is_good_dev_path() looks pretty hacky... Wouldn't something like: bool is_devtmpfs(const struct super_block *sb) { return sb->s_type == &dev_fs_type; } and then: ret = kern_path(dev_path, 0, &path); if (ret) goto out; if (is_devtmpfs(path->mnt->mnt_sb)) // something something be enough? Or do you specifically need to care where devtmpfs is mounted? The current check means that anything that mounts devtmpfs somewhere other than /dev would fail that check. Of course, any standard Linux distribution will mount devtmpfs at /dev so it probably won't matter in practice. And contains may make /dev a tmpfs mount and bind-mount device nodes in from the host's devtmpfs so that would work too with this check. In other words, I don't get why the /dev prefix check gets you anything? If you just verify that the device node is located on devtmpfs you should be good.