Re: Possible bug with open between unshare(CLONE_NEWNS) calls

Christian Brauner <brauner@xxxxxxxxxx> · Mon, 20 Jan 2025 16:37:54 +0100

On Fri, Jan 17, 2025 at 07:39:09AM +1030, Qu Wenruo wrote:
> 
> 
> 在 2025/1/16 21:16, Christian Brauner 写道:
> > On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
> > > Hello,
> > > 
> > > If we run the following C code:
> > > 
> > > unshare(CLONE_NEWNS);
> > > int fd = open("/dev/loop0", O_RDONLY)
> > > unshare(CLONE_NEWNS);
> > > 
> > > Then after the second unshare, the mount hierarchy created by the first
> > > unshare is fully dereferenced and gets torn down, leaving the file
> > > pointed to by fd with a broken dentry.
> > > 
> > > Specifically, subsequent calls to d_path on its path resolve to
> > > "/loop0". I was able to confirm this with drgn, and it has caused an
> > > unexpected failure in mkosi/systemd-repart attempting to mount a btrfs
> > > filesystem through such an fd, since btrfs uses d_path to resolve the
> > > source device file path fully.
> > > 
> > > I confirmed that this is definitely due to the first unshare mount
> > > namespace going away by:
> > > 1. printks/bpftrace the copy_root path in the kernel
> > > 2. rewriting my test program to fork after the first unshare to keep
> > > that namespace referenced. In this case, the fd is not broken after the
> > > second unshare.
> > > 
> > > 
> > > My question is:
> > > Is this expected behavior with respect to mount reference counts and
> > > namespace teardown?
> > > 
> > > If I mount a filesystem and have a running program with an open file
> > > descriptor in that filesystem, I would expect unmounting that filesystem
> > > to fail with EBUSY, so it stands to reason that the automatic unmount
> > > that happens from tearing down the mount namespace of the first unshare
> > > should respect similar semantics and either return EBUSY or at least
> > > have the lazy umount behavior and not wreck the still referenced mount
> > > objects.
> > > 
> > > If this behavior seems like a bug to people better versed in the
> > > expected behavior of namespaces, I would be happy to work on a fix.
> > 
> > It's expected as Al already said. And is_good_dev_path()
> > looks pretty hacky...
> > 
> > Wouldn't something like:
> > 
> > bool is_devtmpfs(const struct super_block *sb)
> > {
> >          return sb->s_type == &dev_fs_type;
> > }
> > 
> > and then:
> > 
> >          ret = kern_path(dev_path, 0, &path);
> >          if (ret)
> >                  goto out;
> > 
> > 	if (is_devtmpfs(path->mnt->mnt_sb))
> > 		// something something
> > 
> > be enough? Or do you specifically need to care where devtmpfs is
> > mounted? The current check means that anything that mounts devtmpfs
> > somewhere other than /dev would fail that check.
> 
> That above checks looks good.
> 
> > 
> > Of course, any standard Linux distribution will mount devtmpfs at /dev
> > so it probably won't matter in practice. And contains may make /dev a
> > tmpfs mount and bind-mount device nodes in from the host's devtmpfs so
> > that would work too with this check.
> > 
> > In other words, I don't get why the /dev prefix check gets you anything?
> > If you just verify that the device node is located on devtmpfs you
> > should be good.
> 
> The original problem is that we can get very weird device path, like
> '/proc/<pid>/<fd>' or any blockdev node created by the end user, as
> mount source, which can cause various problems in mount_info for end users.
> 
> Although after v6.8 it looks like there are some other black magics
> involved to prevent such block device being passed in.
> I tried the same custom block device node, it always resolves to
> "/dev/mapper/test-scratch1" in my case (and not even "/dev/dm-3").
> 
> 
> However there is still another problem, related to get_canonical_dev_path().
> 
> As it still goes d_path(), it will return the path inside the namespace.
> Which can be very different from root namespace.

I consider the source device as shown in mountinfo a hint and not more.
It's entirely possible that someone does:

mount /dev/sdd4 /mnt
mv /dev/sdd4 /dev/foo
mv /dev/sdd3 /dev/sdd4

That just usually doesn't happen because userspace isn't generally
stupid and devtmpfs device nodes are mostly managed by the kernel.

But in containers you can easily have an equivalent scenario and
whatever shows up in /dev in the container can be something completely
different than what you see in mountinfo.

So really, relying on the path name is overall pretty useless.

These mismatches between mountinfo and whatever is in /dev isn't
anything new. Containers may use devpts devices that belong to the host
devpts instance while also having a separate devpts instance mounted in
the container. And the device names just accidently match.

The container/host needs to take care to validate that the provided pty
device it uses does actually belong to the devpts instance of the
container/host and not to another instance to avoid being tricked into
opening and allocating device nodes in another devpts instance.

> 
> So I'm wondering if we should even bother the device path resolution at
> all inside btrfs?

I think that's misguided and will always lead to weird issues.

> Or the latest fsconfig API is already resolving the path correctly?

It wouldn't help with the above examples.