Re: Possible bug with open between unshare(CLONE_NEWNS) calls

Qu Wenruo <quwenruo.btrfs@xxxxxxx> · Fri, 17 Jan 2025 07:39:09 +1030

在 2025/1/16 21:16, Christian Brauner 写道:
On Wed, Jan 15, 2025 at 10:56:08AM -0800, Boris Burkov wrote:
Hello,

If we run the following C code:

unshare(CLONE_NEWNS);
int fd = open("/dev/loop0", O_RDONLY)
unshare(CLONE_NEWNS);

Then after the second unshare, the mount hierarchy created by the first
unshare is fully dereferenced and gets torn down, leaving the file
pointed to by fd with a broken dentry.

Specifically, subsequent calls to d_path on its path resolve to
"/loop0". I was able to confirm this with drgn, and it has caused an
unexpected failure in mkosi/systemd-repart attempting to mount a btrfs
filesystem through such an fd, since btrfs uses d_path to resolve the
source device file path fully.

I confirmed that this is definitely due to the first unshare mount
namespace going away by:
1. printks/bpftrace the copy_root path in the kernel
2. rewriting my test program to fork after the first unshare to keep
that namespace referenced. In this case, the fd is not broken after the
second unshare.

My question is:
Is this expected behavior with respect to mount reference counts and
namespace teardown?

If I mount a filesystem and have a running program with an open file
descriptor in that filesystem, I would expect unmounting that filesystem
to fail with EBUSY, so it stands to reason that the automatic unmount
that happens from tearing down the mount namespace of the first unshare
should respect similar semantics and either return EBUSY or at least
have the lazy umount behavior and not wreck the still referenced mount
objects.

If this behavior seems like a bug to people better versed in the
expected behavior of namespaces, I would be happy to work on a fix.

It's expected as Al already said. And is_good_dev_path()
looks pretty hacky...

Wouldn't something like:

bool is_devtmpfs(const struct super_block *sb)
{
         return sb->s_type == &dev_fs_type;
}

and then:

         ret = kern_path(dev_path, 0, &path);
         if (ret)
                 goto out;

	if (is_devtmpfs(path->mnt->mnt_sb))
		// something something

be enough? Or do you specifically need to care where devtmpfs is
mounted? The current check means that anything that mounts devtmpfs
somewhere other than /dev would fail that check.

That above checks looks good.

Of course, any standard Linux distribution will mount devtmpfs at /dev
so it probably won't matter in practice. And contains may make /dev a
tmpfs mount and bind-mount device nodes in from the host's devtmpfs so
that would work too with this check.

In other words, I don't get why the /dev prefix check gets you anything?
If you just verify that the device node is located on devtmpfs you
should be good.

The original problem is that we can get very weird device path, like
'/proc/<pid>/<fd>' or any blockdev node created by the end user, as
mount source, which can cause various problems in mount_info for end users.

Although after v6.8 it looks like there are some other black magics
involved to prevent such block device being passed in.
I tried the same custom block device node, it always resolves to
"/dev/mapper/test-scratch1" in my case (and not even "/dev/dm-3").

However there is still another problem, related to get_canonical_dev_path().

As it still goes d_path(), it will return the path inside the namespace.
Which can be very different from root namespace.

So I'm wondering if we should even bother the device path resolution at
all inside btrfs?
Or the latest fsconfig API is already resolving the path correctly?

Thanks,
Qu