Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Nov 29, 2023 at 10:38:48AM -0600, Rob Landley wrote:
Because nobody's ever wanted to fix chroot() so mkdir("sub", 0777);
chroot("sub"); chdir("../../../../.."); chroot("."); wouldn't escape it

I
would have thought you could use "mount --move . /" to nerf the cd ../../.. but
for some reason it didn't work (I forget why) and nobody wanted to fix that either.

Actually move mounting the desired new root over the top of the old does mitigate the chroot & chdir attack. The main reason, I believe, that the runtime maintainers don't like that option is that, despite being "inaccessible", the old mount tree still exists in the container's mount namespace. This has led to issues such as the sysfs/procfs issue [1]: In summary, that attack worked by a process within a container creating a new userns, and giving that CAP_SYS_ADMIN.

In such cases, the kernel had protections in place to ensure that, even with the SYS_ADMIN capability, the process in the new userns wasn't allowed to mount proc or sysfs, unless a fully visible mount of proc/sysfs already exists in the process' mount namespace.

There was a bug in the kernel's visibility check - it checked if each instance of proc/sysfs in the mount namespace had been over mounted, or any of this subdirectories had, but forgot to check if any of its root directories had. This resulted in the original root's /proc / /sys mounts counting as visible, even though they weren't, which allowed the child userns to mount a fully unmasked instance and gain access to things it shouldn't.

Now, this was fixed in 7e96c1b0e0f495 however my assumption, and I don't want to speak on behalf of all runtime maintainers here, is that the advice to prefer pivot_root is because of the increased risk of bugs like these. When using pivot_root, the old root is able to be completely unmounted from the container's mount namespace after the pivot which, from a security perspective, gives better peace of mind.

There is at least one other fringe exploit that I am aware of when running in containers not using pivot root - this involves process 1 within a container unmounting its root with MNT_DETACH. While this doesn't always allow that process itself to break out fully, it does allow subsequent calls to exec within the container to leak information about the host's root file system. This would not occur with pivot_root.

[1]: https://github.com/opencontainers/runc/pull/1962

If you don't use rootfs you don't have to empty it, yes.

The point I meant was that this brings the initramfs flow in line with the other root approaches: for initrd, kernel handled root= mounts, and initramfs switch_root setups, rootfs exists because it has to at the top of the stack. For initramfs embedded systems, rootfs exists because it is the root - embedded linux actually using the rootfs as a root is the outlying behaviour.

You could use an old-style initrd which would be mounted over the root
filesystem and which you could switch_root away from and then unmount.

You could, but isn't initramfs a more modern way to pack files than the initrd? And is it not reasonable to bring (or at least give the option for) the initramfs flow to be a bit more like the initrd flow? (Ie, with an empty rootfs).

pivot_root() could actually perform its as-designed function, although last I
checked it wasn't fully container-aware so tended to have fairly awkward global
impact if you ran it inside a container without being VERY careful. (Maybe it's
been fixed since?)

Most container runtimes that I am aware of would run a container within their own mount namespace so pivot_root should be safe from the rest of the system's point of view. Indeed pivot_root is the preferred option for container runtimes but cannot be used when running directly from rootfs.


You could also have your own tar.xz in rootfs with a tiny busybox/toybox root to
extract it into the subdir so "cp -ax" didn't have a 2X memory high water mark.
You could even have a little static binary to call so you don't even need a
shell. Off the top of my head:

void main(void)
{
 mkdir("blah", 0777);
 mount("newroot", "blah", "tmpfs", 0, "noswap,size=37%,huge=within_size");
 if (!fork()) exec("tar", "tar", "xpC", "blah", "blah.txz", NULL);
 else wait();
 if (!fork()) exec("rm", "rm", "blah.txz", "init", "tar", "xzcat", NULL);
 else wait();
 if (chdir("/blah") || chroot(".") || exec("/init")) complain_and_hang();
}

Statically linked against musl-libc that's not likely to be more than 32k, it's
all syscalls. The tar and xzcat binaries are a bit bigger, but not unreasonable
in either busybox or toybox...

Yep, I completely get this - and it is a good point. This is definitely a gray area on the "what should the kernel do vs what should we let userspace init handle". My reasoning for including it in the kernel is that all of the userspace init options to handle this (ie untaring something, or copying everything over straight away) amount to "double zipping" or moving something that the kernel has just extracted. This is a bit of a shame to require userspace to do, especially when it is a trivial patch to just have the kernel extract the initramfs to where we want it in the first place.


Or you could petition to add -x to mv I suppose. I could add it to toybox
tomorrow if you like? (And probably send a patch to Denys for busybox?)

I'm not sure how adding it to busybox would help - as you have already show, there are existing userspace workarounds (and I referred to two others in the patch's changelog: the tiny core linux and minikube init examples) so I'm not sure we need more?

How is it a "workaround"? The userspace tool is as old as initramfs.

Because it takes a thing that has just been extracted and moves it somewhere else. That is a workaround for it not being in the right place.

Your real complaint seems to be that a single ramfs instance is shared between
container instances, even when the PID 1 init process isn't.

Well, when rootfs is empty, it doesn't really matter that it's shared with all mount namespaces. My issue isn't with that, it's that the embedded initramfs flow is the one and only time that rootfs can't be relied upon to be empty.

What you're
"working around" is incomplete container namespace separation, and you're doing
so by adding yet another kernel config option. You are _adding_ a workaround to
the kernel.

What you are calling incomplete container namespace separation is the kernel's inability to unmount rootfs ever? I don't think that's a flaw - the logic for it makes perfect sense, you always have a rootfs so that you don't accidentally empty the mount tree. What doesn't make sense is then using that rootfs for anything more than that "stopper" under a "real" root - that's where the problems come in when attempting to swap roots for containers.

If you still need to complicate the kernel, wouldn't it make more sense to add a
runtime check for rootfstype=redundant or some such, and have _that_ do the
overmount (without needing a config symbol to micromanage a weird corner case
behavior)? If it's _init code it should be freed before launching PID 1...

The context that I'm talking about is situations where the init process within initramfs doesn't hand over to another init. This is for embedded initramfs situations.

I could do another version of the patch to check in the kernel for a rootfstype parameter if you like and work off of that rather than a build flag? Or would you not want that check within the kernel at all?

--
Emily Shepherd

Red Coat Development Limited




[Index of Archives]     [Linux Kernel]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux SCSI]

  Powered by Linux