Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs

Rob Landley <rob@xxxxxxxxxxx> · Wed, 29 Nov 2023 10:38:48 -0600

On 11/29/23 03:00, Emily Shepherd wrote:
> For systems which run directly from initramfs, it is not possible to use
> pivot_root without first changing root. This is because of the
> intentional design choice that rootfs, which is where initramfs is
> unpacked to, cannot be unmounted.
> 
> pivot_root is an important feature for creating containers

Because nobody's ever wanted to fix chroot() so that mkdir("sub", 0777);
chroot("sub"); chdir("../../../../.."); chroot("."); wouldn't escape it. Instead
they repurposed a syscall intended to dispose of initial ramdisks, which does
things like traverse the process list and repoint every process's "." and "/"
from the old root filesystem to the new one.
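
For anyone who hasn't seen that escape spelled out, here is a minimal sketch
of the sequence above as a function. The name escape_chroot() and the 64-level
climb are my own illustration, and it only works for a process that is allowed
to call chroot() in the first place:

```c
#define _DEFAULT_SOURCE   /* for chroot() on glibc */
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Classic chroot() escape. Needs CAP_SYS_CHROOT. Returns 0 on success,
 * -1 with errno set on failure. "sub" is any directory we can create.
 */
int escape_chroot(const char *sub)
{
  int i;

  if (mkdir(sub, 0777) && errno != EEXIST) return -1;
  /* Move our root below us; the cwd is left outside the new root. */
  if (chroot(sub)) return -1;
  /* ".." is only clamped at the process root, which we are not under. */
  for (i = 0; i < 64; i++) if (chdir("..")) return -1;
  /* The cwd is now as far up as ".." goes; make that our root. */
  return chroot(".");
}
```

The trick is that chroot() changes the root but not the current directory, so
the process ends up holding a cwd outside its own root.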

Personally, I always found THAT a bit awkward, but there wasn't a "clean" system
call that _only_ did what the container guys wanted (patch the mount tree). I
would have thought you could use "mount --move . /" to nerf the cd ../../.. but
for some reason it didn't work (I forget why) and nobody wanted to fix that either.

(By the way, I used pivot_root() on ramfs back when it DID move it, which then
allowed you to unmount it, at which point the kernel locked up as the doubly
linked list traversal kept going until it hit the initramfs entry that was
"always there" and thus reliably terminated the list... Yeah, that got fixed,
now the pivot_root returns an error. Being an initramfs early adopter was
"interesting"...)

> and the
> alternative (mounting the new root over the top of the old with MS_MOVE
> and then calling chroot) is not favoured by most container runtimes
> [1][2] as it does not completely remove the host system mounts from the
> mount namespace.
> 
> The general work around, when running directly from initramfs, is to
> have init mount a new tmpfs, copy everything out of rootfs, and then
> switch_root [3][4].

Which is why I added switch_root to busybox 18 years ago, yes. (I thought I got
the idea from klibc but I'm not finding it in their git repo. I didn't invent
it, there was an existing one somewhere, my 2005 busybox commit comment credits
run_init.c from "kconfig" which can't be right. I did rename it to be more
obviously analogous to pivot_root...)

> This is only required when running directly from the
> initramfs as all other methods of acquiring a root device (having the
> kernel mount a root device directly via the root= parameter, or using
> initramfs to mount and then switch_root to a new root) leave an empty
> rootfs at the top of the mount stack.

If you don't use rootfs you don't have to empty it, yes.

You could use an old-style initrd which would be mounted over the root
filesystem and which you could switch_root away from and then unmount. Then
pivot_root() could actually perform its as-designed function, although last I
checked it wasn't fully container-aware so tended to have fairly awkward global
impact if you ran it inside a container without being VERY careful. (Maybe it's
been fixed since?)

You could also ship your real root as a tar.xz inside rootfs, next to a tiny
busybox/toybox to extract it into the new tmpfs subdir, which avoids the 2X
memory high water mark that "cp -ax" has.
You could even have a little static binary to call so you don't even need a
shell. Off the top of my head:

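#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
  mkdir("blah", 0777);
  mount("newroot", "blah", "tmpfs", 0, "noswap,size=37%,huge=within_size");
  /* "xpCf": C takes "blah" (target dir), f takes "blah.txz" (archive). */
  if (!fork())
    _exit(execlp("tar", "tar", "xpCf", "blah", "blah.txz", (char *)0));
  else wait(NULL);
  if (!fork())
    _exit(execlp("rm", "rm", "blah.txz", "init", "tar", "xzcat", (char *)0));
  else wait(NULL);
  if (chdir("/blah") || chroot(".") || execl("/init", "init", (char *)0))
    for (;;) pause();  /* complain_and_hang(): PID 1 must never exit */
}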

Statically linked against musl-libc that's not likely to be more than 32k, it's
all syscalls. The tar and xzcat binaries are a bit bigger, but not unreasonable
in either busybox or toybox...

Or you could petition to add -x to mv I suppose. I could add it to toybox
tomorrow if you like? (And probably send a patch to Denys for busybox?)

> This commit adds a new build option - EMPTY_ROOTFS, available when
> initrd/initramfs is enabled. When selected, rather than unpacking the
> inbuilt / bootloader provided initramfs directly into rootfs, the kernel
> will mount a new tmpfs/ramfs over the top of the rootfs and unpack to
> that instead, leaving an empty rootfs at the top of the stack. This
> removes the need to have init copy everything as a workaround.

How is it a "workaround"? The userspace tool is as old as initramfs.

Your real complaint seems to be that a single ramfs instance is shared between
container instances, even when the PID 1 init process isn't. What you're
"working around" is incomplete container namespace separation, and you're doing
so by adding yet another kernel config option. You are _adding_ a workaround to
the kernel.

If you still need to complicate the kernel, wouldn't it make more sense to add a
runtime check for rootfstype=redundant or some such, and have _that_ do the
overmount (without needing a config symbol to micromanage a weird corner case
behavior)? If it's __init code it should be freed before launching PID 1...
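
The check itself is trivial. Here is a userspace sketch of the matching logic
(not kernel code, and "redundant" is just the placeholder value from above);
cmdline_has() is a hypothetical helper you could feed the contents of
/proc/cmdline:

```c
#include <string.h>

/*
 * Return 1 if the whitespace-separated command line contains key=value
 * as a whole word, else 0. Illustration only: quoting is ignored.
 */
int cmdline_has(const char *cmdline, const char *key, const char *value)
{
  size_t klen = strlen(key), vlen = strlen(value);
  const char *p = cmdline;

  while (*p) {
    size_t len;

    p += strspn(p, " \t\n");    /* skip separators */
    len = strcspn(p, " \t\n");  /* length of this word */
    if (len == klen + 1 + vlen && !strncmp(p, key, klen) && p[klen] == '='
        && !strncmp(p + klen + 1, value, vlen)) return 1;
    p += len;
  }
  return 0;
}
```

So cmdline_has("console=ttyS0 rootfstype=redundant", "rootfstype", "redundant")
returns 1, and the same call against "rootfstype=tmpfs" returns 0.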

Rob