On 11/29/23 03:00, Emily Shepherd wrote:
> For systems which run directly from initramfs, it is not possible to use
> pivot_root without first changing root. This is because of the
> intentional design choice that rootfs, which is where initramfs is
> unpacked to, cannot be unmounted.
>
> pivot_root is an important feature for creating containers

Because nobody's ever wanted to fix chroot() so
mkdir("sub", 0777); chroot("sub"); chdir("../../../../.."); chroot(".");
wouldn't escape it, so they repurposed a syscall intended to dispose of
initial ramdisks that does things like traverse the process list and
close/reopen all the open files pointing to the old filesystem
(including "." and "/") so they point to the new filesystem.
Personally, I always found THAT a bit awkward, but there wasn't a
"clean" system call that _only_ did what the container guys wanted
(patch the mount tree).

I would have thought you could use "mount --move . /" to nerf the
cd ../../.. but for some reason it didn't work (I forget why) and
nobody wanted to fix that either.

(By the way, I used pivot_root() on ramfs back when it DID move it,
which then allowed you to unmount it, at which point the kernel locked
up as the doubly linked list traversal kept going until they hit the
initramfs entry that was "always there" and thus reliably terminated the
list... Yeah, that got fixed, now the pivot_root returns an error. Being
an initramfs early adopter was "interesting"...)

> and the
> alternative (mounting the new root over the top of the old with MS_MOVE
> and then calling chroot) is not favoured by most container runtimes
> [1][2] as it does not completely remove the host system mounts from the
> mount namespace.
>
> The general work around, when running directly from initramfs, is to
> have init mount a new tmpfs, copy everything out of rootfs, and then
> switch_root [3][4].

Which is why I added switch_root to busybox 18 years ago, yes.
(I thought I got the idea from klibc but I'm not finding it in their git
repo. I didn't invent it, there was an existing one somewhere, my 2005
busybox commit comment credits run_init.c from "kconfig" which can't be
right. I did rename it to be more obviously analogous to pivot_root...)

> This is only required when running directly from the
> initramfs as all other methods of acquiring a root device (having the
> kernel mount a root device directly via the root= parameter, or using
> initramfs to mount and then switch_root to a new root) leave an empty
> rootfs at the top of the mount stack.

If you don't use rootfs you don't have to empty it, yes.

You could use an old-style initrd which would be mounted over the root
filesystem and which you could switch_root away from and then unmount.
Then pivot_root() could actually perform its as-designed function,
although last I checked it wasn't fully container-aware so tended to
have fairly awkward global impact if you ran it inside a container
without being VERY careful. (Maybe it's been fixed since?)

You could also have your own tar.xz in rootfs with a tiny busybox/toybox
root to extract it into the subdir so "cp -ax" didn't have a 2X memory
high water mark. You could even have a little static binary to call so
you don't even need a shell. Off the top of my head:

  void main(void)
  {
    mkdir("blah", 0777);
    mount("newroot", "blah", "tmpfs", 0, "noswap,size=37%,huge=within_size");
    if (!fork()) exec("tar", "tar", "xpC", "blah", "blah.txz", NULL);
    else wait();
    if (!fork()) exec("rm", "rm", "blah.txz", "init", "tar", "xzcat", NULL);
    else wait();
    if (chdir("/blah") || chroot(".") || exec("/init")) complain_and_hang();
  }

Statically linked against musl-libc that's not likely to be more than
32k, it's all syscalls. The tar and xzcat binaries are a bit bigger, but
not unreasonable in either busybox or toybox...

Or you could petition to add -x to mv I suppose. I could add it to
toybox tomorrow if you like?
(And probably send a patch to Denys for busybox?)

> This commit adds a new build option - EMPTY_ROOTFS, available when
> initrd/initramfs is enabled. When selected, rather than unpacking the
> inbuilt / bootloader provided initramfs directly into rootfs, the kernel
> will mount a new tmpfs/ramfs over the top of the rootfs and unpack to
> that instead, leaving an empty rootfs at the top of the stack. This
> removes the need to have init copy everything as a workaround.

How is it a "workaround"? The userspace tool is as old as initramfs.

Your real complaint seems to be that a single ramfs instance is shared
between container instances, even when the PID 1 init process isn't.
What you're "working around" is incomplete container namespace
separation, and you're doing so by adding yet another kernel config
option. You are _adding_ a workaround to the kernel.

If you still need to complicate the kernel, wouldn't it make more sense
to add a runtime check for rootfstype=redundant or some such, and have
_that_ do the overmount (without needing a config symbol to micromanage
a weird corner case behavior)? If it's _init code it should be freed
before launching PID 1...

Rob
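
For reference, the "off the top of my head" static helper in the message uses shorthand (exec(), wait() with no argument) that will not compile as written. Below is one possible compilable reading of it, a sketch under assumptions rather than the author's actual code. The names run_and_wait() and switch_to_tmpfs_root() are hypothetical. The paths "blah" and "blah.txz" and the tmpfs options come from the message. The tar invocation ("xpf blah.txz -C blah") is an assumption about what was intended, and the mount options need matching kernel support.

```c
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork, exec argv[0] via PATH, wait for it.
 * Returns the child's exit status, 127 if exec failed, -1 on other errors. */
int run_and_wait(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Unpack an archive into a fresh tmpfs, chroot into it, exec /init.
 * Assumes the current directory is / of the initramfs and that the
 * caller is PID 1 running as root. Returns only on failure. */
int switch_to_tmpfs_root(void)
{
    /* Assumed tar syntax; the message's sketch wrote "xpC blah blah.txz". */
    char *untar[]   = { "tar", "xpf", "blah.txz", "-C", "blah", NULL };
    char *cleanup[] = { "rm", "blah.txz", "init", "tar", "xzcat", NULL };
    char *init[]    = { "/init", NULL };

    if (mkdir("blah", 0777) && errno != EEXIST)
        return -1;
    /* Options copied from the message; they depend on kernel version/config. */
    if (mount("newroot", "blah", "tmpfs", 0,
              "noswap,size=37%,huge=within_size"))
        return -1;
    if (run_and_wait(untar) != 0)
        return -1;
    run_and_wait(cleanup);          /* best effort: free the rootfs copies */
    if (chdir("blah") || chroot("."))
        return -1;
    execv("/init", init);
    return -1;                      /* exec failed */
}
```

A real init would call switch_to_tmpfs_root() from main() and, where the message says complain_and_hang(), report the error and loop forever rather than return, since PID 1 exiting panics the kernel. Like the original sketch, this only does chdir() plus chroot(); it does not perform the MS_MOVE step that switch_root does.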