Re: [PATCH v2] initramfs: Support unpacking directly to tmpfs

Rob Landley <rob@xxxxxxxxxxx> · Fri, 1 Dec 2023 16:02:50 -0600

On 11/29/23 21:31, Emily Shepherd wrote:
> On Wed, Nov 29, 2023 at 02:53:49PM -0600, Rob Landley wrote:
>>I'm assuming you can do process-local unmounts to prune what you'd be
>>overmounting? (switch_root didn't _have_ to delete the old initramfs contents,
>>it did it to free space. Containers similarly would remove privileges as part of
>>their setup, and that includes mounts the new context shouldn't have.)
> 
> I think we are talking at cross purposes here, let's get back on track: 
> The aim is to do a pivot_root within a container's mount namespace.

Your aim is. You are starting from your conclusion and reasoning backwards.

> In 
> order to do this, we need at least two layers of "root" in the host 
> mount namespaces. If there is just one (ie rootfs), we can't pivot_root, 
> because rootfs cannot be removed.
> 
> Yes, you can switch_root or chroot or do any number of things, but that 
> is not relevant.

Those approaches have worked fine for many people for many years.

Possibly I'm biased, I thought using pivot_root was silly when I learned
containers were doing it back in 2011:

  https://landley.net/notes-2011.html#02-06-2011

(It's a pity the sourceforge link doesn't work anymore. Cloud rot.)

> That is not the way container runtimes tend to work. 
> The desirable outcome is pivot_root.

Still reasoning backwards from your conclusion. But sure, let's accept for the
sake of argument that you're stuck with a bad legacy decision in userspace. Is a
kernel config option the best way to compensate vs checking root=tmpfs?

>>From reading the link's summary, it seems like "unmount the old inherited /proc
>>and /sys before you "mount --move newfs /" seems like it would have been a fix?
> 
> It was the patch that was made in the container runtimes at the time, 
> yes. This does not change the fact that the _desirable_ path is 
> pivot_root.

https://en.wikipedia.org/wiki/Proof_by_assertion

>>> This would not occur with pivot_root.
>>
>>It would not occur if the filesystem had been removed from the current mount
>>namespace by other means, either. (Or if the kernel got the test right, which
>>you're saying it does now.)
> 
> You can't remove it if it's rootfs. If your host's root is rootfs, as it 
> would be if you run directly from initramfs, you can't unmount it.

You can't remove rootfs even if you do overmount something on top of it. You can
only delete its contents, it's still a writeable filesystem shared between all
container instances. A userspace tool was provided to delete its contents many
years ago. Said tool has been in util-linux since 2009
(https://github.com/util-linux/util-linux/commit/711ea7307d54) and they copied
the name from https://git.busybox.net/busybox/commit/?id=0f34a821ab99 from 2005.
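
One way to see what is stacked at "/" is to read /proc/self/mountinfo. What
follows is a minimal Python sketch, assuming the field layout documented in
proc(5). The helper name root_stack and the sample text are made up for
illustration, and whether a rootfs line is visible at all depends on the kernel
version and on the calling process's root.

```python
def root_stack(mountinfo_text):
    """Return the filesystem types mounted at "/", in file order."""
    stack = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        try:
            # Optional fields end at a lone "-", never before index 6.
            sep = fields.index("-", 6)
        except ValueError:
            continue  # not a well-formed mountinfo line
        if sep + 1 >= len(fields):
            continue
        mount_point, fstype = fields[4], fields[sep + 1]
        if mount_point == "/":
            stack.append(fstype)
    return stack


# Hypothetical sample, not captured from a real system.
SAMPLE = """\
1 1 0:2 / / rw - rootfs rootfs rw
22 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
23 22 0:21 / /proc rw,nosuid - proc proc rw
"""

if __name__ == "__main__":
    print(root_stack(SAMPLE))
```

If the only entry is rootfs, nothing has been mounted over it, which is the
case where pivot_root(2) is documented to fail with EINVAL.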

If "the same rootfs is potentially exposed into every container namespace, which
is a problem even after it's been emptied" is the issue, your patch doesn't
address it. If it's NOT the issue, userspace has been able to achieve the state
you want without your patch since initramfs was created.

Your argument is "I don't want to". Which isn't a deal-breaker, but is the
context in which I'm reacting to the design change.

>>Back when I was trying to get /dev/console to work properly with init=/bin/sh I
>>didn't ask for a kconfig option to make the initial task be PID 2 with PID 1
>>(and its magic signal-blocking properties and inability to call reboot() and
>>session ID 0 and orphaned zombies reparented to it so on) being a second idle
>>task. I figured out how to make my userspace do the right thing.
>>
>>This really seems like an "init starts as PID 2" solution, which is a weird
>>thing to have a dedicated build-time kernel config option for.
> 
> I am afraid I do not understand this point at all. This change is not 
> requesting anything like this.

If I wanted the kernel to launch an init task for me, but I didn't want that
init task to be on PID 1, and I said that PID 1 has a bunch of strange
properties so it's inappropriate for that to run my code, and I proposed a patch
with a dedicated config option to do that, this would be analogous to the patch
you've presented.

If someone responded "you can call fork()" and I tried to come up with reasons
other than "I don't want to"... at that point we're in a Jim Jefferies routine.

And no this isn't an entirely hypothetical case, this was me scratching my head
for more than a year back in the busybox days, pestering people on IRC while
working out how to write oneit.c:

http://lists.busybox.net/pipermail/busybox/2008-November/067722.html
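
The "you can call fork()" approach can be sketched in a few lines. This is a
hedged illustration in Python, not a copy of oneit.c: the function name
supervise is made up, and a real PID 1 would also deal with signals, the
console, and shutdown.

```python
import os
import sys


def supervise(argv):
    """Fork, exec argv in the child, and reap children until it exits."""
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if the exec failed
    while True:
        # wait() reaps any child, including orphans reparented to us.
        pid, status = os.wait()
        if pid == child:
            if os.WIFEXITED(status):
                return os.WEXITSTATUS(status)
            return 128 + os.WTERMSIG(status)


if __name__ == "__main__":
    sys.exit(supervise(sys.argv[1:] or ["/bin/sh"]))
```

The payload runs under a fresh PID while the caller keeps the PID 1 duties,
which is the whole point of the analogy.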

>>You want to use rootfs but not use rootfs. It's very zen.
> 
> No, I want an initramfs, I just don't want it on rootfs.

There's ways of dealing with that. Have been for a while. I would respond to a
dedicated CONFIG_INIT_ADJUST to have PID 1 be a second idle task and put init=
on PID 2 fairly similarly.

> If I mount a 
> block device as root, that wouldn't be rootfs either.

I am aware of that, yes. I think I suggested it.

You are reasoning backwards from your solution and not thinking about the
design. I don't think you're addressing the real issue.

Right now "separate" container namespaces all share a common rootfs instance.
They do NOT share a common init task, even though before containers that was
universal. You can have your own PID namespace, which starts _empty_.

Your mount tree in a container does NOT start empty. From the clone(2) man page:

  If  CLONE_NEWNS  is  set,  the  cloned child is started in a new
  mount namespace, initialized with a copy of the namespace of the parent.

Defaulting to having everything in it and removing what you don't want to keep
is very different from what PID or UID namespaces do, and is causing you
problems. Doing a chroot is basically an overmount, the other mount points are
still there in your tree and accessible if you try hard enough, and rootfs is
common to all containers. Mitigating this requires cleanup work that isn't
always even possible to fully do (ala rootfs actually being used, which does
happen a lot today and it's always accessible if a static process forking its
own mount namespace does enough umounts, which can then act as a
cifs/nfs/9p/rsync server out to the parent or some such).

Extending the kernel to have a CLONE_NEWROOTFS where it gets a _new_
ramfs or tmpfs instance, unique to that namespace, at the root of a new empty
mount tree, is the logical fix. There is then design work around "so what API do
you use to populate it" which could range from "the first int below child_stack
is the fd of a cpio.gz to extract into it and then it launches an /init out of
there the way the host linux boots" through "the new child starts suspended ala
vfork/ptrace and then the parent process initializes it and unblocks it" to "the
init task is running the executable from the host context that called clone and
has inherited the existing open filehandles from the host context, although
despite the openat() family being in posix-2008 we sadly don't appear to have a
mountat()...". I dunno. That's design work to properly fix the issue.

You don't want to address the design problem, you want to add a special case
workaround for your current issue. You see doing that as a "design fix". I do not.

>>If you'd like to go "I want to have the kernel automatically mount a freshly
>>formatted ext4 filesystem and then have the kernel extract the cpio archive into
>>that instead, because it's more convenient for me to have the kernel do this for
>>me than doing it in userspace"...
> 
> Not what this patch is suggesting.

Yes, I've noticed.

My objection is "that may not be the right fix at the design level" and your
response is "my patch didn't implement what you're talking about".

> The kernel already supports mounting a block device, in lieu of a 
> userspace init doing it, via the root= parameter. Are you suggesting its 
> support of that is inappropriate?
> 
>>If you fix the mount --move issues you could bind mount your current directory,
>>cd $PWD, and then --move mount it to /
>>
>>I think you're addressing the wrong issue.
> 
> No, I'm fixing the fact that container runtimes want to pivot_root, and 
> can't when running directly from initramfs, as this extracts to rootfs.

Which you can already do with switch_root, with initrd, with a static binary
calling mount/mv/chroot, and probably other ways none of which would be you
exclusively reasoning backwards from your conclusion. You don't seem to want to
do that, and only want to talk about your solution not talk about the problem.

Fine. Moving on. I still think a dedicated CONFIG entry is a bad way to do the
silly thing. Specifying the silly thing on the kernel command line seems less bad.

>>As with the trivial patch to have init= launch PID 2, the cognitive load of
>>explaining to people WHY the config option exists and when somebody might have
>>wanted to use it in the kernel you're trying to forward port in a design you
>>inherited from somebody who isn't around anymore is itself a form of design
>>complexity. It's a special case _adding_ a design wart.
> 
> Is it possible that the reasoning of why this important would be much 
> more apparent to people in the container space?

Are you suggesting I don't understand because I'm not "one of us"?

Personally I think the problem is more likely that I _predate_ you, and had a
small hand in creating this entire area, and thus don't have the same hardwired
assumptions about this being the only possible way it could ever have been done.

In 2011 while working the OpenVZ booth at Scale along with Kir Kolyshkin (the
OpenVZ userspace maintainer, we were coworkers when I did a contract at
Parallels), I came up with the phrase "chroot on steroids" as part of my 30
second pitch explaining the conceptual difference between virtualization and
containerization to attendees who stopped at the booth to ask what we offered.
(I wrote down "chroot on steroids" back in
https://landley.net/notes-2011.html#19-04-2011 and got con crud at scale in
https://landley.livejournal.com/53863.html . Yeah, sometimes I do marketing on
the side. Startups need everybody to do everything. Containers were just SO COOL
and nobody KNEW about them back then.) I also wrote various documentation like
https://landley.net/lxc/ while adding container support to things like
CIFS in commit f1d0c998653f.

The team I worked with at Parallels was porting container technology from
OpenVZ into Linux, which is how vanilla Linux got containers. (The line of
development dated back to a Russian bank moving from mainframes to linux in 1999
and writing in-house mainframe-like extensions to Linux. I asked Kir, who had
worked for said bank before the tech was spun out. Don't ask me the details of
the 12 year old conversation, but he's still around if you want to ask him. He
works at Red Hat these days.)

Later I attended the https://github.com/Fewbytes/rubber-docker talk live at
linux.conf.au in 2017 and have collected links like
https://blog.lizzie.io/linux-containers-in-500-loc.html ever since. I've meant
to add basic container support to toybox but it's been fairly far down on my
todo list, partly because I talked the android developers' ears off about
containers back when they merged toybox in 2015 (ala
http://lists.landley.net/pipermail/toybox-landley.net/2015-February/015135.html)
and a year later they wrote minijail (ala https://lwn.net/Articles/700557/) so
they've got their own plumbing now that I'd have to be compatible with. And once
I implemented unshare+nsenter in toybox I've mostly just done stuff like
"sudo env -i USER=root TERM=linux SHELL=/bin/bash LANG=$LANG
PATH=/bin:/sbin:/usr/bin:/usr/sbin unshare -Cimnpuf chroot debootstrap" which is
usually good enough for my personal quick and dirty use cases.

I watched the rise of docker, was in the audience at ELC when the systemd guys
announced rocket, still think the initial cgroup filesystem had a better design
than cgroup2 (it could NEST, and why they don't properly instance devtmpfs in
containers I still don't understand). I vaguely followed the multiple variants
of flatpak that emerged because Ulrich Drepper hated static linking and too many
people swallowed that line of nonsense. I remember when /proc/self/exe being
inherited from the host became an exploit vector (an area I had previous
interest in for different reasons ala https://lkml.org/lkml/2017/9/12/175)...

But mostly what I was paying attention to was the checkpoint/restore stuff in
case that was reasonable to implement in toybox, largely because of prior
interest: back in 2002
https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0610.html and
https://lkml.iu.edu/hypermail/linux/kernel/0206.2/0835.html got reblogged a lot
by various lwn.net competitors, so people would ask _me_ about it and it was so
nice when we finally got plumbing that let people _implement_ it. I mean OpenVZ
already had live migration working great at the 2011 SCALE demo, that's what Kir
was showing people at the other end of the table (and what I handed them off to
when I got them spun up on the basic "containers are cool and it's not the same
as VMs, all this 'cloud' stuff they're building from beowulf should pivot from
VM to container ASAP" pitch), but it took a LOT of redesign to chip off pieces
and get them into vanilla. (For one thing OpenVZ had added a bunch of new
syscalls and Linus wanted to use synthetic filesystems instead.)

You can argue that I'm _out_of_date_ if you like. (Probably true, on most
things. Jack of all trades, master of none. Usually I know who to ask.) But I
have some familiarity with, and reason to care about, both container
infrastructure and Linux early boot.

> I disagree that this introduces a design wart. On the contrary, I 
> believe it adds the option to make initramfs more consistent with the 
> other root setup methods:

I think it's silly, and think having a build-time CONFIG option dedicated to
doing something silly is not ideal.

Checking for "root=tmpfs" to trigger the silly thing seems less bad to me,
although I note that init/do_mounts.c function init_rootfs() already _is_
checking for that (and there's a pending patch to tweak it), so... be aware.
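
To illustrate what keying the behavior off the command line would mean, here is
a userspace sketch in Python. It is hypothetical: it does not reproduce what
init_rootfs() actually does, the names parse_cmdline and wants_tmpfs_root are
invented, and it assumes no quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Map each key to its last value; bare flags map to None."""
    params = {}
    for word in cmdline.split():
        if word == "--":
            break  # everything after "--" is handed to init
        key, sep, value = word.partition("=")
        params[key] = value if sep else None
    return params


def wants_tmpfs_root(cmdline):
    """Hypothetical trigger: true only when root=tmpfs was given."""
    return parse_cmdline(cmdline).get("root") == "tmpfs"


if __name__ == "__main__":
    print(wants_tmpfs_root("console=ttyS0 root=tmpfs rdinit=/init"))
```

On a running system the string to feed it would come from /proc/cmdline.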

> 1. kernel mounted block device via root= results in a nominally empty 
> rootfs and a block device on top with the root file system in it. 
> pivot_root can be used.
> 2. initramfs which performs some init, mounts a block device, then 
> switches root to it. This results in a nominally empty rootfs and a 
> block device on top with the root file system in it. pivot_root can be 
> used.
> 3. initramfs which contains an embedded root filesystem to be used 
> directly. Results in a rootfs with the root file system in it with 
> nothing on top. pivot_root cannot be used.
> 
> This patch simply changes point 3, to be more in line with the others:

I believe I have basic context for Linux early boot.

>>You're asking the kernel to create a second empty ramfs or tmpfs instance, and
>>instead of checking an existing argument like "root=tmpfs" you're changing the
>>kernel's behavior with a dedicated config option that does a specific thing.
> 
> If we want to set this behaviour via a kernel parameter, we can do that 
> :)

As a convenience feature, a kernel parameter makes more sense to me. I still
think it's a silly thing to want to do, but it means I don't have to teach
people to fish it out of analogous defconfigs when they're learning to do board
bringup. (The hardware vendor usually has a Linux BSP on some variant of a demo
board, otherwise the people speccing the product wouldn't have selected that
chipset for a Linux project. There are exceptions, but sane management doesn't
throw newbies at them.)

>>Whatever your container stuff is
> 
> Love it or hate it, lots of stuff runs on containers now. The kernel has 
> made plenty of changes to better facilitate containers.

Yes. I was there. Helping make that happen was, briefly, my day job.

>>it won't be
>>able to run on any of those existing systems that keeps initramfs populated with
>>files. So again why have it be a config option: if you're going to change the
>>behavior, change it for EVERYBODY or your stuff will need a special kernel
>>configuration in order to run.
> 
> Sure, if we think its more appropriate to just do this always (not via a 
> build option) or gated behind a kernel parameter, we can do that.

I think doing it unconditionally will break existing users. And be unnecessary
for 95% of the existing users of initramfs.

Heck, I didn't expect the CONFIG_DEVTMPFS_MOUNT patch above would break debian's
init ala https://lkml.iu.edu/hypermail/linux/kernel/1705.2/05813.html because
they had a broken error handling path in their init script that had never
triggered before, and when it did trigger it did something stupid and crashed
the system. And yet that blocked the patch, and then adding a workaround for
debian hit
https://lore.kernel.org/linux-input/d6a5ba05-5de2-24e0-49ae-437058001b37@xxxxxxxxxxx/
and even though
https://lkml.iu.edu/hypermail/linux/kernel/1709.2/02400.html has almost
certainly happened by now the patch is still maintained out of tree...

Also, my inittmpfs patches in 2013 only used tmpfs instead of ramfs under
specific circumstances, and left it as ramfs in others (mostly determined by the
root= and rootfstype= arguments). Despite that, I got email from somebody who
switched their system to tmpfs and had it break because the cpio.gz they were
extracting ate about 60% of the kernel's memory (which worked fine in their use
case) and the tmpfs mounts default to size=50%, so it worked for ramfs but not
tmpfs. And since ramfs takes no arguments rootflags= was never wired up to be
passed through there to override the size=, and I _still_ haven't bothered to do
it (the bug reporter went back to ramfs, and it's a static variable in
init/do_mounts.c and exporting it over through rootfs_init_fs_context() requires
touching header files so wasn't a 5 minute job... If somebody else wants to,
feel free.)
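
The arithmetic behind that failure is simple enough to sketch. A minimal Python
illustration, assuming the tmpfs default of size=50% of RAM and treating ramfs
as limited only by total memory (a simplification, since the kernel itself
needs some of that memory); archive_fits is a made-up name:

```python
def archive_fits(total_ram, unpacked_size, fstype, size_percent=50):
    """Rough check of whether an unpacked archive fits in the root fs."""
    if fstype == "ramfs":
        # ramfs has no per-mount size limit, only available memory.
        return unpacked_size <= total_ram
    if fstype == "tmpfs":
        # Integer math avoids float rounding at the boundary.
        return unpacked_size * 100 <= total_ram * size_percent
    raise ValueError(f"unknown fstype: {fstype}")


if __name__ == "__main__":
    ram = 1 << 30                 # 1 GiB, picked for illustration
    unpacked = ram * 60 // 100    # archive expands to about 60% of RAM
    print(archive_fits(ram, unpacked, "ramfs"))
    print(archive_fits(ram, unpacked, "tmpfs"))
```

Passing a larger size_percent stands in for the size= override that rootflags=
would need to carry through.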

>>Heck, Debian populates initramfs with a cpio.gz file as part of its normal boot
>>process:
>>
>>$ zcat /boot/initrd.img-4.19.0-22-amd64  | toybox file -
>>-: ASCII cpio archive (SVR4 with no CRC)
>>
>>Has done for over a decade. You're saying debian can clean up but your stuff
>>can't be expected to.
> 
> No, that is not what I'm saying.

That's the part I don't understand. It _seems_ like what you were saying. Not
"this hasn't been working fine for everyone else for the past 15 years already",
but "I think it should have been designed a different way 20 years ago, and
would like to change it to match my opinion".

You're not unlocking a new capability you can't do without this patch. You just
think it should always have worked differently than it does.

I've submitted a few convenience patches myself over the years. Commit
595a22acee26 eventually made it in, and I've maintained
https://lkml.iu.edu/hypermail/linux/kernel/2302.2/05597.html out of tree since
https://lkml.iu.edu/hypermail/linux/kernel/1606.2/05686.html . I am not
conceptually opposed to convenience patches. (Nor am I the final decision maker
here, by the way...)

I'm just not seeing how the old way is _hard_ when the tool for it is in
util-linux (as I said, I may be biased), and I don't think this addresses the
underlying issue of the same rootfs being globally visible in all containers
when the same init task isn't (and is similarly "kernel panics if this exits"
important).

>>If you want a NULLFS, that is a design change. Maybe ask for the design change
>>so THAT can be discussed. Your config option seems like a partial fix at best,
>>and the kernel has enough abandoned partial fixes needing legacy support.
> 
> We already have what you call a nullfs. It's defined in 
> init/noinitramfs.c and usr/default_cpio_list, and its what you get if 
> you call switch_root within the initramfs.

I mean a filesystem type that maintains no state and _cannot_ be written to,
which your patch doesn't get because the rootfs that's there _can_ still be
written to if you try.

Hiding stuff in undermounts is classic 1970's student shenanigans, at best
you're making it less obvious. The ramfs instance that's always mounted and
never written to is still mapped into every container namespace, and the fact it
_can_ be written to is non-obvious.

Note: that's basically what Linux had before initramfs was invented. The root=
mount was mandatory and there was nothing under it. There _being_ a rootfs
instead was a design decision made 20 years ago by other people, which you seem
to be trying to revert.

> In most runtime situations, rootfs _is_ what you'd call a nullfs.

Where "most" apparently does not include the category of "routers", which are
one of the most numerous types of Linux system in the world. (According to
https://www.bitdefender.com/blog/hotforsecurity/current-routers-use-eol-linux-kernel-chock-full-vulnerabilities/
91% of the routers they tested ran Linux, and just the residential router market
was $11 billion in 2023, which assuming an average sale price of $200 is about
55 million units. The PC market shipped 68 million units but most of those run
windows or macos.)

LOTS of embedded people have used the existing initramfs, and it's accumulated a
BUNCH of weirdness over the years. Did you know you can concatenate multiple
cpio.gz files and the kernel loader will accept them as one big archive? Except
gnu cpio adds runs of NUL bytes to the end, which broke the parser at one point,
and then the android guys used a tool that reproduced the gnu behavior, and then
I got added to the bug report...

http://lists.landley.net/pipermail/toybox-landley.net/2021-April/028464.html

Every one of those archives was extracted into the initramfs we have now.
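
For anyone who hasn't looked at the format, the concatenation case can be
sketched in userspace. This is a simplified Python illustration, not the
kernel's parser: it only handles uncompressed "newc" archives, the helper names
are invented, and the entries it builds carry placeholder metadata.

```python
MAGIC = b"070701"   # "newc" format magic
HEADER_LEN = 110    # 6 magic bytes + 13 fields of 8 hex digits
TRAILER = "TRAILER!!!"


def pad4(n):
    """Bytes needed to round n up to a multiple of 4."""
    return -n % 4


def make_entry(name, data=b""):
    name_bytes = name.encode() + b"\0"
    # ino, mode, uid, gid, nlink, mtime, filesize, devmajor, devminor,
    # rdevmajor, rdevminor, namesize, check
    fields = [0, 0o100644, 0, 0, 1, 0, len(data), 0, 0, 0, 0,
              len(name_bytes), 0]
    head = MAGIC + b"".join(b"%08X" % f for f in fields) + name_bytes
    head += b"\0" * pad4(len(head))
    return head + data + b"\0" * pad4(len(data))


def make_archive(files, trailing_nuls=0):
    body = b"".join(make_entry(n, d) for n, d in files.items())
    return body + make_entry(TRAILER) + b"\0" * trailing_nuls


def list_names(blob):
    """List file names across concatenated archives, skipping NUL runs."""
    names, pos = [], 0
    while pos < len(blob):
        if blob[pos] == 0:  # padding after a trailer
            pos += 1
            continue
        header = blob[pos:pos + HEADER_LEN]
        if len(header) < HEADER_LEN or not header.startswith(MAGIC):
            raise ValueError(f"no cpio header at offset {pos}")
        fields = [int(header[6 + 8 * i:14 + 8 * i], 16) for i in range(13)]
        filesize, namesize = fields[6], fields[11]
        name_end = pos + HEADER_LEN + namesize
        name = blob[pos + HEADER_LEN:name_end - 1].decode()
        data_start = name_end + pad4(HEADER_LEN + namesize)
        pos = data_start + filesize + pad4(filesize)
        if name != TRAILER:
            names.append(name)
    return names


if __name__ == "__main__":
    first = make_archive({"init": b"#!/bin/sh\n"}, trailing_nuls=512)
    second = make_archive({"etc/passwd": b"root::0:0::/:/bin/sh\n"})
    print(list_names(first + second))
```

The design choice worth noticing is that a trailer does not end parsing: the
loop skips the NUL run and carries on at the next header, so both archives
land in the same tree.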

There have also been multiple threads about adding xattr support to cpio and the
initramfs extractor (without which selinux bringup is WAY more awkward, and yes
I get cc'd on them ala
https://lkml.iu.edu/hypermail/linux/kernel/1905.2/06559.html), which so far keep
petering out without resolution...

You are not the first person to use this plumbing. "Everybody _really_ wants
what I think it should always have been like, but nobody's mentioned it in the
past 20 years" is a strange position to take. Earlier you said "the fact that
the desirable path is" as a universal statement rather than a personal opinion.
Desirable to who? Judged as "fact" by who?

> So 
> yes, sure: I want a nullfs when my root filesystem lives inside the 
> initramfs too.

You didn't understand what I meant by nullfs, I tried to clarify above.

> Like I'd get if I'm mounting with root= and like I'd get 
> if initramfs calls switch_root.

I.E. like you can already get in multiple ways from userspace without this patch.

A special root=tmpfs or similar workaround to change the behavior is small and
ignorable enough that I personally wouldn't object to it, if it didn't break
anything else. I still don't think it's the right approach, but you do you.

(For one thing, I can just NOT document "root=tmpfs" and it's invisible, and
doesn't cause a problem. Where CONFIG_THINGY is a thing people see in menuconfig
and I have to explain why they DON'T need it. In a previous life I was (again
briefly) linux-kernel documentation maintainer, and still produce a lot of it in
other contexts. My "oh goddess, not another variant to explain during early
boot" reaction may be unique, I dunno, but... sigh. It is not entirely without
cost.)

Rob