On 12/1/23 17:37, Emily Shepherd wrote:
> I have to say I struggle to understand where to go from here... as I
> said above, I do like the CLONE_NEWROOTFS suggestion (and it was
> actually something I was batting around for my own project) but that
> feels that a _way more_ specialised feature.
>
> And now you are saying that apparently we _shouldn't_ make a relatively
> small change to initramfs because its worked fine for years, but we
> should add a much larger patch to clone() which has also worked for many
> years?

No, "the perfect is the enemy of the good" applies. Blocking a small fix
to force a large fix isn't always reasonable.

I just dislike adding special cases. Right now when you pass
root=/dev/sda1 you specify what to overmount on rootfs at runtime, so if
you're going to overmount something _else_ that seems the layer to do it
at. A CONFIG_BLAH=y option to do a similar thing (changing the semantics
from a different area entirely) seems painful, especially as a workaround
for a self-inflicted issue in a userspace package.

I would very much like the kernel to get _simpler_ over time. It isn't,
and it won't, and eventually we'll start over with
https://www.youtube.com/watch?v=Ce1pMlZO_mI or something. But I can at
least try to push back a bit to _slow_ the descent.

(Unix lasted about 1 billion seconds, which rolled over Saturday,
September 8, 2001 (US time). The second billion seconds of the Unix clock
rolls over Tuesday, May 17, 2033, which roughly coincides with Linus
hitting retirement age. Not a profound observation, just a frame of
reference...)

Posix outlived Unix v7 and System V and so on because it defined an
interface that could have its implementation swapped out. Posix is a
TERRIBLE standard because if you implement a Posix-only system it can't
boot (no "init" command) and can't access filesystems (no "mount"
command), but it's the subset people could agree upon.
(Politics: they needed something with holes big enough to drive OS/360
and Windows NT through, because while FIPS-151-2 was in force the big
boys couldn't get federal procurement contracts unless they nominally
complied. Even 1980s Apple came out with A/UX, no really:
https://www.youtube.com/watch?v=nwrTTXOg-KI )

A multiple-choice interface is harder to get test coverage on in a single
implementation, let alone an IETF-style bakeoff where
https://docs.freebsd.org/en/books/handbook/linuxemu/ and
https://learn.microsoft.com/en-us/windows/wsl/ and
https://9to5google.com/2021/02/12/google-fuchsia-os-android-linux-programs-starnix/
and so on all agree on a documented set of interfaces that can run the
same code.

There may be some API pruning once everybody old enough to remember when
"the GPL" was a single thing (instead of Samba and Linux being unable to
share code even though they implement two ends of the same protocol and
are both GPL) has aged out of the productive flow, at which point you may
not be able to _pay_ enough younguns to touch the modern equivalent of
COBOL...

(I say this as someone who has implemented a GNU-compatible sed TWICE,
once in busybox, once in toybox. And lamented the lack of a standard that
actually MEANS something both times.)

I have long threads with Bash maintainer Chet Ramey about weird corner
cases of Bash
(http://lists.landley.net/pipermail/toybox-landley.net/2023-June/029616.html)
because I'm implementing a bash-compatible shell from scratch in toybox,
and the closest I have to a "standard" is the bash man page, which does
not always document what bash actually _does_. (Alas, Chet keeps FIXING
things I bring up, which he considers progress and I consider making bash
a moving target...)

Anyway, this sort of thing tends to be on my mind a lot.
If you assume an ABI/API is gonna get extracted from this with a new
implementation stuck under it someday (as has happened before), "which
bits will definitely get pruned but probably cause collateral damage" is
a question that comes up. I expect "a minimal host system capable of
running containers" to be a fairly EARLY cloning target...

> I shouldn't question how initramfs works because you were there
> when it was written [1], but we should question all the devs who decided
> on CLONE_NEWNS over CLONE_NEWROOTFS?

Oh no, please question it. Question everything.

And I only started paying attention to this one a little _after_ it was
written. Early adopter, not author. Reported various bugs, wrote the
documentation I'd wanted to read, genericized the userspace tooling a
bit... But it was Al Viro's baby.

Speaking of which, all the http://www.uwsg.iu.edu/ links in said docs
still work if you switch them to https://lkml.iu.edu/ and leave the rest
under it. I should push a patch, but the linux development community
chased all the hobbyists who used to fix that sort of thing away at least
ten years ago, sometime before https://lwn.net/Articles/563578/ so
nothing that doesn't affect Red IBM Hat's bottom line really gets
addressed in vanilla these days. They just sort of linger if it's not
worth billable hours for a career engineer to do on the clock, run
through Jira, and check off on the spreadsheet in the standup.

Those links have been broken for YEARS. I fixed my local copy. People
occasionally email me and I tell them the update. But nobody's tried to
push a patch through the signed-off-by in triplicate with the 47 files in
Documentation/process including 873 lines of submitting-patches.rst and a
24 step submit-checklist.rst which sort of assume you've read
contribution-maturity-model.rst and "The lifecycle of a patch" section
out of 2.Process.rst and...
At least until the network admin running kernel.org gets his way and
closes down the open mailing list, replaced by one that only approved
people are allowed to join:

https://social.kernel.org/objects/9b3adb80-4198-4c86-abbd-aa3c58700975

And then they stop taking patches by email:

https://social.kernel.org/objects/fbda91b8-f865-4ee5-9a40-22a2c70479f4

*shrug* See above about me waiting to see what replaces all this when it
rolls to a stop...

> I'm not saying we shouldn't, but help me out here - how can I tell
> what's "reasonable" to question and what isn't?

Everything is reasonable to question. Not always helpful, but reasonable.

> I merely meant that there
> are a hell of a lot of different build options and systems within the
> kernel, and it is perhaps not unreasonable to suggest that it is not a
> requirement that everyone intimately understands all of them all of the
> time.

I'm weird enough to still _try_. At least in the parts common to the
systems I'm building on a dozen different architectures. (_Everybody_ has
to go through early boot.)

I'm currently trying to get vanilla u-boot, linux, and devuan debootstrap
to run on the orange pi 3b because I don't trust anything that keeps its
repo on "huaweicloud" to _not_ have spyware in it because Xi Who Must Be
Obeyed ordered it so. The hardware was put together by some very nice
engineers, who seem to have pushed support upstream into the various
vanilla projects, so I _should_ be able to get all-vanilla to work on
this. (Unlike raspberry pi which is still binary blobs as far as the
bootloader can see and a forked kernel.)

But in order to build a fully capable u-boot for this board I need an
or1k cross-compiler because the power controller needs firmware, which
they provide the source code to but somebody actually made an openrisc
ASIC (really!)
to control the power, so you need to compile it with an or1k cross
compiler to make the firmware to load into it, and if u-boot doesn't
initialize this hardware, the Linux kernel it hands off to can't suspend
or reboot the board from software:

https://github.com/u-boot/u-boot/blob/master/board/sunxi/README.sunxi64#L64

The problem is, if I get distracted by that, and then go "hey, hexagon
finally has qemu-system emulation now" (ala
https://github.com/quic/toolchain_for_hexagon/commit/8a8923bd6c6a) and so
on, then by the time I come back to other projects a couple releases
later, stuff has bit-rotted behind my back and I have to bisect and
reverse engineer it.

The change under discussion here is a case where the design context
behind this distinction, let alone the decision to change it, takes a
domain expert multiple minutes to unpack for you, and hours if not days
to pick apart yourself. It changes what the design IS. I personally
already _know_ (some of?) the backstory, but I don't expect other people
to, and really don't look forward to having to document it.

>>You are not the first person to use this plumbing. "Everybody _really_ wants
>>what I think it should always have been like, but nobody's mentioned it in the
>>past 20 years" is a strange position to take. Earlier you said "the fact that
>>the desirable path is" as a universal statement rather than a personal opinion.
>>Desirable to who? Judged as "fact" by who?
>
> I meant for container runtimes. Most are quite opinionated about not
> doing mount --move . / && chroot(.), strictly preferring pivot_root
> instead.

Indeed. They want to start with an empty mount tree, and they don't want
to umount all the stuff they inherited. It's an understandable desire,
but repurposing pivot_root for this was not exactly an elegant solution,
and this thread is just one aspect of that.

People get so stuck defending a solution they forget what the problem
was.

Rob