Re: [RFC] initoverlayfs - a scalable initial filesystem

Eric Curtin <ecurtin@xxxxxxxxxx> · Tue, 12 Dec 2023 20:48:49 +0000

On Tue, 12 Dec 2023 at 20:35, Nils Kattenbeck <nilskemail@xxxxxxxxx> wrote:
>
> Hi, while I have been following this thread passively for now I also
> wanted to chime in.
>
> > (The main reason why sd-stub doesn't actually support erofs-initrds,
> > is that sd-stub also generates initrd cpios on the fly, to pass
> > credentials and system extension images to the kernel, and you can't
> > really mix erofs and cpio initrds into one)
>
> What prevents one from mixing the two (especially given that the
> hypothetical erofs initrd support does not yet exist)?
> Or are you talking about mixing this with your memmap+root=/dev/pmem suggestion?
>
> > The try to optimize the initrd a bit by making it an erofs/memmap
> > thing and so on. And make sure the initrd only contains stuff you
> > always need, so that reading it all into memory is necessary anyway,
> > and hence any approach that tries to run even the initrd off a disk
> > image won't be necessary becuase you need to read everything anyway.
>
> Having to ensure that the initrd is as small as possible is definitely
> no easy task.
> Furthermore unless one has total control over the devices, or even if
> there are only a few hardware revisions, parts of the initrd might not
> be used.
> Even if everything is the same there are codes paths which might not
> be taken during usual operation. An example would be services similar
> to the new systemd-bsod which are only triggered in emergencies.
> Having these in the cpio means that they will always be read and
> decompressed.
> Using sysexts also has the drawback that each and every one of them
> has to be decompressed. I might be mistaken but I expect that this
> will be the case even if the extension-release in the sysext results
> in it being discarded which is obviously another big drawback.
>
> Regardless, even if every single file within the cpio archive (and
> potential sysexts) is used, erofs still has a distinct advantage over
> cpio!
> With cpio everything has to be decompressed and read up front. With
> erofs this is not the case.
> Only the fs header has to be read at first as files are decompressed on demand.
> This means that critical stuff can be started earlier as it does not
> have to wait for decompression of stuff only needed later on.
> For example an initrd-only (i.e. not pivolint root), graphical system
> could start all background services long before the UI starts and
> accesses large asset files.
>
> I agree that this splitting up into another micro-initrd just for some
> storage stuff etc (which I still have not groked completely) does not
> seem to offer any advantages to what we have today. *However*, I
> certainly think that standardizing and supporting some kind of erofs
> based initrd would gain some advantages.

Are we sure? A bunch of stuff in modern initrd's today have nothing to
do with mounting storage. I've proved there's benefit to that with the
data on the initoverlayfs page, you save ~300ms on systemd start time
on a Raspberry Pi 4 with an sd card, if you use an NVMe drive over USB
on a Raspberry Pi 4 it's even more... ~500ms. I wouldn't say that's
insignificant. You still get all the functionality of the fully
fledged initramfs when systemd starts but you save between 300ms and
500ms.

>
> On the other hand this feels like going back to an old ramdisk again.
> This goes beyond my knowledge but based on the kernel docs most
> drawbacks of ramdisks would not apply to an approach with erofs. Also
> maybe the more flexible loopback devices could be used(?) which might
> alleviate some problems.

For the record, this is what we are doing for initoverlayfs at the
moment, mounting "/boot" partition and then loopback. There are
significant advantages as there are few bytes read until you start
using initoverlayfs.

/boot/initramfs-6.5.12-200.fc38.x86_64.img
/boot/initoverlayfs-6.5.12-200.fc38.x86_64.img

>
> -- This block device was of fixed size, so the filesystem mounted on
> it was of fixed size.
>    -> Should not be of concern as it is readonly anyhow.
> -- Using a ram disk also required unnecessarily copying memory from
> the fake block device into the page cache (and copying changes back
> out), as well as creating and destroying dentries.
>    -> (?) This one I am actually not too sure about and supersedes my
> knowledge on tmpfs, vfs (and its cache layers), erofs caching, and
> loopback devices).
> -- Plus it needed a filesystem driver (such as ext2) to format and
> interpret this data.
>    -> erofs is already included in most initrds (and is not too big if
> it is not)
>
> Regards, Nils
>