Re: Installation image layout

Chris Murphy <lists@xxxxxxxxxxxxxxxxx> · Thu, 11 Oct 2018 21:24:08 -0600

On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki
<marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> Hi all!
>
> I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> distribution.
>
> While trying to build the installation image in a reproducible manner[1],
> I found that the current installation image has an unusual layout. Quoting
> the dracut.cmdline manual page:
>
>        squashfs.img          |  Squashfs from LiveCD .iso downloaded via network
>           !(mount)
>           /LiveOS
>               |- rootfs.img  |  Filesystem image to mount read-only
>                    !(mount)
>                    /bin      |  Live filesystem
>                    /boot     |
>                    /dev      |
>                    ...       |
>
> This rootfs.img layer makes the image build very much unreproducible.
> Why is it even there? Bare squashfs.img layer should be enough. Then,
> mount overlayfs over it (I see there is even some partial support for it
> in dmsquash-live). Most other Live systems I've seen use just squashfs +
> overlayfs (or aufs if kernel is older), so it's commonly tested
> configuration. I *guess* it's there for historical reason, from before
> aufs/overlayfs being available. Is there any other reason for that?

I'm pretty sure the original reason was that the default live install
used dd to block copy the root file system into the fedora-root LV, and
then resized the LV and the ext4 file system. There have also been a
number of squashfs improvements since that decision, so squashfs may
have had limitations back then that ext4 didn't (I'm thinking xattrs
were supported in ext4 long before squashfs, and maybe capabilities
too?)

>
> If there is no other reason, I propose to drop this and have
> installer/live filesystem directly in squashfs.img. This has multiple
> benefits:
>  - it's much easier to make the image build process reproducible (see
>    below)
>  - less complexity, both in the build and in the boot (the whole
>    dmsquash-live dracut module can be replaced with a <20 line
>    function[2])
>  - smaller initramfs (which is extremely important if it needs to be
>    included in efiboot.img, which can't be larger than 32MB)
>  - slightly faster boot time (device-mapper is slow)
>
> What do you think?

Whatever we do should take into account the persistent root and
persistent home use cases, specifically:
https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso-to-disk.sh

--overlay-size-mb
--home-size-mb

A particular criticism of the device-mapper solution currently used
by that script: it blows up. The overlay is effectively WORM (write
once, read many): deleting files only dereferences them and doesn't
free up pool space, so it is inevitable that the pool will fill up.
When it does, the file system blows up and can't be repaired. All you
can do is reset the overlay, which means deleting all changes and
starting over.
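To make the failure mode concrete, here is a toy model in Python. It is
only a sketch of the behavior described above, not the device-mapper
implementation: the class name, the block accounting, and the rule that
every write consumes fresh pool blocks are all simplifying assumptions.

```python
class WriteOnceOverlay:
    """Toy model of a fixed-size copy-on-write overlay pool.

    Assumptions (not taken from device-mapper itself): every write
    consumes fresh pool blocks, and deleting a file only drops the
    reference without returning any blocks to the pool.
    """

    def __init__(self, pool_blocks):
        self.pool_blocks = pool_blocks
        self.used_blocks = 0
        self.files = {}          # file name -> blocks referenced
        self.overflowed = False

    def write_file(self, name, blocks):
        if self.overflowed:
            raise OSError("overlay is invalid; it can only be reset")
        if self.used_blocks + blocks > self.pool_blocks:
            # Once the pool is exhausted the overlay is unusable.
            self.overflowed = True
            raise OSError("overlay pool exhausted")
        self.used_blocks += blocks
        self.files[name] = blocks

    def delete_file(self, name):
        # The name goes away, but used_blocks is deliberately unchanged.
        del self.files[name]

    def reset(self):
        # The only recovery: throw away every change.
        self.used_blocks = 0
        self.files.clear()
        self.overflowed = False
```

In this model, writing 6 blocks into a 10 block pool, deleting the file,
and writing another 6 blocks still overflows, because the delete freed
nothing.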

At least one of our spins, SOAS, depends on livecd-iso-to-disk for
creating its final installation because it's predicated on running
Fedora SOAS from a stick.

Why does efiboot.img have a 32MiB limit?

> As for the reproducibility, I've made changes to lorax (including
> dropping the rootfs.img layer), anaconda, pungi and createrepo, and all
> of this allows building a bit-for-bit identical image, given the same
> input (rpm packages, pungi configuration, $SOURCE_DATE_EPOCH
> variable[3]). Well, almost - there is an issue with efiboot.img, but I
> already have a solution, I just haven't pushed it yet.
>
> You can find all the pull requests collected here:
> https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
>
> I'll work further to make the changes merged upstream.
>
> [1] https://reproducible-builds.org/
> [2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1e3e1006013772528078914f491d14c1f
> [3] https://reproducible-builds.org/specs/source-date-epoch/

Cool! Well you've already done most of the work and if this has
support elsewhere already then I'm in favor of continuing in that
direction.

I did give all of these things some thought a long time ago when I ran
into a lorax hack by Will Woods, who used Btrfs as the root.img file
system; I'm not sure why it was used. But it gave me the idea of using
a few features built into Btrfs specifically for this use case:

- seed/sprout feature can be used with zram block device for volatile
overlay; and used with a blank partition on the stick for persistent
overlay. Discovery is part of the btrfs kernel code.

- Since metadata and data are always checksummed on every read, we
wouldn't have to depend on the slow and transient ISO checksum
(rd.live.check, which uses checkisomd5), which also breaks when
creating a stick with livecd-iso-to-disk.

- Btrfs supports zstd compression. I did some testing and squashfs is
still a bit more efficient because it compresses fs metadata, whereas
Btrfs only compresses data extents.
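
As a rough illustration of the checksum point in the list above, the
Python sketch below contrasts a one-time whole-image digest with
verification on every block read. It is a toy model only: the 4096 byte
block size and the function names are made up for the example, and
zlib.crc32 stands in for the file system's real checksum (Btrfs defaults
to crc32c).

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size, for illustration only


def build_block_checksums(image):
    """Compute one checksum per block, as a checksumming fs would at write time."""
    return [
        zlib.crc32(image[offset:offset + BLOCK_SIZE])
        for offset in range(0, len(image), BLOCK_SIZE)
    ]


def read_block(image, checksums, index):
    """Return one block, verifying its checksum on every read."""
    block = image[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE]
    if zlib.crc32(block) != checksums[index]:
        raise OSError("checksum mismatch in block %d" % index)
    return block


def whole_image_md5(image):
    """One-shot digest over the whole image: everything must be read up front."""
    return hashlib.md5(image).hexdigest()
```

With per-block checksums, damage is detected when the damaged block is
actually read, and undamaged blocks remain readable; the whole-image
digest can only say that something, somewhere, changed.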

The gotcha here is the resulting image isn't going to be bit for bit
reproducible: UUIDs and time stamps are strewn throughout the file
system (similar to ext4 and XFS), but any sufficiently complex file
system is going to have this problem. Offhand I'm not sure how
squashfs would get around it, since it's going to draw from an ext4
source (not sure if the ephemeral root could be a tmpfs that is then
used as the source for mksquashfs?)
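
For what it's worth, the general technique for archive-like formats is
to normalize everything that varies between builds: sort the entries,
clamp time stamps to $SOURCE_DATE_EPOCH, and fix ownership. The Python
sketch below shows the idea using a tar archive purely as a stand-in; it
says nothing about how mksquashfs or lorax actually do it, and
build_archive is a hypothetical helper.

```python
import io
import tarfile


def build_archive(files, source_date_epoch):
    """Pack files (dict of name -> bytes) into a tar with normalized metadata.

    The tar format is only a stand-in for a file system image here.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):            # fixed entry order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = source_date_epoch    # clamped time stamp
            info.uid = info.gid = 0           # fixed ownership
            info.uname = info.gname = "root"
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()
```

Two calls with the same contents and the same epoch give identical
bytes regardless of the order the inputs were supplied in; in a real
build the epoch would come from the SOURCE_DATE_EPOCH environment
variable.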

-- 
Chris Murphy