Re: Installation image layout

Marek Marczykowski-Górecki <marmarek@xxxxxxxxxxxxxxxxxxxxxx> · Fri, 12 Oct 2018 12:30:14 +0200

On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> On Thu, Oct 11, 2018 at 6:37 PM, Marek Marczykowski-Górecki
> <marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> > Hi all!
> >
> > I'm new on this list. I work on Qubes OS, where Fedora is used as a base
> > distribution.
> >
> > While trying to build the installation image in a reproducible manner[1],
> > I found that the current installation image has an unusual layout. Quoting
> > the dracut.cmdline manual page:
> >
> >        squashfs.img          |  Squashfs from LiveCD .iso downloaded via network
> >           !(mount)
> >           /LiveOS
> >               |- rootfs.img  |  Filesystem image to mount read-only
> >                    !(mount)
> >                    /bin      |  Live filesystem
> >                    /boot     |
> >                    /dev      |
> >                    ...       |
> >
> > This rootfs.img layer makes the image build very much unreproducible.
> > Why is it even there? Bare squashfs.img layer should be enough. Then,
> > mount overlayfs over it (I see there is even some partial support for it
> > in dmsquash-live). Most other Live systems I've seen use just squashfs +
> > overlayfs (or aufs if the kernel is older), so it's a commonly tested
> > configuration. I *guess* it's there for historical reasons, from before
> > aufs/overlayfs were available. Is there any other reason for that?
> 
> I'm pretty sure the original reason was that the default live install used
> dd to block copy the root file system into the fedora-root LV, and
> then resized the LV and ext4 file system.

How is it done now?

> There have also been a
> number of squashfs improvements since that decision so there might
> have been limitations with squashfs that ext4 didn't have (I'm
> thinking xattr were long supported in ext4 before squashfs, and maybe
> capabilities?)
> 
> >
> > If there is no other reason, I propose to drop this and have
> > installer/live filesystem directly in squashfs.img. This has multiple
> > benefits:
> >  - it's much easier to make the image build process reproducible (see
> >    below)
> >  - less complexity, both in the build and in the boot (the whole
> >    dmsquash-live dracut module can be replaced with a <20 line
> >    function[2])
> >  - smaller initramfs (which is extremely important if it needs to be
> >    included in efiboot.img, which can't be larger than 32MB)
> >  - slightly faster boot time (device-mapper is slow)
> >
> > What do you think?
> 
> Whatever we do should take into account the persistent root and
> persistent home use cases, specifically:
> https://github.com/livecd-tools/livecd-tools/blob/master/tools/livecd-iso-to-disk.sh
> 
> --overlay-size-mb
> --home-size-mb
> 
> A particular criticism of the device-mapper solution currently being
> used is in that script: it blows up. Literally it's WORM, and deleting
> files simply dereferences them, it doesn't free up pool space, so it
> is inevitable that the pool will fill up, and when it does it blows up
> the file system, and it can't be repaired. All you can do is reset the
> overlay which means deleting all changes and starting over.
> 
> At least one of our spins, SOAS, depends on livecd-iso-to-disk for
> creating their final installation because it's predicated on running
> Fedora SOAS from a stick.
> 
> Why does efiboot.img have a 32MiB limit?

Because "32MB should be enough for everybody"...
Long story short, the "El Torito" boot catalog structure has a 16-bit field
for the image size (expressed in 512-byte sectors). For details see here:
https://wiki.osdev.org/El-Torito
https://web.archive.org/web/20180112220141/https://download.intel.com/support/motherboards/desktop/sb/specscdrom.pdf
(page 10)
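
To put numbers on that limit: a 16-bit count of 512-byte sectors tops out
just under 32 MiB. A minimal Python sketch of the arithmetic follows; the
names max_image_bytes and fits_el_torito are made up for illustration and
are not part of mkisofs or any other tool mentioned here.

```python
SECTOR_SIZE = 512        # El Torito expresses the image size in 512-byte sectors
SECTOR_COUNT_BITS = 16   # width of the size field in the boot catalog entry


def max_image_bytes() -> int:
    """Largest image size that the 16-bit sector count can describe."""
    return (2**SECTOR_COUNT_BITS - 1) * SECTOR_SIZE


def fits_el_torito(image_bytes: int) -> bool:
    """True if the image's sector count (rounded up) fits in 16 bits."""
    sectors = -(-image_bytes // SECTOR_SIZE)  # ceiling division
    return sectors < 2**SECTOR_COUNT_BITS


print(max_image_bytes())                   # 33553920, i.e. 32 MiB minus one sector
print(fits_el_torito(32 * 1024 * 1024))    # False: exactly 32 MiB is already too big
```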

Full story:
https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806

I've spent a lot of time debugging this, because mkisofs doesn't
complain about it; it just silently overflows the higher bits into the adjacent
field, which gives weird results depending on where you boot it. Adding
isohybrid to the picture doesn't make it easier (there, the higher bits are
truncated, or rather not copied to the MBR partition table, as they weren't
part of the original field).
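
To illustrate why the symptoms are so confusing, here is a sketch of what
is left of the sector count once only the low 16 bits survive (the
truncation case). Which neighbouring field mkisofs spills the high bits
into is deliberately not modelled; stored_sector_count is a hypothetical
helper, not code from either tool.

```python
from typing import Tuple


def stored_sector_count(image_bytes: int) -> Tuple[int, int]:
    """Split the 512-byte sector count into (bits that fit, bits that do not)."""
    sectors = -(-image_bytes // 512)        # ceiling division
    return sectors & 0xFFFF, sectors >> 16  # low 16 bits, overflowing high bits


low, high = stored_sector_count(40 * 1024 * 1024)  # a 40 MiB efiboot.img
print(low, high)              # 16384 1
print(low * 512 // 2**20)     # 8: the firmware is told the image is 8 MiB
```

So a 40 MiB image is either described as 8 MiB, or the stray high bit
lands in whatever field comes next, and nothing warns about it.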

> > As for the reproducibility, I've made changes to lorax (including
> > dropping the rootfs.img layer), anaconda, pungi and createrepo. Together
> > these allow building a bit-for-bit identical image, given the same input (rpm
> > packages, pungi configuration, $SOURCE_DATE_EPOCH variable[3]). Well,
> > almost - there is an issue with efiboot.img, but I already have a
> > solution, I just haven't pushed it yet.
> >
> > You can find all the pull requests collected here:
> > https://github.com/QubesOS/qubes-installer-qubes-os/pull/26
> >
> > I'll keep working to get the changes merged upstream.
> >
> > [1] https://reproducible-builds.org/
> > [2] https://github.com/QubesOS/qubes-installer-qubes-os/pull/26/commits/332be8e1e3e1006013772528078914f491d14c1f
> > [3] https://reproducible-builds.org/specs/source-date-epoch/
> 
> Cool! Well you've already done most of the work and if this has
> support elsewhere already then I'm in favor of continuing in that
> direction.
> 
> I did give all of these things some thought a long time ago when I ran
> into a lorax hack by Will Woods who used Btrfs as the root.img file
> system, I'm not sure why it was used. But it gave me the idea of using
> a few features built into Btrfs specifically for this use case:
> 
> - seed/sprout feature can be used with zram block device for volatile
> overlay; and used with a blank partition on the stick for persistent
> overlay. Discovery is part of the btrfs kernel code.
> 
> - Since metadata and data is always checksummed on every read, we
> wouldn't have to depend on the slow and transient ISO checksum
> (rd.live.check which uses checkisomd5) which likewise breaks when
> creating a stick with livecd-iso-to-disk.
> 
> - Btrfs supports zstd compression. I did some testing and squashfs is
> still a bit more efficient because it compresses fs metadata, whereas
> Btrfs only compresses data extents.
> 
> The gotcha here is the resulting image isn't going to be bit for bit
> reproducible: UUIDs and time stamps are strewn throughout the file
> system (similar to ext4 and XFS), but any sufficiently complex file
> system is going to have this problem.

I wouldn't worry about _file_ timestamps that much - in most cases this is
a problem solvable with an elaborate enough find+touch[4]. But that's obviously
not all: there are various timestamps in the superblock and other
metadata. The most problematic part of "normal" filesystems built with the
kernel driver is inode allocation, block allocation etc. These depend greatly
on timing, ordering, the specific kernel version etc.
See [5] for details.
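
For the file timestamp part, the find+touch idea can be sketched in Python
as below. This is only an illustration of clamping mtimes to a fixed epoch,
not the code from the pull requests; clamp_mtimes is a hypothetical name,
and setting atime along with mtime is my own choice here.

```python
import os


def clamp_mtimes(root: str, epoch: int) -> int:
    """Clamp the mtime of root and everything below it to at most `epoch`.

    Returns the number of entries whose timestamp was changed.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changed = 0
    for path in paths:
        # lstat/follow_symlinks=False: act on the symlink itself, not its target
        if os.lstat(path).st_mtime > epoch:
            os.utime(path, (epoch, epoch), follow_symlinks=False)
            changed += 1
    return changed
```

It would be called with something like
clamp_mtimes("installroot", int(os.environ["SOURCE_DATE_EPOCH"])) just
before the tree is packed. It does nothing about superblock timestamps or
allocation order, which is the harder part.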

> Off hand I'm not sure how
> squashfs would get around it since it's going to draw from an ext4
> source (not sure if the ephemeral root could be tmpfs and use it as
> the source for mksquashfs?)

mksquashfs 5.0-rc1 has support for clamping mtime to the $SOURCE_DATE_EPOCH
variable[3]. And the other metadata is already reproducible in mksquashfs
4.3 (I think files are sorted, or a similar approach is taken).
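
The same two ingredients (fixed ordering plus clamped mtime) are enough to
make a simple archive format reproducible. The sketch below uses Python's
tarfile module instead of mksquashfs, purely to show the principle;
reproducible_tar is a hypothetical helper and says nothing about how
mksquashfs does it internally.

```python
import io
import tarfile
from pathlib import Path


def reproducible_tar(root: str, epoch: int) -> bytes:
    """Pack `root` into a tar whose bytes depend only on names, modes and contents."""
    base = Path(root)
    buf = io.BytesIO()
    # GNU format avoids pax extended headers, which could carry extra metadata
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT) as tar:
        for path in sorted(base.rglob("*")):          # fixed, sorted order
            info = tar.gettarinfo(str(path), arcname=path.relative_to(base).as_posix())
            if info is None:                          # unsupported type, e.g. a socket
                continue
            info.mtime = min(int(info.mtime), epoch)  # clamp, as with SOURCE_DATE_EPOCH
            info.uid = info.gid = 0                   # drop builder-specific ownership
            info.uname = info.gname = ""
            if info.isreg():
                with open(path, "rb") as f:
                    tar.addfile(info, f)
            else:
                tar.addfile(info)
    return buf.getvalue()
```

Two trees with the same names, modes and contents then give identical
bytes, regardless of the order in which the files were created or their
original (newer) timestamps.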

TBH, there is also a tool to build an ext4 filesystem reproducibly, without
using the kernel driver. It's make_ext4 from the OpenWrt project. But I still
think it would be better to drop that layer anyway.

[4] https://reproducible-builds.org/docs/archives/
[5] https://reproducible-builds.org/docs/system-images/

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?