On Fri, Oct 12, 2018 at 03:44:38PM -0600, Chris Murphy wrote:
> On Fri, Oct 12, 2018 at 4:30 AM, Marek Marczykowski-Górecki
> <marmarek@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> > On Thu, Oct 11, 2018 at 09:24:08PM -0600, Chris Murphy wrote:
> >> Why does efiboot.img have a 32MiB limit?
> >
> > Because "32MB should be enough for everybody"...
> > Long story short, the "El Torito" boot catalog structure has a 16-bit
> > field for the image size (expressed in 512-byte sectors). For details
> > see here:
> > https://wiki.osdev.org/El-Torito
> > https://web.archive.org/web/20180112220141/https://download.intel.com/support/motherboards/desktop/sb/specscdrom.pdf
> > (page 10)
>
> OK. On Fedora 28 media, efiboot.img is ~9.2 MiB and does not contain
> either the kernel or initramfs.

I know - this particular problem was specific to Qubes OS, where the
kernel+initramfs needed to be on the ESP because of a Xen+EFI limitation
(basically, the kernel needs to be loaded through UEFI instead of by grub,
so it needs to live on something that UEFI understands). And actually,
recent Xen versions don't have this limitation anymore (at least in
theory...). This is just a bit of context on how we got here; it is much
less relevant today.

(...)

> > Full story:
> > https://github.com/QubesOS/qubes-issues/issues/794#issuecomment-135988806
> >
> > I've spent a lot of time debugging this, because mkisofs doesn't
> > complain about it, it just silently overflows the higher bits into an
> > adjacent field, which leads to weird results depending on where you
> > boot it. Adding isohybrid to the picture doesn't make it easier (there,
> > the higher bits are truncated, or actually not copied to the MBR
> > partition table, as they weren't part of the original field).
>
> I think we're stuck with isohybrid for a while. Having UEFI and BIOS
> bootloaders, along with isohybrid supporting both as well as Macs, all
> on one media image, that can be burned to optical media and written to
> a USB stick - is hugely beneficial.
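For concreteness, a back-of-the-envelope sketch of the limit and of the
silent overflow (the 40 MiB figure below is just an illustrative oversized
image, not a size from this thread):

```python
SECTOR = 512
FIELD_MAX = 0xFFFF  # the El Torito boot catalog stores the image size
                    # as a 16-bit count of 512-byte virtual sectors

# Largest image the field can describe: just under 32 MiB.
max_image_bytes = FIELD_MAX * SECTOR   # 33553920 bytes

# Past the limit, only the low 16 bits of the sector count survive,
# which is the "silent overflow" mkisofs doesn't warn about.
image_bytes = 40 * 1024 * 1024         # hypothetical 40 MiB efiboot.img
sectors = image_bytes // SECTOR        # 81920 == 0x14000 sectors
stored = sectors & FIELD_MAX           # 0x4000 == 16384 -> bogus size at boot
```

So an oversized image doesn't fail to build; it just boots with a wildly
wrong size recorded in the catalog.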
I have no problem with isohybrid alone. It's a major hack, but definitely
worth it.

> The compose process takes about 12 hours. That's every ISO for all the
> editions, and the spins, and the VM images, for all arches. Even having
> separate UEFI and BIOS images, or splitting out Macs with their own
> image, it'll increase compose times and complexity across the board.

And also complexity for the users - which image to download. I totally
understand why it is beneficial.

(...)

> >> I did give all of these things some thought a long time ago when I ran
> >> into a lorax hack by Will Woods who used Btrfs as the root.img file
> >> system; I'm not sure why it was used. But it gave me the idea of using
> >> a few features built into Btrfs specifically for this use case:
> >>
> >> - the seed/sprout feature can be used with a zram block device for a
> >> volatile overlay, and with a blank partition on the stick for a
> >> persistent overlay. Discovery is part of the btrfs kernel code.
> >>
> >> - Since metadata and data are always checksummed on every read, we
> >> wouldn't have to depend on the slow and transient ISO checksum
> >> (rd.live.check, which uses checkisomd5), which likewise breaks when
> >> creating a stick with livecd-iso-to-disk.
> >>
> >> - Btrfs supports zstd compression. I did some testing and squashfs is
> >> still a bit more efficient because it compresses fs metadata, whereas
> >> Btrfs only compresses data extents.
> >>
> >> The gotcha here is the resulting image isn't going to be bit-for-bit
> >> reproducible: UUIDs and timestamps are strewn throughout the file
> >> system (similar to ext4 and XFS), but any sufficiently complex file
> >> system is going to have this problem.
> >
> > I wouldn't worry about _files_ timestamps that much - in most cases
> > this is a solvable problem with an elaborate enough find+touch[4]. But
> > that's obviously not all; there are various timestamps in the
> > superblock and other metadata.
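A minimal sketch of what such a find+touch pass amounts to (a hypothetical
helper, not actual lorax code): it can clamp file and directory timestamps
before image creation, but by construction it cannot reach superblock or
allocation metadata.

```python
import os
import tempfile

def normalize_timestamps(root, epoch=0):
    """Clamp atime/mtime of every file and directory under `root` to a
    fixed epoch - roughly what an elaborate-enough `find ... -exec touch`
    would do before building the image. Superblock timestamps and
    allocation metadata are untouched, which is exactly the part this
    trick cannot fix."""
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):   # skip symlinks for simplicity
                os.utime(path, (epoch, epoch))
        os.utime(dirpath, (epoch, epoch))

# Illustrative use on a throwaway tree standing in for the image rootfs:
root = tempfile.mkdtemp()
sub = os.path.join(root, "etc")
os.makedirs(sub)
with open(os.path.join(sub, "hostname"), "w") as f:
    f.write("live\n")
normalize_timestamps(root)
```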
> > The most problematic part in "normal" filesystems using the kernel
> > driver is inode allocation, block allocation etc. This greatly
> > depends on timing, ordering, the specific kernel version etc.
> > See [5] for details.

> mkfs.btrfs has --rootdir and --shrink features to pre-allocate a
> volume with files at mkfs time; I have no idea to what degree it
> depends on kernel code.

Probably not at all, given it works as a non-root user too. I've tried to
run it twice on the same directory (and with the same --uuid) on 32MB of
data and got different images (~2000 lines of hexdump diff). Could be some
timestamps, could be something else.

> The main benefit with this is it's really easy
> to implement full checksum matching for metadata and data on every
> read, and user space ends up with EIO instead of corrupt data, and
> super clear kernel complaints. And such corruption, whether on optical
> media or USB sticks, is common. Even in the rarer case of a stick that
> passes the md5 checksum, it can later have transient and silent
> corruption that ends up showing up in weird ways.
>
> It's plausible squashfs could implement this; I think by default it
> already checksums every file to look for duplicates, but it doesn't
> retain the per-file hash for integrity checking later on.

Indeed, it looks that way. I'm able to make a one-byte modification to the
image file resulting in different files (diff -r), but no read error.
I wonder if integrity checking is something on the squashfs roadmap...

> It's also
> possible with dm-verity or dm-integrity, but then that adds back the dm
> complexity.

Oh, please, no...

There are two almost separate aspects here:
- image layout (squashfs+ext4, squashfs alone, squashfs+btrfs)
- how copy-on-write is achieved (dm-snapshot, overlay fs)

For reproducibility, squashfs alone is the best option, but it does not
improve integrity checking (it also doesn't make it worse).
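That one-byte experiment is easy to reproduce in miniature (the byte
buffer below just stands in for a real squashfs file): flipping a single
byte changes the whole-image digest, which is all checkisomd5-style
verification can see and only at explicit check time, whereas a
filesystem with per-extent checksums would turn the same flip into EIO on
every read.

```python
import hashlib

def flip_byte(image: bytes, offset: int) -> bytes:
    """Simulate a one-byte corruption of an image file."""
    data = bytearray(image)
    data[offset] ^= 0xFF
    return bytes(data)

# Stand-in for a squashfs image; a real test would read the file itself.
image = bytes(range(256)) * 16          # 4 KiB of data
corrupt = flip_byte(image, 1234)

# squashfs serves the corrupted data with no read error; a whole-image
# digest at least detects the change when someone bothers to check.
ok = hashlib.sha256(image).hexdigest()
bad = hashlib.sha256(corrupt).hexdigest()
```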
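To make the copy-on-write aspect concrete, here is a hypothetical sketch
(illustrative names and paths, not the actual dracut/initramfs code) of
how an overlay-based setup could pick between a 1-layer and a 2-layer
image at mount time, based on whether a nested rootfs.img is present:

```python
import os
import tempfile

def overlay_mount_cmd(live_dir, upper, work, target):
    """Build an overlayfs mount command for a live system. If the image
    ships a nested rootfs.img (2-layer layout), the lower layer would be
    that image mounted elsewhere; otherwise the squashfs content itself
    is the lower layer (1-layer layout). Paths are illustrative."""
    nested = os.path.join(live_dir, "LiveOS", "rootfs.img")
    if os.path.exists(nested):
        lower = "/run/initramfs/rootfs"   # where rootfs.img would be mounted
    else:
        lower = live_dir                  # squashfs content is the root
    opts = "lowerdir=%s,upperdir=%s,workdir=%s" % (lower, upper, work)
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]

# Illustrative use against a temp directory standing in for the mounted ISO:
iso = tempfile.mkdtemp()
one_layer = overlay_mount_cmd(iso, "/run/overlay/upper",
                              "/run/overlay/work", "/sysroot")

os.makedirs(os.path.join(iso, "LiveOS"))
open(os.path.join(iso, "LiveOS", "rootfs.img"), "w").close()
two_layer = overlay_mount_cmd(iso, "/run/overlay/upper",
                              "/run/overlay/work", "/sysroot")
```

The upper layer can live on zram (volatile) or on a blank partition
(persistent); overlayfs itself doesn't care, which is part of its appeal
over dm-snapshot here.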
For integrity checking, squashfs+btrfs may be better, but it doesn't help
that much with reproducibility. It may even make it worse, because
mkfs.btrfs also produces a non-reproducible result, while make_ext4 (not
to be confused with mkfs.ext4!) is reproducible. Not being packaged for
Fedora is only a small issue here.

As for copy-on-write, dm-snapshot is quite complex to set up and requires
the underlying FS to support writes. It also doesn't allow writing more
data than the original image size (which may be an issue for the
persistent partition case). Overlay fs, on the other hand, works with any
underlying fs, and you can write as much data as you want. And in the case
of a persistent partition, you can access that data even if the base image
(the lower layer) is unavailable/broken. I think the only downside of
overlay fs is that when you modify a large file, it gets copied in full to
the upper layer. But I don't think that's an issue in this use case.

For me, overlay fs is a clear winner here. But as for image layout, it
isn't that simple. For reproducibility, squashfs alone is better. But if
the goal of this change is also to improve read-error detection, then it
isn't that clear anymore. It may be that a simple mkfs.btrfs patch would
make it reproducible, but that isn't obvious to me at this stage. Also,
keeping two layers looks like unnecessary complexity.

What do you think about sidestepping this discussion a little and
replacing dm-snapshot with overlay fs regardless of other changes here?
That should be doable without any change to the image format and would
give more flexibility there. Then, it could even be made to support both
1-layer and 2-layer formats at the same time (depending on rootfs.img
presence) - something that isn't possible with dm-snapshot right now.

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/devel@xxxxxxxxxxxxxxxxxxxxxxx