Re: Immutable Images: Single Data Partition

Lennart Poettering <lennart@xxxxxxxxxxxxxx> · Mon, 27 Feb 2023 17:16:48 +0100

On Tue, 21.02.23 16:50, Adrian Vovk (adrianvovk@xxxxxxxxx) wrote:

> Part of the A/B approach involves two classes of user data partitions:
> ones that are encrypted (/var, /etc) and ones that are not (/home).
> I'll be adding an additional partition to the non-encrypted category,
> as part of my efforts to let Flatpak share the most common runtimes
> across installations [1]. Note that I'm only talking about the data
> partitions; the rootfs and/or usr partitions will be fixed size (~3gb
> each in my case, which should be ample headroom)

My thinking was that there would be basically five partitions for a
general purpose desktop OS:

1. ESP
2. /usr A
3. /usr B
4. root fs with /etc and /var, encrypted, TPM-bound
5. /home/ with dm-integrity or OPAL for trust, TPM-bound, with homed
   managed homedirs inside that do encryption

With that you'd have to figure out three sizes, i.e. how to size 2/3,
how to size 4 and how to size 5.

> Here's the issue: as a distributor I really have no idea how much
> space each of these partitions will need. /var can either be tiny
> (just storing some logs, some cache, other things like that) or it can
> be many GB (storing A/B versioned container images, sysexts, portable
> services, a system-wide Flatpak installation, etc). My Flatpak shared
> installations directory can be < 1GB (if the user uses few apps, or
> only GNOME apps, for instance) or much larger (if the user has some
> out-of-date apps, mixes GNOME/KDE apps, etc). The /home directory
> should probably be the biggest it could possibly be to maximize space
> for user files. I could pick some percentages, of course, but that
> would be a compromise that I don't think is strictly necessary and
> will leave most people unhappy.
>
> Letting the user choose a size themselves is also impractical. First
> of all, I don't think the user would necessarily know any better than
> I would how big they need these partitions to be: I've personally been
> bitten by running out of space in / with plenty of unused space in
> /home on Fedora [3]. In the event that the user guesses the wrong
> number, what then? Or maybe they guess the right number but then years
> later their requirements change: they find a new container-intensive
> hobby, or change employers, or any number of other things. I don't
> think it's a good answer to tell them to just reinstall; that is very
> inconvenient and not something they've ever had to do with the other
> OSs. It's a difficult solution on a technical level too. If I prompt
> the user at install-time, then devices with the OS pre-installed won't
> have the prompt & nor would a factory reset. I can ask on first boot,
> when the partition is being created, but then I'm going to have to do
> that from the initramfs (CLI? Plymouth? Put a GNOME session into the
> initrd?). Overall, I don't think this is a viable solution.
>
> I think this can be solved by having one user state partition w/ an
> encrypted loopback file in it. As far as I can tell this is the
> solution taken by ChromeOS [2]. There could then be a userspace
> service that dynamically resizes the file up and down as-needed and as
> space allows. I propose that we implement something similar in
> systemd. Here's what I think is necessary:

Hmm, interesting. So that would basically mean using a file system as
a volume manager. In a way that's what homed does anyway, I guess.

So this reminds me of something else that superficially sounds very
much unrelated but I think actually is the same issue:

Specifically, there's a TODO list item that was discussed multiple
times here for a spec on how to arrange subvolumes on btrfs file systems,
i.e. defining a directory hierarchy that will be processed
automatically at boot by systemd and rearranged according to specific
rules: i.e. if a specially marked fs has a subvolume "@var" or so we'd
automatically mount it to /var. And if it has a subvolume
@usr/x86-64/0.7, then we'd mount it to /usr/ if x86-64 is our local
arch, and 0.7 would be the newest version we find therein. I think a
lot of people agree that we want this, but no one has sat down so far to
come up with a comprehensive spec.
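
A minimal sketch of the selection rule described above, assuming the
entries are already listed as plain strings and that versions are
dot-separated integers. The function names and the comparison rule are
hypothetical, since no spec exists yet; a real implementation would
need a properly defined version ordering.

```python
import re

def version_key(version):
    """Split '0.10' into (0, 10) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))

def pick_usr_subvolume(entries, arch):
    """Return the newest '@usr/<arch>/<version>' entry for arch, or None.

    The '@usr/<arch>/<version>' naming follows the example in the mail;
    everything else here is an assumption made for illustration.
    """
    pattern = re.compile(r"^@usr/([^/]+)/(\d+(?:\.\d+)*)$")
    candidates = []
    for entry in entries:
        match = pattern.match(entry)
        if match and match.group(1) == arch:
            candidates.append((version_key(match.group(2)), entry))
    if not candidates:
        return None
    # max() compares the numeric version tuples first.
    return max(candidates)[1]

entries = ["@var", "@usr/x86-64/0.7", "@usr/x86-64/0.10", "@usr/arm64/0.9"]
chosen = pick_usr_subvolume(entries, "x86-64")
```

Comparing numeric tuples rather than strings is what makes 0.10 sort
after 0.7 here.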

When this came up before, I already mentioned that we should not make
that spec btrfs-specific, but just say "directories" rather than
"subvolumes", since from a high-level perspective this is all we need.

But now coming back to this discussion, maybe we should take this one
level further: if "@var" is not a directory/subvol, but a regular file,
we could just mount it as loopback file.

And suddenly we'd have a spec that would be particularly powerful and
generic: you could use it for subvols, for dirs, or for loopback
files, and mix and match freely, and it would always behave somewhat
the same way.
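
As a rough illustration of that dispatch, assuming the only input is
the path of an entry such as "@var": the function name and the
"bind"/"loop"/"skip" labels are made up for this sketch and are not
part of any existing systemd interface.

```python
import os
import stat

def mount_strategy(path):
    """Decide how an entry such as '@var' would be attached.

    Hypothetical helper: it only classifies the entry and performs no
    mount itself.
    """
    mode = os.stat(path).st_mode
    if stat.S_ISDIR(mode):
        # A plain directory and a btrfs subvolume both show up as directories.
        return "bind"
    if stat.S_ISREG(mode):
        # A regular file would be treated as a disk image and loop-mounted.
        return "loop"
    # Anything else (device node, socket, ...) is ignored in this sketch.
    return "skip"
```

Since a subvolume and a directory look the same to stat(), the same
code path would cover both, which is the point of not making the spec
btrfs-specific.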

This would solve your problem and at the same time the btrfs
subvolume spec problem. I kinda love the idea.

> - Assign a new partition type UUID. I don't think it needs to be done per-arch
> - Have gpt-auto-generator automatically mount partitions w/ this UUID
> into /state
> - Once /state is mounted, have gpt-auto-generator automatically mount
> /state/encrypted.img onto /state/encrypted (decrypting it via TPM)
> - Expand systemd-repart to handle creating this /state partition and
> the /state/encrypted.img blob inside of it (a unit that runs
> systemd-repart to create the encrypted.img blob could probably
> suffice? Or mkfs/repart/whatever automatically creates
> /state/encrypted.img as-needed so that it works with --root or
> --image).
> - Implement a userspace service that grows/shrinks the
> /state/encrypted.img file as needed. We could probably reuse parts of
> homework here??

Sounds more or less OK.
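
To make the grow/shrink service a bit more concrete, here is a sketch
of just the sizing decision. The thresholds, the step size and the
function name are all hypothetical; an actual service would also have
to resize the encrypted volume and the file system inside the image,
which this does not attempt.

```python
GIB = 1 << 30

def next_image_size(image_size, image_used, host_free,
                    step=GIB, low_water=0.10, high_water=0.50):
    """Return the size in bytes the loopback image should have next.

    image_size: current size of the image file
    image_used: bytes in use by the file system inside the image
    host_free:  free bytes on the file system that holds the image
    All policy values are assumptions chosen for illustration.
    """
    image_free = image_size - image_used
    # Nearly full inside, and the host can spare the space: grow by one step.
    if image_free < low_water * image_size and host_free >= 2 * step:
        return image_size + step
    # Mostly empty inside, and shrinking would not cut into used data.
    if image_free > high_water * image_size and image_size - step > image_used:
        return image_size - step
    return image_size
```

Keeping a gap between the two thresholds avoids growing and shrinking
back and forth when usage hovers around one value.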

BTW, we recently discussed another approach (and added it to the TODO
list, actually). Android ran into a similar problem, so they just went
the other way and decided to use "dm-linear" to be able to extend
partitions. i.e. use the DM layer as volume manager, like the gods
intended, but in a very minimal fashion. We were thinking of doing
this in the dissection/repart logic. Specifically, we'd have a
new gpt part type called "extension partition", which would be able to
extend a specific partition earlier in the partition table. The gpt
partition uuid would be a counter-mode hash of the first partition in
the chain. When activating that we'd simply combine them all with
dm-linear. Thus, if we want to extend a partition that we cannot just
grow because there's another partition right behind it, we'd instead
create a new "extension" partition at the end of the disk, and then
chain them up.
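
A sketch of how such a chain could be put together, under stated
assumptions: the exact hash construction is not defined above, so
extension_uuid() below is a hypothetical stand-in (SHA-256 over the
first partition's UUID plus a counter), and the device names are
placeholders. The table lines follow the dm-linear format
"<start> <length> linear <device> <offset>", with all values in
512-byte sectors.

```python
import hashlib
import uuid

def extension_uuid(first_uuid, counter):
    """Derive the UUID of the Nth extension partition (hypothetical scheme)."""
    data = first_uuid.bytes + counter.to_bytes(8, "little")
    return uuid.UUID(bytes=hashlib.sha256(data).digest()[:16])

def dm_linear_table(segments):
    """Build a dm-linear table from (device, sectors) pairs, in chain order."""
    lines = []
    start = 0
    for device, sectors in segments:
        # Each segment maps its whole partition, starting at offset 0.
        lines.append(f"{start} {sectors} linear {device} 0")
        start += sectors
    return "\n".join(lines)

first = uuid.UUID("00000000-0000-4000-8000-000000000001")
chain_ids = [extension_uuid(first, n) for n in range(1, 3)]
table = dm_linear_table([("/dev/sda4", 1000), ("/dev/sda6", 500)])
```

Because each extension's UUID can be recomputed from the first
partition's UUID and a counter, the chain could be discovered without
storing any extra metadata.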

Interesting that ChromeOS and Android came to different solutions
there.

Lennart

--
Lennart Poettering, Berlin