Re: the need for a discoverable sub-volumes specification

Lennart Poettering <lennart@xxxxxxxxxxxxxx> · Thu, 4 Nov 2021 14:39:12 +0100

On Mi, 03.11.21 13:52, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:

> There is a Discoverable Partitions Specification
> http://systemd.io/DISCOVERABLE_PARTITIONS/
>
> The problem with this for Btrfs, ZFS, and LVM is a single volume can
> represent multiple use cases via multiple volumes: subvolumes (btrfs),
> datasets (ZFS), and logical volumes (LVM). I'll just use the term
> sub-volume for all of these, but I'm open to some other generic term.
>
> None of the above volume managers expose the equivalent of GPT's
> partition type GUID per sub-volume.
>
> One possibility that's available right now is the sub-volume's name.
> All we need is a spec for that naming convention.

One of the strengths of the GPT arrangement is that we can very
naturally use the type system to identify what kind of data something
contains, and then use the gpt partition label to say what its name
is, and version (and we could encode more if we wanted). We use that
to implement a very simple A/B logic in the image dissection logic of
systemd-gpt-auto-generator, systemd-nspawn, systemd-dissect and so on:
you can have multiple partitions named "foo-0.1", "foo-0.2", "foo-0.3"
and so on, all of the same type 8484680c-9521-48c6-9c11-b0720656f69e
(the type for /usr/ partitions for x86-64), and then we'll
automatically pick the newest version "foo-0.3".
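
To illustrate the selection step, here is a minimal Python sketch of
picking the newest label among partitions of the same type. The
version_key() and pick_newest() helpers are hypothetical, and the
comparison is a simplified one (digit runs compare numerically,
everything else as text); it is not the comparison systemd actually
implements.

```python
import re


def version_key(label):
    """Split a label such as "foo-0.3" into comparable chunks.

    Assumption: digit runs compare numerically and sort after text
    chunks; this is a simplification for illustration only.
    """
    chunks = re.findall(r"[0-9]+|[^0-9]+", label)
    return [
        (1, int(c), "") if "0" <= c[0] <= "9" else (0, 0, c)
        for c in chunks
    ]


def pick_newest(labels):
    """Return the label with the highest version among the candidates."""
    return max(labels, key=version_key)


# Labels of partitions that all share the same type UUID.
labels = ["foo-0.1", "foo-0.3", "foo-0.2"]
newest = pick_newest(labels)
print(newest)
```

Comparing digit runs as numbers matters once versions reach two
digits: a plain string comparison would sort "foo-0.10" before
"foo-0.3".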

Hence, at the baseline any such spec should have similar concepts, and
clearly be able to identify both type *and* name/version, otherwise it
couldn't match the gpt spec feature-wise.

> An early prototype of this idea was posted by Lennart:
> https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

Given that the gpt spec is reality and kinda established (in contrast
to what the blog story describes) I'd really focus on adding a
similar-in-spirit spec that picks up from there, and tries to minimize
conceptual differences.

Note that I'd distance any such spec from btrfs btw. btrfs subvolumes
are in many ways regular directories. Thus I think the spec should
only define how directories are supposed to be assembled, and if those
directories are actually subvolumes great, but the spec can be
entirely independent of that, i.e. it should be possible to implement
it on ext4 and xfs too.

(I personally think LVM — as an enterprise storage layer — is pretty
uninteresting for any automatic handling like this in systemd
though. If LVM wants automatic assembly they should do things
themselves, I doubt systemd needs to care. Moreover, I have the
impression that people who are into LVM and the pain it brings are
probably not the type of people who like automatic handling of the kind
systemd-gpt-auto-generator brings. Yes, you might notice, I am
not a fan of LVM. I don't think ZFS is interesting either, i.e. I
wouldn't touch this with a 10m pole, given how unresolved their
licensing mess is. But I'd recommend that they just implement the btrfs
subvol ioctls, so that they could get the hookup for free. I
understand their semantics are similar enough to make this possible.)

I think implementation of a spec like this is not entirely
trivial. The thing is that we can't determine what we need to do just
by looking at the disk. We'd have to look for a specially marked root
fs, and then mount it (which might first involve luks/integrity/… and
thus interactivity), and then look into it, and then mount some dirs
it includes in a new way. This is a substantially more complex logic —
the GPT stuff is much simpler: we just look at the disk, figure things
out, and then generate mount units for it. And that's really it.

Anyway, I am not against this, I am mostly just saying that it isn't
as easy as it might look to get this working robustly, i.e. the initrd
probably would have to do things in multiple phases: first mount the
relevant fs to /sysauto/ or so, and then after looking at this mount
the right subdirs into /sysroot/ (as we usually do) and only then
transition into it.

Anyway, I think a spec as I'd do it today, taking all of the above
into account, would look a bit like this:

1. define a new gpt type uuid for these specially arranged "super-root" file
   systems (a single one for all archs). (I call this "super-root" to
   make clear that it's not a regular root fs but one that
   potentially contains multiple root file systems in parallel)

2. inside this "super-root" fs, have one top-level dir, maybe called
   "@auto" or something like that. Why do this? two reasons: so that
   we can recognize an implementation of the spec both on the block
   level (via the gpt type id) and on the fs level (via this specially
   named top-level dir). The latter is interesting for potential MBR
   compat. And the other reason is if this is used on ext4 we don't
   get confused by lost+found. (also people could place whatever else
   they want in the root dir of the fs, for example ostree could do
   its thing in some other subdir of the root fs if it wants to)

3. Inside the "@auto" dir of the "super-root" fs, have dirs named
   <type>[:<namewithversion>]. The type should have a similar vocabulary
   as the GPT spec type UUIDs, but probably use textual identifiers
   rather than UUIDs, simply because naming dirs by uuids is
   weird. Examples:

   /@auto/root-x86-64:fedora_36.0/
   /@auto/root-x86-64:fedora_36.1/
   /@auto/root-x86-64:fedora_37.1/
   /@auto/home/
   /@auto/srv/
   /@auto/tmp/

   Which would be assembled by the initrd into the following via bind
   mounts:

   /         → /@auto/root-x86-64:fedora_37.1/
   /home/    → /@auto/home/
   /srv/     → /@auto/srv/
   /var/tmp/ → /@auto/tmp/
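
The assembly above could be sketched roughly like this in Python. It
only computes the bind mount plan from a directory listing and does not
mount anything. The MOUNT_POINTS table, the helper names and the
simplified version comparison are all assumptions made for the sake of
the example; none of them are part of any existing spec or systemd API.

```python
import re

# Hypothetical table mapping type identifiers to mount points. Only the
# types from the example listing are included.
MOUNT_POINTS = {
    "root-x86-64": "/",
    "home": "/home/",
    "srv": "/srv/",
    "tmp": "/var/tmp/",
}


def parse_entry(name):
    """Split "<type>[:<namewithversion>]" into (type, namewithversion)."""
    type_, _, label = name.partition(":")
    return type_, label


def version_key(label):
    """Simplified ordering: digit runs compare numerically (assumption)."""
    chunks = re.findall(r"[0-9]+|[^0-9]+", label)
    return [
        (1, int(c), "") if "0" <= c[0] <= "9" else (0, 0, c)
        for c in chunks
    ]


def plan_mounts(entries, auto_dir="/@auto"):
    """Map each mount point to the newest matching dir below auto_dir."""
    best = {}
    for name in entries:
        type_, label = parse_entry(name)
        if type_ not in MOUNT_POINTS:
            continue  # unknown type: ignore rather than fail
        current = best.get(type_)
        if current is None or version_key(label) > version_key(current[0]):
            best[type_] = (label, name)
    return {
        MOUNT_POINTS[type_]: f"{auto_dir}/{name}/"
        for type_, (_, name) in best.items()
    }


entries = [
    "root-x86-64:fedora_36.0",
    "root-x86-64:fedora_36.1",
    "root-x86-64:fedora_37.1",
    "home",
    "srv",
    "tmp",
]
plan = plan_mounts(entries)
for where, what in sorted(plan.items()):
    print(f"{where:10} -> {what}")
```

Entries with an unrecognized type are skipped here, so that other
things placed in the dir would not break assembly; whether a real
implementation should ignore or reject them is an open design question.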

If we do this, then we should also leave the door open so that maybe
ostree can be hooked up with this, i.e. if we allow the dirs in
/@auto/ to actually be symlinks, then they could put their ostree
checkouts wherever they want and then create a symlink
/@auto/root-x86-64:myostreeos pointing to it, and their image would be
spec conformant: we'd boot into that automatically, and so would
nspawn and similar things. Thus they could switch their default OS to
boot into without patching kernel cmdlines or such, simply by updating
that symlink, and vanilla systemd would know how to rearrange things.
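
Updating such a symlink can be done atomically by creating the new link
under a temporary name and renaming it over the old one. A minimal
Python sketch, run against a scratch directory: the switch_default()
helper, the temporary name and the "ostree/deploy-a" and
"ostree/deploy-b" paths are hypothetical and only stand in for wherever
the checkouts would really live.

```python
import os
import tempfile


def switch_default(auto_dir, link_name, target):
    """Point auto_dir/link_name at target, replacing any old link atomically."""
    final = os.path.join(auto_dir, link_name)
    # Hypothetical temporary name; a real implementation would need to
    # make sure scanners of the dir ignore it.
    tmp = os.path.join(auto_dir, "." + link_name + ".tmp")
    if os.path.lexists(tmp):
        os.unlink(tmp)
    os.symlink(target, tmp)
    # rename(2) replaces the old symlink itself in a single step, so a
    # reader sees either the old or the new target, never a missing link.
    os.replace(tmp, final)


with tempfile.TemporaryDirectory() as super_root:
    auto = os.path.join(super_root, "@auto")
    os.mkdir(auto)
    os.makedirs(os.path.join(super_root, "ostree", "deploy-a"))
    os.makedirs(os.path.join(super_root, "ostree", "deploy-b"))

    link = "root-x86-64:myostreeos"
    switch_default(auto, link, "../ostree/deploy-a")
    switch_default(auto, link, "../ostree/deploy-b")

    current = os.readlink(os.path.join(auto, link))
    leftovers = sorted(os.listdir(auto))

print(current)
```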

Lennart

--
Lennart Poettering, Berlin