On Wed, 03.11.21 13:52, Chris Murphy (lists@xxxxxxxxxxxxxxxxx) wrote:

> There is a Discoverable Partitions Specification
> http://systemd.io/DISCOVERABLE_PARTITIONS/
>
> The problem with this for Btrfs, ZFS, and LVM is a single volume can
> represent multiple use cases via multiple volumes: subvolumes (btrfs),
> datasets (ZFS), and logical volumes (LVM). I'll just use the term
> sub-volume for all of these, but I'm open to some other generic term.
>
> None of the above volume managers expose the equivalent of GPT's
> partition type GUID per sub-volume.
>
> One possibility that's available right now is the sub-volume's name.
> All we need is a spec for that naming convention.

One of the strengths of the GPT arrangement is that we can very
naturally use the type system to identify what kind of data something
contains, and then use the GPT partition label to say what its name
and version are (and we could encode more if we wanted). We use that
to implement a very simple A/B logic in the image dissection logic of
systemd-gpt-auto-generator, systemd-nspawn, systemd-dissect and so on:
you can have multiple partitions named "foo-0.1", "foo-0.2", "foo-0.3"
and so on, all of the same type 8484680c-9521-48c6-9c11-b0720656f69e
(the type for /usr/ partitions for x86-64), and then we'll
automatically pick the newest version, "foo-0.3".

Hence, as a baseline, any such spec should have similar concepts, and
clearly be able to identify both type *and* name/version, otherwise it
couldn't match the GPT spec feature-wise.

> An early prototype of this idea was posted by Lennart:
> https://0pointer.net/blog/revisiting-how-we-put-together-linux-systems.html

Given that the GPT spec is reality and kinda established (in contrast
to what the blog story describes), I'd really focus on adding a
similar-in-spirit spec that picks up from there, and tries to minimize
conceptual differences.

Note that I'd distance any such spec from btrfs btw. btrfs subvolumes
are in many ways regular directories.
Thus I think the spec should only define how directories are supposed
to be assembled, and if those directories are actually subvolumes,
great, but the spec can be entirely independent of that, i.e. it
should be possible to implement it on ext4 and xfs too.

(I personally think LVM, as an enterprise storage layer, is pretty
uninteresting for any automatic handling like this in systemd though.
If LVM wants automatic assembly they should do things themselves, I
doubt systemd needs to care. Moreover, I have the impression that
people who are into LVM and the pain it brings are probably not the
type of people who like the kind of automatic handling that
systemd-gpt-auto-generator brings. Yes, you might notice, I am not a
fan of LVM. I don't think ZFS is interesting either, i.e. I wouldn't
touch this with a 10m pole, given how unresolved their licensing mess
is. But I'd recommend them to just implement the btrfs subvol ioctls,
so that they could get the hookup for free. I understand their
semantics are similar enough to make this possible.)

I think implementation of a spec like this is not entirely trivial.
The thing is that we can't determine what we need to do just by
looking at the disk. We'd have to look for a specially marked root fs,
and then mount it (which might first involve luks/integrity/… and thus
interactivity), and then look into it, and then mount some dirs it
includes in a new way. This is substantially more complex logic. The
GPT stuff is much simpler: we just look at the disk, figure things
out, and then generate mount units for it. And that's really it.

Anyway, I am not against this, I am mostly just saying that it isn't
as easy as it might look to get this working robustly, i.e. the initrd
probably would have to do things in multiple phases: first mount the
relevant fs to /sysauto/ or so, and then, after looking at this, mount
the right subdirs into /sysroot/ (as we usually do), and only then
transition into it.

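To make the multi-phase ordering above a bit more concrete, here is a
minimal sketch in Python. It only computes an ordered list of steps and
executes nothing. The function name boot_steps, the step names "mount",
"bind" and "switch-root", and the example device and directory names
are all hypothetical; only the /sysauto and /sysroot staging paths come
from the text above. This is not how any existing initrd does it.

```python
from typing import Dict, List, Tuple

# One planned step: (action, source, destination). Purely descriptive.
Step = Tuple[str, str, str]


def depth(path: str) -> int:
    """Number of non-empty path components; "/" has depth 0."""
    return len([c for c in path.split("/") if c])


def boot_steps(device: str, binds: Dict[str, str],
               staging: str = "/sysauto",
               target: str = "/sysroot") -> List[Step]:
    """Plan the phases: mount staging fs, bind dirs, then transition.

    `binds` maps a final mount point (e.g. "/home") to a directory
    inside the staged file system (hypothetical layout).
    """
    # Phase 1: make the file system visible so it can be inspected.
    steps: List[Step] = [("mount", device, staging)]
    # Phase 2: bind the chosen dirs; parents before children.
    for mount_point in sorted(binds, key=lambda p: (depth(p), p)):
        source = staging + "/" + binds[mount_point].strip("/")
        dest = (target + "/" + mount_point.strip("/")).rstrip("/")
        steps.append(("bind", source, dest))
    # Phase 3: only now transition into the assembled tree.
    steps.append(("switch-root", target, "/"))
    return steps
```

The point of the sketch is just the ordering constraint: nothing can be
bound into /sysroot before the containing fs is mounted and inspected,
which is the extra phase the GPT-only logic does not need.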
Anyway, I think a spec like I'd do it today, taking all of the above
into account, would look a bit like this:

1. Define a new GPT type UUID for these specially arranged
   "super-root" file systems (a single one for all archs). (I call
   this "super-root" to make clear that it's not a regular root fs
   but one that potentially contains multiple ones in parallel.)

2. Inside this "super-root" fs, have one top-level dir, maybe called
   "@auto" or something like that. Why do this? Two reasons: so that
   we can recognize an implementation of the spec both on the block
   level (via the GPT type id) and on the fs level (via this
   specially named top-level dir). The latter is interesting for
   potential MBR compat. And the other reason is that if this is used
   on ext4 we don't get confused by lost+found. (Also, people could
   place whatever else they want in the root dir of the fs, for
   example ostree could do its thing in some other subdir of the root
   fs if it wants to.)

3. Inside the "@auto" dir of the "super-root" fs, have dirs named
   <type>[:<namewithversion>]. The type should have a similar
   vocabulary as the GPT spec type UUIDs, but probably use textual
   identifiers rather than UUIDs, simply because naming dirs by UUIDs
   is weird.

Examples:

    /@auto/root-x86-64:fedora_36.0/
    /@auto/root-x86-64:fedora_36.1/
    /@auto/root-x86-64:fedora_37.1/
    /@auto/home/
    /@auto/srv/
    /@auto/tmp/

Which would be assembled by the initrd into the following via bind
mounts:

    /         → /@auto/root-x86-64:fedora_37.1/
    /home/    → /@auto/home/
    /srv/     → /@auto/srv/
    /var/tmp/ → /@auto/tmp/

If we do this, then we should also leave the door open so that maybe
ostree can be hooked up with this, i.e. if we allow the dirs in
/@auto/ to actually be symlinks, then they could put their ostree
checkouts wherever they want and then create a symlink
/@auto/root-x86-64:myostreeos pointing to it, and their image would be
spec conformant: we'd boot into that automatically, and so would
nspawn and similar things.
Thus they could switch the default OS to boot into without patching
kernel cmdlines or such, simply by updating that symlink, and vanilla
systemd would know how to rearrange things.

Lennart

--
Lennart Poettering, Berlin
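The <type>[:<namewithversion>] naming and the resulting bind-mount
assembly proposed in this message can be sketched roughly as follows,
in Python. Everything here is an illustration under assumptions: the
names MOUNT_POINTS, parse_entry, version_key and plan_mounts are
hypothetical, the type-to-mount-point table contains only the four
pairs from the example, and the "newest version" ordering is a simple
natural sort that is not claimed to match what systemd actually uses.
Symlinked entries are not resolved.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical table: type identifier -> mount point. Only the four
# pairs from the example in the message are listed.
MOUNT_POINTS = {
    "root-x86-64": "/",
    "home": "/home/",
    "srv": "/srv/",
    "tmp": "/var/tmp/",
}


def parse_entry(name: str) -> Tuple[str, Optional[str]]:
    """Split '<type>[:<namewithversion>]' into (type, name or None)."""
    type_id, sep, rest = name.partition(":")
    return type_id, (rest if sep else None)


def version_key(name: Optional[str]) -> List[Tuple[int, int, str]]:
    """Natural-order key: digit runs compare numerically (assumption)."""
    if name is None:
        return []  # entries without a name sort lowest
    key = []
    for digits, text in re.findall(r"(\d+)|(\D+)", name):
        key.append((1, int(digits), "") if digits else (0, 0, text))
    return key


def plan_mounts(entries: List[str]) -> Dict[str, str]:
    """Map each mount point to the newest matching dir below /@auto/."""
    best: Dict[str, Tuple[list, str]] = {}
    for entry in entries:
        type_id, name = parse_entry(entry)
        if type_id not in MOUNT_POINTS:
            continue  # unknown types are ignored in this sketch
        key = version_key(name)
        if type_id not in best or key > best[type_id][0]:
            best[type_id] = (key, entry)
    return {MOUNT_POINTS[t]: "/@auto/" + e + "/"
            for t, (_, e) in best.items()}
```

Fed the six example directory names from above, plan_mounts() selects
root-x86-64:fedora_37.1 for / and maps the unversioned home, srv and
tmp dirs to their mount points, mirroring the assembly shown in the
message.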