Here's a shot at elaborating on the use case:

On various machines, we currently have a btrfs filesystem containing various volumes/snapshots that hold starting states for containers. The snapshots are generated by common snapshot images built centrally. The machines start and stop containers on those volumes as the scheduler requests. We want to be able to start a container on a snapshot/volume such that every write to that snapshot/volume uses a per-container key, but we don't care whether reads of the starting snapshot image are encrypted, since the starting snapshot image is widely shared. When the container is stopped, no future container (or human or host program) knows its key. This limits the data which could be lost to a malicious service/human on the host to only the volumes containing currently running containers.

Some other folks envision having a base image encrypted with some per-vendor key. Then the machine is rekeyed with a per-machine key, perhaps in the TPM, to use for updates and logfiles. When a user is created, a snapshot of a base homedir forms the base of their user subvolume/directory, which is then rekeyed with a per-user key. When the user logs in, systemd-homedir or the like could load their per-user key for their user subvolume/directory.

Since we don't care about encrypting the common image, we initially envisioned unencrypted snapshot images on which we then turn on encryption, giving mixed unencrypted/encrypted data. The other use case, though, really needs key change so that everything is encrypted. And the argument that mixed unencrypted/encrypted data is not safe was compelling.

Hope that helps?

Maybe a dumb question: why aren't you just using overlayfs? It's already possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.
When creating a new container you can create a new directory and assign it an fscrypt policy (with a per-container or per-user key or whatever that container wants), and create two subdirectories 'upperdir' and 'workdir' in it. Then just mount an overlayfs with that upperdir and workdir, and lowerdir referring to the starting rootfs. Then use that overlayfs as the rootfs of the container. Wouldn't that solve your use case exactly? Is there a reason you really want to create the container directly from a btrfs snapshot instead?
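For concreteness, here is a rough Python sketch of the mount step described above. It only builds the `mount` argument list and does not run anything. The paths and the function names (`overlay_dirs`, `overlay_mount_command`) are made up for illustration. Assigning the fscrypt policy is left out because it depends on the tooling in use; it would have to happen while the container directory is still empty, before 'upperdir' and 'workdir' are created.

```python
from pathlib import Path


def overlay_dirs(container_dir):
    """Return the (upperdir, workdir) paths inside the encrypted container directory."""
    base = Path(container_dir)
    return base / "upperdir", base / "workdir"


def overlay_mount_command(lowerdir, container_dir, mountpoint):
    """Build the argv for mounting an overlayfs over the shared starting rootfs.

    lowerdir      -- the shared (unencrypted) starting rootfs, e.g. a snapshot
    container_dir -- directory that already carries the per-container fscrypt policy
    mountpoint    -- where the merged view is mounted, used as the container rootfs
    """
    upper, work = overlay_dirs(container_dir)
    options = f"lowerdir={lowerdir},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, str(mountpoint)]


if __name__ == "__main__":
    # Hypothetical paths; actually running the printed command needs root.
    print(" ".join(overlay_mount_command("/images/base", "/containers/c1", "/run/c1/rootfs")))
```

All writes made through the mountpoint land in 'upperdir', so they inherit the policy of the container directory, while reads of unmodified files come from the shared lowerdir.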
After talking it over, nested containers/subvols don't work easily with this scheme. Right now, one can make arbitrarily nested subvols inside of subvols, so e.g. a container which only sees /subvol can make subvol /subvol/nested without elevated permissions, and a container which only sees /subvol/nested could make yet another nested subvol /subvol/nested/foo/nested2, ad infinitum. There aren't afaik limits on the recursive depth of subvols or containers, or limits on how close they are in the directory tree.
This isn't purely theoretical; I learned today there are a couple of workloads internally which run in a long-lived container on a subvol, and spin up a bunch of short-lived containers on short-lived subvols inside the long-lived container/subvol.
I don't think the overlayfs scheme works with this. From the container's point of view, overlayfs would be presenting a wholly encrypted filesystem (which is what we want). But from inside the container, even if we plumbed through making a new subvol within it, it'd be hard to create a new overlayfs upper directory with a new key for a nested container, if dirs have to have the same key as their parent dir unless the parent is unencrypted. We'd need to allow the parent container to escape into an unencrypted directory to make a new encrypted upperdir for the nested container, which would defeat the point of having the container only able to write to encrypted locations. I can't come up with a way to make the overlayfs scheme work with this, but maybe I don't know overlayfs well enough.
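To make the objection concrete, here is a toy Python model of the constraint as stated above: a directory inherits its parent's key, and a key can only be assigned to an empty directory that is not already encrypted with a different key. This is only a sketch of that rule under those assumptions, not how the kernel implements it; `Dir`, `PolicyError`, and the key names are hypothetical.

```python
class PolicyError(Exception):
    """Raised when a key cannot be assigned to a directory."""


class Dir:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        # A directory created inside an encrypted directory inherits its key.
        self.key = parent.key if parent is not None else None
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def set_key(self, key):
        # Assumed rule: only empty directories can be given a key...
        if self.children:
            raise PolicyError(f"{self.name}: directory not empty")
        # ...and an inherited key cannot be swapped for a different one.
        if self.key is not None and self.key != key:
            raise PolicyError(f"{self.name}: already encrypted with another key")
        self.key = key


if __name__ == "__main__":
    root = Dir("/")                      # unencrypted host directory
    outer = Dir("c1", root)
    outer.set_key("outer-key")           # per-container key for the outer container
    upper = Dir("upperdir", outer)       # everything the outer container writes

    nested = Dir("nested-upperdir", upper)
    try:
        nested.set_key("inner-key")      # new key for the nested container
    except PolicyError as err:
        print("rejected:", err)

    # The only place a fresh key works is under the unencrypted root,
    # which is outside what the outer container is supposed to write to.
    escaped = Dir("c2", root)
    escaped.set_key("inner-key")
    print("allowed outside the container:", escaped.key)
```

In this model the nested upperdir can only get its own key if it is created under the unencrypted root, which is exactly the escape described above.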
A decidedly intriguing idea!

Thanks,

Sweet Tea