I do recall some discussion of making it possible to set an encryption policy on an *unencrypted* directory, causing new files in that directory to be encrypted. However, I don't recall any discussion of making it possible to add another encryption policy to an *already-encrypted* directory. I think this is the first time this has been brought up.
I think I referenced it in the updated design (fifth paragraph of "Extent encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing) but I didn't talk about it enough -- 'rekeying' is a substitute for adding a policy to a directory full of unencrypted data. Y'all's points about the badness of having mixed unencrypted and encrypted data in a single dir were compelling. (As I recall it, the issue with having mixed enc/unenc data is that a bug or attacker could point an encrypted file autostarted in a container, say /container/my-service, at an unencrypted extent under their control, say /bin/bash, and thereby acquire a backdoor.)
I think that allowing directories to have multiple encryption policies would bring in a lot of complexity. How would it be configured, and what would the semantics for accessing it be? Where would the encryption policies be stored? What if you have added some of the keys but not all of them? What if some of the keys get removed but not all of them?
I'd planned to use add_enckey to add all the necessary keys; then set_encpolicy on an encrypted directory, under the proper conditions (flags interpreted by ioctl? check if filesystem has hook?), would recursively call a filesystem-provided hook on each inode within to change the fscrypt_context.
That sounds quite complex. Recursive directory operations aren't really something the kernel does. It would also require updating every inode, causing COW of every inode. Isn't that something you'd really like to avoid, to make starting a new container as fast and lightweight as possible?
A fair point. The penalty can be moved to open or write time instead, though: btrfs could store a generation number with the new context on only the directory that was changed; leaf inodes or new extents can then traverse up the directory tree and inherit the context from the highest-generation-number directory in their path. Or btrfs could disallow changing the policy except on the base of a subvolume, and just go directly to the top of the subvolume to grab the appropriate context. Neither of those requires recursion outside btrfs.
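To make the generation-number idea concrete, here is a minimal sketch in Python of the lookup described above. It is only a toy model, not btrfs or fscrypt code: the `Node` class, its fields, the key names, and `effective_context` are all hypothetical, and the sketch assumes a context is recorded only on the directory that was changed directly.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for an inode; not a real btrfs structure."""
    name: str
    parent: Optional["Node"] = None
    # Recorded only on a directory whose policy was changed directly.
    context: Optional[str] = None
    generation: int = 0


def effective_context(node: Node) -> Optional[str]:
    """Walk from node up to the root and return the context stored with
    the highest generation number, or None if no ancestor has one."""
    best: Optional[Node] = None
    cur: Optional[Node] = node
    while cur is not None:
        if cur.context is not None and (
            best is None or cur.generation > best.generation
        ):
            best = cur
        cur = cur.parent
    return best.context if best is not None else None


# Hypothetical tree: a subvolume root, a user directory rekeyed later,
# and leaves that carry no context of their own.
root = Node("subvol", context="vendor-key", generation=5)
home = Node("home", parent=root)
user = Node("alice", parent=home, context="alice-key", generation=9)
doc = Node("notes.txt", parent=user)
log = Node("log", parent=root)
```

In this model, changing the policy touches one directory only; the cost shows up as an upward walk when a leaf is opened or written. The second variant (policy changes allowed only at the base of a subvolume) would replace the walk with a single lookup at the subvolume root.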
On various machines, we currently have a btrfs filesystem containing various volumes/snapshots containing starting states for containers. The snapshots are generated by common snapshot images built centrally. The machines, as the scheduler requests, start and stop containers on those volumes. We want to be able to start a container on a snapshot/volume such that every write to the relevant snapshot/volume is using a per-container key, but we don't care about reads of the starting snapshot image being encrypted since the starting snapshot image is widely shared. When the container is stopped, no future different container (or human or host program) knows its key. This limits the data which could be lost to a malicious service/human on the host to only the volumes containing currently running containers.
Some other folks envision having a base image encrypted with some per-vendor key. Then the machine is rekeyed with a per-machine key, perhaps held in the TPM, to use for updates and logfiles. When a user is created, a snapshot of a base homedir forms the base of their user subvolume/directory, which is then rekeyed with a per-user key. When the user logs in, systemd-homedir or the like could load their per-user key for their user subvolume/directory.
Since we don't care about encrypting the common image, we initially envisioned unencrypted snapshot images where we then turn on encryption and have mixed unenc/enc data. The other usecase, though, really needs key change so that everything's encrypted. And the argument that mixed unenc/enc data is not safe was compelling. Hope that helps?
Maybe a dumb question: why aren't you just using overlayfs? It's already possible to use overlayfs with an fscrypt-encrypted upperdir and workdir. When creating a new container you can create a new directory and assign it an fscrypt policy (with a per-container or per-user key or whatever that container wants), and create two subdirectories 'upperdir' and 'workdir' in it. Then just mount an overlayfs with that upperdir and workdir, and lowerdir referring to the starting rootfs. Then use that overlayfs as the rootfs of the container. Wouldn't that solve your use case exactly? Is there a reason you really want to create the container directly from a btrfs snapshot instead?
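For illustration, the overlayfs layout suggested above could be assembled roughly as follows. This is a hedged sketch: the helper name `overlay_mount_args` and all paths are made up, the helper only builds the `mount` argument list and does not run it, and the step that assigns the fscrypt policy to the new container directory (before 'upperdir' and 'workdir' are created in it) is left out.

```python
import posixpath
from typing import List


def overlay_mount_args(lowerdir: str, container_dir: str,
                       mountpoint: str) -> List[str]:
    """Build (but do not run) a mount command for an overlayfs whose
    upperdir and workdir live inside container_dir.

    Assumption: container_dir already carries the per-container fscrypt
    policy, so files created under upperdir/workdir inherit it.
    """
    upper = posixpath.join(container_dir, "upperdir")
    work = posixpath.join(container_dir, "workdir")
    opts = f"lowerdir={lowerdir},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, mountpoint]


# Hypothetical paths: shared starting rootfs, per-container directory,
# and the location the container will use as its rootfs.
args = overlay_mount_args("/images/base", "/containers/c1", "/run/c1/rootfs")
```

With this layout, reads of unmodified files come from the shared lowerdir, while everything the container writes lands in the encrypted upperdir.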
Hardly; a quite intriguing idea. Let me think about this with folks when we get back to work on Wednesday. Not sure how it goes with the other usecase, the base image/per-machine/per-user combo, but will think about it.