On Mon, Jul 3, 2023 at 10:03 PM Sweet Tea Dorminy
<sweettea-kernel@xxxxxxxxxx> wrote:
>
>
> >>> I do recall some discussion of making it possible to set an encryption policy on
> >>> an *unencrypted* directory, causing new files in that directory to be encrypted.
> >>> However, I don't recall any discussion of making it possible to add another
> >>> encryption policy to an *already-encrypted* directory. I think this is the
> >>> first time this has been brought up.
> >>
> >> I think I referenced it in the updated design (fifth paragraph of "Extent
> >> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
> >> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
> >> policy to a directory full of unencrypted data. Ya'll's points about the
> >> badness of having mixed unencrypted and encrypted data in a single dir were
> >> compelling. (As I recall it, the issue with having mixed enc/unenc data is
> >> that a bug or attacker could point an encrypted file autostarted in a
> >> container, say /container/my-service, at a unencrypted extent under their
> >> control, say /bin/bash, and thereby acquire a backdoor.)
> >>>
> >>> I think that allowing directories to have multiple encryption policies would
> >>> bring in a lot of complexity. How would it be configured, and what would the
> >>> semantics for accessing it be? Where would the encryption policies be stored?
> >>> What if you have added some of the keys but not all of them? What if some of
> >>> the keys get removed but not all of them?
> >> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
> >> on an encrypted directory under the proper conditions (flags interpreted by
> >> ioctl? check if filesystem has hook?) recursively calls a
> >> filesystem-provided hook on each inode within to change the fscrypt_context.
> >
> > That sounds quite complex. Recursive directory operations aren't really
> > something the kernel does. It would also require updating every inode, causing
> > COW of every inode. Isn't that something you'd really like to avoid, to make
> > starting a new container as fast and lightweight as possible?
>
> A fair point. Can move the penalty to open or write time instead though:
> btrfs could store a generation number with the new context on only the
> directory changed, then leaf inodes or new extent can traverse up the
> directory tree and grab context from the highest-generation-number
> directory in its path to inherit from. Or btrfs could disallow changing
> except on the base of a subvolume, and just go directly to the top of
> the subvolume to grab the appropriate context. Neither of those require
> recursion outside btrfs.
>
> >> On various machines, we currently have a btrfs filesystem containing various
> >> volumes/snapshots containing starting states for containers. The snapshots
> >> are generated by common snapshot images built centrally. The machines, as
> >> the scheduler requests, start and stop containers on those volumes.
> >>
> >> We want to be able to start a container on a snapshot/volume such that every
> >> write to the relevant snapshot/volume is using a per-container key, but we
> >> don't care about reads of the starting snapshot image being encrypted since
> >> the starting snapshot image is widely shared. When the container is stopped,
> >> no future different container (or human or host program) knows its key. This
> >> limits the data which could be lost to a malicious service/human on the host
> >> to only the volumes containing currently running containers.
> >>
> >> Some other folks envision having a base image encrypted with some per-vendor
> >> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
> >> to use for updates and logfiles. When a user is created, a snapshot of a
> >> base homedir forms the base of their user subvolume/directory, which is then
> >> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
> >> like could load their per-user key for their user subvolume/directory.
> >>
> >> Since we don't care about encrypting the common image, we initially
> >> envisioned unencrypted snapshot images where we then turn on encryption and
> >> have mixed unenc/enc data. The other usecase, though, really needs key
> >> change so that everything's encrypted. And the argument that mixed unenc/enc
> >> data is not safe was compelling.
> >>
> >> Hope that helps?
> >
> > Maybe a dumb question: why aren't you just using overlayfs? It's already
> > possible to use overlayfs with an fscrypt-encrypted upperdir and workdir. When
> > creating a new container you can create a new directory and assign it an fscrypt
> > policy (with a per-container or per-user key or whatever that container wants),
> > and create two subdirectories 'upperdir' and 'workdir' in it. Then just mount
> > an overlayfs with that upperdir and workdir, and lowerdir referring to the
> > starting rootfs. Then use that overlayfs as the rootfs as the container.
> >
> > Wouldn't that solve your use case exactly? Is there a reason you really want to
> > create the container directly from a btrfs snapshot instead?
>
> Hardly; a quite intriguing idea. Let me think about this with folks when
> we get back to work on Wednesday. Not sure how it goes with the other
> usecase, the base image/per-machine/per-user combo, but will think about it.

I like creating containers directly based on my host system for
development and destructive purposes. It saves space and is incredibly
useful.

But the layered key encryption thing is also core to the encryption
strategy we want to take in Fedora, so I would really like to see this
be possible with Btrfs encryption. Critically, it means that unlocking
a user subvolume will always be multi-factor: something you have
(machine key) and something you know (user credentials).

--
真実はいつも一つ!/ Always, there's only one truth!
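
For anyone following the thread, here is a toy model of the lookup Sweet Tea describes in the quoted text: a context plus generation number is stored only on the directory that was changed, and a leaf walks up its path and inherits from the highest-generation ancestor. This is only a sketch of that idea in Python. The `Dir` class, `resolve_context`, and the key names are all made up for illustration and do not correspond to any btrfs or fscrypt structure or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dir:
    """Toy stand-in for a directory inode (hypothetical, not a btrfs type)."""
    name: str
    parent: Optional["Dir"] = None
    # Set only on directories whose policy was explicitly changed.
    generation: Optional[int] = None
    context: Optional[str] = None


def resolve_context(start: Dir) -> Optional[str]:
    """Walk from `start` up to the root and return the context stored on
    the directory with the highest generation number along that path."""
    best_gen: Optional[int] = None
    best_ctx: Optional[str] = None
    node: Optional[Dir] = start
    while node is not None:
        if node.generation is not None and (
            best_gen is None or node.generation > best_gen
        ):
            best_gen, best_ctx = node.generation, node.context
        node = node.parent
    return best_ctx


# Assumed example layout: a vendor-keyed base, rekeyed later for one user.
root = Dir("subvol", generation=1, context="vendor-key")
home = Dir("home", parent=root)
alice = Dir("alice", parent=home, generation=7, context="alice-key")
docs = Dir("docs", parent=alice)
logs = Dir("log", parent=root)

print(resolve_context(docs))  # alice-key
print(resolve_context(logs))  # vendor-key
```

Changing the policy touches a single directory record in this model, and the cost of finding the effective context moves to lookup time (one walk up the path). The simpler variant from the quoted text, allowing changes only at the base of a subvolume, would replace the walk with a direct read of the subvolume root.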