On Wed, Jul 05, 2023 at 08:13:34AM -0400, Neal Gompa wrote:
> On Mon, Jul 3, 2023 at 10:03 PM Sweet Tea Dorminy
> <sweettea-kernel@xxxxxxxxxx> wrote:
> >
> >
> > >>> I do recall some discussion of making it possible to set an encryption policy on
> > >>> an *unencrypted* directory, causing new files in that directory to be encrypted.
> > >>> However, I don't recall any discussion of making it possible to add another
> > >>> encryption policy to an *already-encrypted* directory. I think this is the
> > >>> first time this has been brought up.
> > >>
> > >> I think I referenced it in the updated design (fifth paragraph of "Extent
> > >> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
> > >> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
> > >> policy to a directory full of unencrypted data. Ya'll's points about the
> > >> badness of having mixed unencrypted and encrypted data in a single dir were
> > >> compelling. (As I recall it, the issue with having mixed enc/unenc data is
> > >> that a bug or attacker could point an encrypted file autostarted in a
> > >> container, say /container/my-service, at a unencrypted extent under their
> > >> control, say /bin/bash, and thereby acquire a backdoor.)
> > >>>
> > >>> I think that allowing directories to have multiple encryption policies would
> > >>> bring in a lot of complexity. How would it be configured, and what would the
> > >>> semantics for accessing it be? Where would the encryption policies be stored?
> > >>> What if you have added some of the keys but not all of them? What if some of
> > >>> the keys get removed but not all of them?
> > >> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
> > >> on an encrypted directory under the proper conditions (flags interpreted by
> > >> ioctl? check if filesystem has hook?)
> > >> recursively calls a
> > >> filesystem-provided hook on each inode within to change the fscrypt_context.
> > >
> > > That sounds quite complex. Recursive directory operations aren't really
> > > something the kernel does. It would also require updating every inode, causing
> > > COW of every inode. Isn't that something you'd really like to avoid, to make
> > > starting a new container as fast and lightweight as possible?
> >
> > A fair point. Can move the penalty to open or write time instead though:
> > btrfs could store a generation number with the new context on only the
> > directory changed, then leaf inodes or new extent can traverse up the
> > directory tree and grab context from the highest-generation-number
> > directory in its path to inherit from. Or btrfs could disallow changing
> > except on the base of a subvolume, and just go directly to the top of
> > the subvolume to grab the appropriate context. Neither of those require
> > recursion outside btrfs.
> >
> > >> On various machines, we currently have a btrfs filesystem containing various
> > >> volumes/snapshots containing starting states for containers. The snapshots
> > >> are generated by common snapshot images built centrally. The machines, as
> > >> the scheduler requests, start and stop containers on those volumes.
> > >>
> > >> We want to be able to start a container on a snapshot/volume such that every
> > >> write to the relevant snapshot/volume is using a per-container key, but we
> > >> don't care about reads of the starting snapshot image being encrypted since
> > >> the starting snapshot image is widely shared. When the container is stopped,
> > >> no future different container (or human or host program) knows its key. This
> > >> limits the data which could be lost to a malicious service/human on the host
> > >> to only the volumes containing currently running containers.
> > >>
> > >> Some other folks envision having a base image encrypted with some per-vendor
> > >> key.
> > >> Then the machine is rekeyed with a per-machine key in perhaps the TPM
> > >> to use for updates and logfiles. When a user is created, a snapshot of a
> > >> base homedir forms the base of their user subvolume/directory, which is then
> > >> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
> > >> like could load their per-user key for their user subvolume/directory.
> > >>
> > >> Since we don't care about encrypting the common image, we initially
> > >> envisioned unencrypted snapshot images where we then turn on encryption and
> > >> have mixed unenc/enc data. The other usecase, though, really needs key
> > >> change so that everything's encrypted. And the argument that mixed unenc/enc
> > >> data is not safe was compelling.
> > >>
> > >> Hope that helps?
> > >
> > > Maybe a dumb question: why aren't you just using overlayfs? It's already
> > > possible to use overlayfs with an fscrypt-encrypted upperdir and workdir. When
> > > creating a new container you can create a new directory and assign it an fscrypt
> > > policy (with a per-container or per-user key or whatever that container wants),
> > > and create two subdirectories 'upperdir' and 'workdir' in it. Then just mount
> > > an overlayfs with that upperdir and workdir, and lowerdir referring to the
> > > starting rootfs. Then use that overlayfs as the rootfs as the container.
> > >
> > > Wouldn't that solve your use case exactly? Is there a reason you really want to
> > > create the container directly from a btrfs snapshot instead?
> >
> > Hardly; a quite intriguing idea. Let me think about this with folks when
> > we get back to work on Wednesday. Not sure how it goes with the other
> > usecase, the base image/per-machine/per-user combo, but will think about it.
>
> I like creating containers directly based on my host system for
> development and destructive purposes. It saves space and is incredibly
> useful.

A solution for that already exists. It's called btrfs snapshots.
Which you probably already know, since it's probably what you're using :-)

Using overlayfs would simply mean that each container consists of an upper and
lower directory instead of a single directory. Either or both could still be
btrfs subvolumes. They could even be on the same subvolume.

>
> But the layered key encryption thing is also core to the encryption
> strategy we want to take in Fedora, so I would really like to see this
> be possible with Btrfs encryption.
>
> Critically, it means that unlocking a user subvolume will always be
> multi-factor: something you have (machine key) and something you know
> (user credentials).

That's possible with the existing fscrypt semantics. Just use a unique master
key for each container, and protect it with a key derived from both the machine
key *and* the user credential. Protecting the fscrypt master key(s) is a
userspace problem, not a kernel one. The kernel just receives the raw key.

- Eric
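
To illustrate the last point, here is a rough userspace sketch of deriving a
key that depends on both the machine key and the user credential. It is not
taken from fscrypt or any existing tool: the function name
derive_wrapping_key, the "container-kek-v1" label, the salt handling, and the
iteration count are all assumptions made for the example, and the
construction has not been reviewed as a cryptographic design. It only shows
the derivation; actually wrapping the per-container master key would need an
authenticated cipher from a proper crypto library, which is left out here.

```python
import hashlib
import hmac
import os


def derive_wrapping_key(machine_key: bytes, passphrase: str, salt: bytes,
                        iterations: int = 200_000) -> bytes:
    """Return a 32-byte key that depends on BOTH the machine key and the
    user passphrase (hypothetical helper, illustration only)."""
    # Stretch the low-entropy user credential ("something you know").
    stretched = hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), salt, iterations
    )
    # Mix in the machine secret ("something you have") by using it as the
    # HMAC key; the label is an arbitrary domain-separation string.
    return hmac.new(
        machine_key, b"container-kek-v1" + stretched, hashlib.sha256
    ).digest()


if __name__ == "__main__":
    machine_key = os.urandom(32)  # stand-in for a TPM-protected secret
    salt = os.urandom(16)         # stored alongside the wrapped master key

    kek = derive_wrapping_key(machine_key, "user passphrase", salt)

    # Changing either factor yields a different wrapping key.
    assert kek != derive_wrapping_key(os.urandom(32), "user passphrase", salt)
    assert kek != derive_wrapping_key(machine_key, "wrong passphrase", salt)
    print("derived", len(kek), "byte wrapping key")
```

Userspace would use such a key to unwrap the stored per-container master key
and then pass the raw master key to the kernel (for fscrypt, via the
FS_IOC_ADD_ENCRYPTION_KEY ioctl), so the kernel never needs to know how the
key was protected.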