Re: [PATCH v1 00/12] fscrypt: add extent encryption

Sweet Tea Dorminy <sweettea-kernel@xxxxxxxxxx> · Mon, 3 Jul 2023 21:57:25 -0400

I do recall some discussion of making it possible to set an encryption policy on
an *unencrypted* directory, causing new files in that directory to be encrypted.
However, I don't recall any discussion of making it possible to add another
encryption policy to an *already-encrypted* directory.  I think this is the
first time this has been brought up.

I think I referenced it in the updated design (fifth paragraph of "Extent
encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
policy to a directory full of unencrypted data. Y'all's points about the
badness of having mixed unencrypted and encrypted data in a single dir were
compelling. (As I recall it, the issue with having mixed enc/unenc data is
that a bug or attacker could point an encrypted file autostarted in a
container, say /container/my-service, at an unencrypted extent under their
control, say /bin/bash, and thereby acquire a backdoor.)

I think that allowing directories to have multiple encryption policies would
bring in a lot of complexity.  How would it be configured, and what would the
semantics for accessing it be?  Where would the encryption policies be stored?
What if you have added some of the keys but not all of them?  What if some of
the keys get removed but not all of them?

I'd planned to use add_enckey to add all the necessary keys. Then
set_encpolicy on an encrypted directory, under the proper conditions (flags
interpreted by ioctl? check if filesystem has hook?), would recursively call
a filesystem-provided hook on each inode within to change the
fscrypt_context.

That sounds quite complex.  Recursive directory operations aren't really
something the kernel does.  It would also require updating every inode, causing
COW of every inode.  Isn't that something you'd really like to avoid, to make
starting a new container as fast and lightweight as possible?

A fair point. The penalty can be moved to open or write time instead, though:
btrfs could store a generation number with the new context on only the
directory changed; then leaf inodes or new extents can traverse up the
directory tree and grab the context from the highest-generation-number
directory in their path to inherit from. Or btrfs could disallow changing
except on the base of a subvolume, and just go directly to the top of
the subvolume to grab the appropriate context. Neither of those requires
recursion outside btrfs.
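
To make the two lookup schemes concrete, here is a rough userspace model in
Python. It is only a sketch, not btrfs code: the Dir class, its fields, and
both function names are hypothetical, and a plain string stands in for an
fscrypt_context.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Dir:
    """Hypothetical stand-in for a directory inode."""
    name: str
    parent: Optional["Dir"] = None
    context: Optional[str] = None  # placeholder for an fscrypt_context
    generation: int = 0            # generation at which context was last set


def effective_context(start: Dir) -> Optional[str]:
    """Scheme 1: walk up to the root and inherit from the directory on the
    path whose context carries the highest generation number."""
    best = None
    node = start
    while node is not None:
        if node.context is not None and (
            best is None or node.generation > best.generation
        ):
            best = node
        node = node.parent
    return best.context if best is not None else None


def subvolume_root_context(start: Dir) -> Optional[str]:
    """Scheme 2: contexts may only change on the subvolume base, so go
    straight to the top and use whatever is stored there."""
    node = start
    while node.parent is not None:
        node = node.parent
    return node.context


# Assumed layout: subvol/home/alice/docs
root = Dir("subvol", context="ctx-image", generation=1)
home = Dir("home", parent=root)
alice = Dir("alice", parent=home, context="ctx-alice", generation=7)
docs = Dir("docs", parent=alice)

before = effective_context(docs)   # alice (generation 7) beats root (1)

# Rekey only the subvolume base; no child inode is touched.
root.context, root.generation = "ctx-machine", 9
after = effective_context(docs)    # root (generation 9) now wins

print(before, after, subvolume_root_context(docs))
```

In both schemes the rekey itself updates a single directory, and the cost of
finding the right context is paid by whoever opens or writes under it.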

On various machines, we currently have a btrfs filesystem containing various
volumes/snapshots containing starting states for containers. The snapshots
are generated by common snapshot images built centrally. The machines, as
the scheduler requests, start and stop containers on those volumes.

We want to be able to start a container on a snapshot/volume such that every
write to the relevant snapshot/volume is using a per-container key, but we
don't care about reads of the starting snapshot image being encrypted since
the starting snapshot image is widely shared. When the container is stopped,
no future different container (or human or host program) knows its key. This
limits the data which could be lost to a malicious service/human on the host
to only the volumes containing currently running containers.

Some other folks envision having a base image encrypted with some per-vendor
key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
to use for updates and logfiles. When a user is created, a snapshot of a
base homedir forms the base of their user subvolume/directory, which is then
rekeyed with a per-user key. When the user logs in, systemd-homedir or the
like could load their per-user key for their user subvolume/directory.

Since we don't care about encrypting the common image, we initially
envisioned unencrypted snapshot images where we then turn on encryption and
have mixed unenc/enc data. The other usecase, though, really needs key
change so that everything's encrypted. And the argument that mixed unenc/enc
data is not safe was compelling.

Hope that helps?

Maybe a dumb question: why aren't you just using overlayfs?  It's already
possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
creating a new container you can create a new directory and assign it an fscrypt
policy (with a per-container or per-user key or whatever that container wants),
and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
an overlayfs with that upperdir and workdir, and lowerdir referring to the
starting rootfs.  Then use that overlayfs as the rootfs of the container.

Wouldn't that solve your use case exactly?  Is there a reason you really want to
create the container directly from a btrfs snapshot instead?
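
The overlayfs layout described above could be assembled along these lines.
This is a minimal sketch that only builds the mount command and does not run
it; the function name and all paths are made up for illustration, and it
assumes the per-container directory already exists and has had its fscrypt
policy set before 'upperdir' and 'workdir' are created inside it.

```python
from pathlib import PurePosixPath


def overlay_mount_argv(container_dir: str, lowerdir: str, mountpoint: str) -> list:
    """Build the argv for mounting an overlayfs whose upperdir and workdir
    live inside an (already encrypted) per-container directory."""
    base = PurePosixPath(container_dir)
    upper = base / "upperdir"   # receives all of the container's writes
    work = base / "workdir"     # overlayfs scratch space, same filesystem
    options = f"lowerdir={lowerdir},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", options, mountpoint]


# Hypothetical paths: shared starting rootfs below, per-container dir above.
argv = overlay_mount_argv("/containers/c1", "/images/rootfs", "/run/c1/rootfs")
print(" ".join(argv))
```

Only the upper layer needs the per-container key; the lower layer is the
shared starting rootfs and is never written to.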

Hardly; a quite intriguing idea. Let me think about this with folks when
we get back to work on Wednesday. Not sure how it goes with the other
usecase, the base image/per-machine/per-user combo, but will think about it.