Re: [PATCH v1 00/12] fscrypt: add extent encryption

Eric Biggers <ebiggers@xxxxxxxxxx> · Wed, 5 Jul 2023 09:28:08 -0700

On Wed, Jul 05, 2023 at 08:13:34AM -0400, Neal Gompa wrote:
> On Mon, Jul 3, 2023 at 10:03 PM Sweet Tea Dorminy
> <sweettea-kernel@xxxxxxxxxx> wrote:
> >
> >
> > >>> I do recall some discussion of making it possible to set an encryption policy on
> > >>> an *unencrypted* directory, causing new files in that directory to be encrypted.
> > >>> However, I don't recall any discussion of making it possible to add another
> > >>> encryption policy to an *already-encrypted* directory.  I think this is the
> > >>> first time this has been brought up.
> > >>
> > >> I think I referenced it in the updated design (fifth paragraph of "Extent
> > >> encryption" https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing)
> > >> but I didn't talk about it enough -- 'rekeying' is a substitute for adding a
> > >> policy to a directory full of unencrypted data. Ya'll's points about the
> > >> badness of having mixed unencrypted and encrypted data in a single dir were
> > >> compelling. (As I recall it, the issue with having mixed enc/unenc data is
> > >> that a bug or attacker could point an encrypted file autostarted in a
> > >> container, say /container/my-service, at a unencrypted extent under their
> > >> control, say /bin/bash, and thereby acquire a backdoor.)
> > >>>
> > >>> I think that allowing directories to have multiple encryption policies would
> > >>> bring in a lot of complexity.  How would it be configured, and what would the
> > >>> semantics for accessing it be?  Where would the encryption policies be stored?
> > >>> What if you have added some of the keys but not all of them?  What if some of
> > >>> the keys get removed but not all of them?
> > >> I'd planned to use add_enckey to add all the necessary keys, set_encpolicy
> > >> on an encrypted directory under the proper conditions (flags interpreted by
> > >> ioctl? check if filesystem has hook?) recursively calls a
> > >> filesystem-provided hook on each inode within to change the fscrypt_context.
> > >
> > > That sounds quite complex.  Recursive directory operations aren't really
> > > something the kernel does.  It would also require updating every inode, causing
> > > COW of every inode.  Isn't that something you'd really like to avoid, to make
> > > starting a new container as fast and lightweight as possible?
> >
> > A fair point. Can move the penalty to open or write time instead though:
> > btrfs could store a generation number with the new context on only the
> > directory changed, then leaf inodes or new extent can traverse up the
> > directory tree and grab context from the highest-generation-number
> > directory in its path to inherit from. Or btrfs could disallow changing
> > except on the base of a subvolume, and just go directly to the top of
> > the subvolume to grab the appropriate context. Neither of those require
> > recursion outside btrfs.
> >
> > >> On various machines, we currently have a btrfs filesystem containing various
> > >> volumes/snapshots containing starting states for containers. The snapshots
> > >> are generated by common snapshot images built centrally. The machines, as
> > >> the scheduler requests, start and stop containers on those volumes.
> > >>
> > >> We want to be able to start a container on a snapshot/volume such that every
> > >> write to the relevant snapshot/volume is using a per-container key, but we
> > >> don't care about reads of the starting snapshot image being encrypted since
> > >> the starting snapshot image is widely shared. When the container is stopped,
> > >> no future different container (or human or host program) knows its key. This
> > >> limits the data which could be lost to a malicious service/human on the host
> > >> to only the volumes containing currently running containers.
> > >>
> > >> Some other folks envision having a base image encrypted with some per-vendor
> > >> key. Then the machine is rekeyed with a per-machine key in perhaps the TPM
> > >> to use for updates and logfiles. When a user is created, a snapshot of a
> > >> base homedir forms the base of their user subvolume/directory, which is then
> > >> rekeyed with a per-user key. When the user logs in, systemd-homedir or the
> > >> like could load their per-user key for their user subvolume/directory.
> > >>
> > >> Since we don't care about encrypting the common image, we initially
> > >> envisioned unencrypted snapshot images where we then turn on encryption and
> > >> have mixed unenc/enc data. The other usecase, though, really needs key
> > >> change so that everything's encrypted. And the argument that mixed unenc/enc
> > >> data is not safe was compelling.
> > >>
> > >> Hope that helps?
> > >
> > > Maybe a dumb question: why aren't you just using overlayfs?  It's already
> > > possible to use overlayfs with an fscrypt-encrypted upperdir and workdir.  When
> > > creating a new container you can create a new directory and assign it an fscrypt
> > > policy (with a per-container or per-user key or whatever that container wants),
> > > and create two subdirectories 'upperdir' and 'workdir' in it.  Then just mount
> > > an overlayfs with that upperdir and workdir, and lowerdir referring to the
> > > starting rootfs.  Then use that overlayfs as the rootfs as the container.
> > >
> > > Wouldn't that solve your use case exactly?  Is there a reason you really want to
> > > create the container directly from a btrfs snapshot instead?
> >
> > Hardly; a quite intriguing idea. Let me think about this with folks when
> > we get back to work on Wednesday. Not sure how it goes with the other
> > usecase, the base image/per-machine/per-user combo, but will think about it.
> 
> I like creating containers directly based on my host system for
> development and destructive purposes. It saves space and is incredibly
> useful.

A solution for that already exists.  It's called btrfs snapshots.  Which you
probably already know, since it's probably what you're using :-)

Using overlayfs would simply mean that each container consists of an upper and
lower directory instead of a single directory.  Either or both could still be
btrfs subvolumes.  They could even be on the same subvolume.

> 
> But the layered key encryption thing is also core to the encryption
> strategy we want to take in Fedora, so I would really like to see this
> be possible with Btrfs encryption.
> 
> Critically, it means that unlocking a user subvolume will always be
> multi-factor: something you have (machine key) and something you know
> (user credentials).

That's possible with the existing fscrypt semantics.  Just use a unique master
key for each container, and protect it with a key derived from both the machine
key *and* the user credential.  Protecting the fscrypt master key(s) is a
userspace problem, not a kernel one.  The kernel just receives the raw key.

- Eric