Re: [PATCH v1 00/12] fscrypt: add extent encryption

Sweet Tea Dorminy <sweettea-kernel@xxxxxxxxxx> · Mon, 3 Jul 2023 13:06:17 -0400

base-commit: accadeb67609a5a5d088ebde8409c3f6db0b84b4

Thanks for sending this out!

It's going to take me a while to go through everything, so please bear with me.
In general I'd also really like to be seeing more feedback from the other btrfs
developers.  This is a hard project that really needs more eyes on it.

I appreciate your time on it!

 From a brief look through your patchsets, there's one thing I want to bring up
right away.  It seems that one important design choice that you've made that has
impacted much of your patchsets is that you've made each extent a fully
standalone thing, similar to inodes currently.  I.e.,

     (a) Each extent gets a full 'fscrypt_context' stored along with it.  That
         includes not just the nonce, but also the encryption modes and master
         key identifier.

     (b) For runtime caching, each extent gets a full 'struct fscrypt_info'
         object.  It doesn't "belong" to any inode; it's set up in a fully
         standalone way, and the master key lookup and removal logic operates
         directly on the extent's 'struct fscrypt_info'.

I'm not sure this is a good idea.  What I had thought it was going to look like
is that the encryption context/policy and 'struct fscrypt_info' would stay a
property of the inode, and the extents themselves would be much more lightweight
-- both on disk and in the cache.  On-disk, all that should really be needed for
an extent is the nonce for deriving the per-extent key.  And in-memory, all that
should really be needed is a "fscrypt_prepared_key" for the per-extent key, and
a reference to the owning inode.

>

The in memory reduction is plausible. For extents that are in memory but 

not yet written to disk, we need some way to keep track of the context, 

but we could drop the nonce/policy after that. I was aiming to have the 

same structure so that there's maximal similarity in info creation and 

things like fscrypt_generate_iv would always be getting an info, 

regardless of inode vs extent, but we could throw a conditional in there 

and create a different structure for in-memory extent infos.

However, it seems like an extent and a leaf inode in inode fscrypt need 

the same information, so if splitting the fscrypt_info structure makes 

sense, maybe it should be on that boundary?

I think that would avoid many of the problems that it seems you've had to work
around or have had to change user-visible semantics for.  For example the
problems involving master keys being added and removed.  It would also avoid
having to overload 'fscrypt_info' to be either a per-inode or a per-extent key.
And it would save space on disk and in memory.

I might be misunderstanding what you're referencing, but I think you're 

talking about the change where with extent fscrypt, IO has to be forced 

down before removing a key, otherwise it is lost. I think that's a 

fundamental problem given the filesystem has no way to know that there 

are new, dirty pages in the pagecache until those pages are issued for 

write, so it can't create a new extent or few until that point, 

potentially after the relevant key has been evicted. But maybe I'm 

missing a hook that would let us make extents earlier.

I suppose we could give each leaf inode a proper nonce/prepared key 

instead of borrowing its parent dir's: if a write came in after the key 

was removed but the inode is still open, the new extent(s) could grab 

the key material out of the inode's info. I don't like this very much 

since it could result in multiple extents grabbing the same key 

material, but I suppose it could work if it's important to maintain that 

behavior.

Can you elaborate on why you went with a more "heavyweight" extents design?

Being able to rekey a directory is the reason for having full contexts: 

suppose I take a snapshot of an encrypted dir and want to change the key 

for new data going forward, to avoid using a single key on too much 

data. It's too expensive to reencrypt every extent with the new key, 

since the whole point of a snapshot is to make a lightweight copy that 

gets COWed on write. Then each extent needs to know what its own master 

key identifier/policy/flags are, since different extents in the same 

file could have different master keys. We could say the mode and flags 

have to match, but it doesn't seem to me worth saving seven bytes to add 

a new structure to just store the master key identifier and nonce.

For a non-Meta usecase, from what I've heard from Fedora-land, it's 

possibly interesting to them to be able to ship an encrypted image, and 

then be able to change the key after encrypted install to something 

user-controlled.

Without rekeying, my understanding is that we may write too much data 

with one key for safety; notes in the updated design doc 

https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing 

are that writing more than 1P per key raises cryptographic concerns, and 

since btrfs is COW and could have volumes up to the full 16E size that 

btrfs supports, we don't want to have just one immutable key per subvol.

To me the lightweight-on-disk vision sounds a lot like the original 

design: 

https://lore.kernel.org/linux-btrfs/YXGyq+buM79A1S0L@relinquished.localdomain 

and the Nov '22 version of the patchset: 

https://lore.kernel.org/linux-btrfs/cover.1667389115.git.sweettea-kernel@xxxxxxxxxx/ 

(which didn't have rekeying). I think rekeying is worth the higher disk 

usage; but I'm probably missing something about how your vision differs 

from the original. Could you please take a look at it again?

Maybe your motivation is that extents can be referenced by more than one inode
and thus do not have a unique owning inode?  That's true, but I don't think that
really matters.  All the inodes that reference an extent will have the same

encryption policy, right? 

As above, not necessarily

Also, it looks like the "struct extent_map" that
you're caching the per-extent key in is already cached on a per-inode basis, in
btrfs_inode::extent_tree, similar to the pagecache which is also per-inode.  So
if the same extent happens to be accessed via multiple inodes, that's still
going to cause the fscrypt key to be set up twice anyway.

A good point, and if you want me to take advantage of the 

one-copy-per-inode fact for general extent-based fscrypt I can do so.

Many thanks!

Sweet Tea