base-commit: accadeb67609a5a5d088ebde8409c3f6db0b84b4Thanks for sending this out! It's going to take me a while to go through everything, so please bear with me. In general I'd also really like to be seeing more feedback from the other btrfs developers. This is a hard project that really needs more eyes on it.
I appreciate your time on it!
From a brief look through your patchsets, there's one thing I want to bring up right away. It seems that one important design choice that you've made that has impacted much of your patchsets is that you've made each extent a fully standalone thing, similar to inodes currently. I.e., (a) Each extent gets a full 'fscrypt_context' stored along with it. That includes not just the nonce, but also the encryption modes and master key identifier. (b) For runtime caching, each extent gets a full 'struct fscrypt_info' object. It doesn't "belong" to any inode; it's set up in a fully standalone way, and the master key lookup and removal logic operates directly on the extent's 'struct fscrypt_info'. I'm not sure this is a good idea. What I had thought it was going to look like is that the encryption context/policy and 'struct fscrypt_info' would stay a property of the inode, and the extents themselves would be much more lightweight -- both on disk and in the cache. On-disk, all that should really be needed for an extent is the nonce for deriving the per-extent key. And in-memory, all that should really be needed is a "fscrypt_prepared_key" for the per-extent key, and a reference to the owning inode.
>The in memory reduction is plausible. For extents that are in memory but not yet written to disk, we need some way to keep track of the context, but we could drop the nonce/policy after that. I was aiming to have the same structure so that there's maximal similarity in info creation and things like fscrypt_generate_iv would always be getting an info, regardless of inode vs extent, but we could throw a conditional in there and create a different structure for in-memory extent infos.
However, it seems like an extent and a leaf inode in inode fscrypt need the same information, so if splitting the fscrypt_info structure makes sense, maybe it should be on that boundary?
I think that would avoid many of the problems that it seems you've had to work around or have had to change user-visible semantics for. For example the problems involving master keys being added and removed. It would also avoid having to overload 'fscrypt_info' to be either a per-inode or a per-extent key. And it would save space on disk and in memory.
I might be misunderstanding what you're referencing, but I think you're talking about the change where with extent fscrypt, IO has to be forced down before removing a key, otherwise it is lost. I think that's a fundamental problem given the filesystem has no way to know that there are new, dirty pages in the pagecache until those pages are issued for write, so it can't create a new extent or few until that point, potentially after the relevant key has been evicted. But maybe I'm missing a hook that would let us make extents earlier.
I suppose we could give each leaf inode a proper nonce/prepared key instead of borrowing its parent dir's: if a write came in after the key was removed but the inode is still open, the new extent(s) could grab the key material out of the inode's info. I don't like this very much since it could result in multiple extents grabbing the same key material, but I suppose it could work if it's important to maintain that behavior.
Being able to rekey a directory is the reason for having full contexts: suppose I take a snapshot of an encrypted dir and want to change the key for new data going forward, to avoid using a single key on too much data. It's too expensive to reencrypt every extent with the new key, since the whole point of a snapshot is to make a lightweight copy that gets COWed on write. Then each extent needs to know what its own master key identifier/policy/flags are, since different extents in the same file could have different master keys. We could say the mode and flags have to match, but it doesn't seem to me worth saving seven bytes to add a new structure to just store the master key identifier and nonce.Can you elaborate on why you went with a more "heavyweight" extents design?
For a non-Meta usecase, from what I've heard from Fedora-land, it's possibly interesting to them to be able to ship an encrypted image, and then be able to change the key after encrypted install to something user-controlled.
Without rekeying, my understanding is that we may write too much data with one key for safety; notes in the updated design doc https://docs.google.com/document/d/1janjxewlewtVPqctkWOjSa7OhCgB8Gdx7iDaCDQQNZA/edit?usp=sharing are that writing more than 1P per key raises cryptographic concerns, and since btrfs is COW and could have volumes up to the full 16E size that btrfs supports, we don't want to have just one immutable key per subvol.
To me the lightweight-on-disk vision sounds a lot like the original design: https://lore.kernel.org/linux-btrfs/YXGyq+buM79A1S0L@relinquished.localdomain and the Nov '22 version of the patchset: https://lore.kernel.org/linux-btrfs/cover.1667389115.git.sweettea-kernel@xxxxxxxxxx/ (which didn't have rekeying). I think rekeying is worth the higher disk usage; but I'm probably missing something about how your vision differs from the original. Could you please take a look at it again?
Maybe your motivation is that extents can be referenced by more than one inode and thus do not have a unique owning inode? That's true, but I don't think that really matters. All the inodes that reference an extent will have the sameencryption policy, right?
As above, not necessarily
Also, it looks like the "struct extent_map" that you're caching the per-extent key in is already cached on a per-inode basis, in btrfs_inode::extent_tree, similar to the pagecache which is also per-inode. So if the same extent happens to be accessed via multiple inodes, that's still going to cause the fscrypt key to be set up twice anyway.
A good point, and if you want me to take advantage of the one-copy-per-inode fact for general extent-based fscrypt I can do so.
Many thanks! Sweet Tea