Re: Btrfs Fscrypt Design Document

Omar Sandoval <osandov@xxxxxxxxxxx> · Tue, 26 Oct 2021 00:53:25 -0700

On Mon, Oct 25, 2021 at 12:49:51PM -0700, Eric Biggers wrote:
> On Thu, Oct 21, 2021 at 11:34:19AM -0700, Omar Sandoval wrote:
> > Hello,
> > 
> > I've been working on adding fscrypt support to Btrfs. Btrfs has some
> > features (namely, reflinks and snapshots) that don't work well with the
> > existing fscrypt encryption policies. I've been discussing and
> > prototyping how to support these Btrfs features with fscrypt, so I
> > figured it was high time I write it down and loop in the fscrypt
> > developers as well.
> > 
> > Here is the Google Doc:
> > https://docs.google.com/document/d/1iNnrqyZqJ2I5nfWKt7cd1T9xwU0iHhjhk9ALQW3XuII/edit?usp=sharing
> > 
> > Please feel free to comment there or via email.
> > 
> 
> Just some preliminary comments:
> 
> Given that you need reflinking to remain supported, for file contents encryption
> I think it's the right choice to store the IVs explicitly rather than have them
> determined by the offset within the file.
> 
> How many derived encryption keys to use is somewhat orthogonal to that.  As I
> mentioned in my other mail, you could still have one key per extent rather than
> one per encryption policy as you're proposing.  I'm *guessing* it wouldn't be
> practical, and I don't consider it to be required (just preferable), but the
> document doesn't discuss this possibility at all.

I overlooked this option because my gut instinct was that the memory
usage would be prohibitive. It looks like one AES-256-XTS prepared key
is about 1k in memory (960 bytes for the encryption and decryption key
schedules for each key, plus a bit more for the crypto API structures).
I thought it'd be too expensive to store this naively for each cached
extent.

However, across various machines I checked, the number of cached inodes
and the number of cached extents is in the same order magnitude (and in
fact, almost equal in many cases). So per-extent keys aren't out of the
question. We can store a 16-byte nonce in the extent, use that to derive
the per-extent key from the master key, and use the offset in the extent
as the IV. I'll think about it some more and make sure I'm not missing
anything.

> Storing just the "starting IV" for each extent also makes sense, assuming that
> you only want to support an unauthenticated mode such as AES-XTS.  However,
> given that btrfs is a copy-on-write filesystem and thus can support per-block
> metadata, a natural question is why not support an authenticated mode such as
> AES-GCM, with a nonce and authentication tag stored per block?  Have you thought
> about this?
> 
> Now, I personally think that authenticating file contents only wouldn't give
> much benefit, and whole-filesystem authentication would be needed to get a real
> benefit.  But "why aren't you using an authenticated mode" is a *very* common
> question, so you need an answer to that -- or ideally, just support it if it
> isn't much work.

We already store a checksum per block; I don't see any reason that it
couldn't be a MAC. Johannes Thumshirn had a proof of concept for storing
an HMAC for all blocks:
https://lore.kernel.org/linux-btrfs/20191015121405.19066-1-jthumshirn@xxxxxxx/#b
Plumbing it through for authenticated encryption would be a little
harder, but probably not by much.

> What is your proposal for how filenames encryption would work when the
> EXPLICIT_IV flag is used?  That doesn't appear to be mentioned.

Since there's no such thing as "reflinking" filenames, I think filename
encryption can be unchanged, i.e., per-directory encryption keys. (This
would probably be the case with per-extent keys for data, as well.)

> Finally, the proposal to allow encrypting the changed data of snapshots is a
> larger departure from the fscrypt model.  I'm still trying to wrap my head
> around how that could work.  Could you provide any more details about that?
> E.g. what metadata would actually be stored on-disk, and how would it be used?
> When would things be done in terms of filesystem operations?  E.g. let's say I
> open a file for writing -- would the encryption key be set up right away, or
> would it not happen until I actually write data?

On disk, we still only need to store the usual fscrypt context. It will
always be present for the top-level of the snapshot. It may or may not
be present for any files or directories under that.

In memory, we'd store whether the subvolume is encrypted. This would be
set when enabling encryption and when caching the subvolume. Since every
inode has a reference to the subvolume it is in, and inodes can't move
between subvolumes, all we need is a check like:

if (IS_ENCRYPTED(inode->subvolume) && !IS_ENCRYPTED(inode))
	set_up_encryption(inode);

I'm leaning towards doing that either at the time that userspace writes
the data, or at the time that we're flushing the data to disk, whichever
ends up being more convenient for Btrfs. I'd rather not do it at open
time.

Thanks for the very helpful reply!