Re: fscrypt and file truncation on cephfs

Patrick Donnelly <pdonnell@xxxxxxxxxx> · Thu, 11 Mar 2021 20:17:47 -0800

On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
>
> tl;dr version: in cephfs, the MDS handles truncating object data when
> inodes are truncated. This is problematic with fscrypt.
>
> Longer version:
>
> I've been working on a patchset to add fscrypt support to kcephfs, and
> have hit a problem with the way that truncation is handled. The main
> issue is that fscrypt uses block-based ciphers, so we must ensure that
> we read and write complete crypto blocks on the OSDs.
>
> I'm currently using 4k crypto blocks, but we may want to allow this to
> be tunable eventually (though it will need to be smaller than and align
> with the OSD object size). For simplicity's sake, I'm planning to
> disallow custom layouts on encrypted inodes. We could consider adding
> that later (but it doesn't sound likely to be worthwhile).
>
> Normally, when a file is truncated (usually via a SETATTR MDS call), the
> MDS handles truncating or deleting objects on the OSDs. This is done
> somewhat lazily in that the MDS replies to the client before this
> process is complete (AFAICT).

So I've done some more research on this and it's not that simple.
Broadly, a truncate causes the following to happen:

- Revoke all write caps (but not Fcb) from clients.

- Journal the truncate operation.

- Respond with unsafe reply.

- After setattr is journalled, regrant Fs with new file size,
truncate_seq, truncate_size.

- Issue trunc cap update with new file size, truncate_seq,
truncate_size (looks redundant with prior step).

- Actually start truncating objects above file size; concurrently
grant all Fwb... caps wanted by the client.

- Reply safe.

From what I can tell, the clients use the truncate_seq/truncate_size
to avoid writing data that the MDS plans to truncate. I haven't
really dug into how that works. Maybe someone more familiar with that
code can chime in.

So the MDS seems to truncate/delete objects lazily in the background
but it does so safely and consistently.
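
For anyone following along, here is a toy model of how a sequence/size
pair like this can keep a racing write from resurrecting truncated
data. This is an assumption about the general pattern, not a
transcription of the client or OSD code: ToyObject, apply_truncate and
apply_write are hypothetical names, and striping is ignored (a single
object stands in for the whole file).

```python
class ToyObject:
    """Hypothetical single-object store that remembers the newest truncate seen."""

    def __init__(self):
        self.data = bytearray()
        self.truncate_seq = 0
        self.truncate_size = 0

    def apply_truncate(self, truncate_seq, truncate_size):
        # Only a newer truncate changes anything; stale ones are ignored.
        if truncate_seq > self.truncate_seq:
            del self.data[truncate_size:]
            self.truncate_seq = truncate_seq
            self.truncate_size = truncate_size

    def apply_write(self, offset, payload, truncate_seq, truncate_size):
        if truncate_seq > self.truncate_seq:
            # The writer knows about a truncate this object has not seen
            # yet: apply that truncate first, then write normally.
            self.apply_truncate(truncate_seq, truncate_size)
        elif truncate_seq < self.truncate_seq:
            # The write predates the newest truncate: clip it so it cannot
            # put data back beyond the recorded truncate_size.
            payload = payload[:max(0, self.truncate_size - offset)]
        if not payload:
            return
        if len(self.data) < offset:
            # Zero-fill any gap between the current end and the write offset.
            self.data.extend(bytes(offset - len(self.data)))
        self.data[offset:offset + len(payload)] = payload
```

In this model a write tagged with an old truncate_seq is clipped at the
recorded truncate_size, while a write tagged with a newer truncate_seq
carries the truncate along with it. Whether the real code paths behave
exactly this way is something I have not verified.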

> Once we add fscrypt support, the MDS handling truncation becomes a
> problem, in that we need to be able to deal with complete crypto blocks.
> Letting the MDS truncate away part of a block will leave us with a block
> that can't be decrypted.
>
> There are a number of possible approaches to fixing this, but ultimately
> the client will have to zero-pad, encrypt and write the blocks at the
> edges since the MDS doesn't have access to the keys.
>
> There are several possible approaches that I've identified:
>
> 1/ We could teach the MDS the crypto blocksize, and ensure that it
> doesn't truncate away partial blocks. The client could tell the MDS what
> blocksize it's using on the inode and the MDS could ensure that
> truncates align to the blocks. The client will still need to write
> partial blocks at the edges of holes or at the EOF, and it probably
> shouldn't do that until it gets the unstable reply from the MDS. We
> could handle this by adding a new truncate op or extending the existing
> one.
>
> 2/ We could cede the object truncate/delete to the client altogether.
> The MDS is aware when an inode is encrypted so it could just not do it
> for those inodes. We also already handle hole punching completely on the
> client (though the size doesn't change there). Truncate could be a
> special case of that. Probably, the client would issue the truncate and
> then be responsible for deleting/rewriting blocks after that reply comes
> in. We'd have to consider how to handle delinquent clients that don't
> clean up correctly.

We can't really do this I think. The MDS necessarily mediates between
clients when files are truncated.

> 3/ We could maintain a separate field in the inode for the real
> inode->i_size that crypto-enabled clients would use. The client would
> always communicate a size to the MDS that is rounded up to the end of
> the last crypto block, such that the "true" size of the inode on disk
> would always be represented in the rstats. Only crypto-enabled clients
> would care about the "realsize" field. In fact, this value could
> _itself_ be encrypted too, so that the i_size of the file is masked from
> clients that don't have keys.
>
> Ceph's truncation machinery is pretty complex in general, so I could
> have missed other approaches or something that makes these ideas
> impossible. I'm leaning toward #3 here since I think it has the most
> benefit and keeps the MDS out of the whole business.

"realsize" could be mediated by the same locks as the inode size, so
it should not be a complicated addition. Informing the MDS about a
blocksize may be worse in the long run as it complicates all the
truncate code paths, I think. From our past conversations, I think we
posed (1) as a generalization of option (3)? I don't have a strong opinion
now on which is better in the long run (either for encryption or the
maintainability of CephFS).
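
To make option (3) concrete, here is a minimal sketch of the size
bookkeeping, assuming the 4k crypto block size mentioned above. The
function names are hypothetical and nothing here reflects an actual
MDS or client interface; it only shows the arithmetic.

```python
CRYPTO_BLOCK_SIZE = 4096  # assumption: the 4k crypto blocks from this thread


def reported_size(real_size, block_size=CRYPTO_BLOCK_SIZE):
    """Size the client would send to the MDS: real size rounded up to a block boundary."""
    if real_size < 0:
        raise ValueError("size must be non-negative")
    return -(-real_size // block_size) * block_size  # ceiling division, then scale


def edge_block(real_size, block_size=CRYPTO_BLOCK_SIZE):
    """Return (block_start, valid_bytes) for the block straddling EOF, or None if aligned.

    This is the block the client would have to zero-pad, encrypt and
    rewrite itself, since the MDS has no access to the keys.
    """
    valid = real_size % block_size
    if valid == 0:
        return None
    return (real_size - valid, valid)
```

For example, truncating to a "realsize" of 5000 would report 8192 to
the MDS, and the client would rewrite the block starting at offset
4096, of which only the first 904 bytes are file data.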

If you're going to encrypt the realsize, I wonder what other metadata
you might encrypt?

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D