On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote: > > tl;dr version: in cephfs, the MDS handles truncating object data when > inodes are truncated. This is problematic with fscrypt. > > Longer version: > > I've been working on a patchset to add fscrypt support to kcephfs, and > have hit a problem with the way that truncation is handled. The main > issue is that fscrypt uses block-based ciphers, so we must ensure that > we read and write complete crypto blocks on the OSDs. > > I'm currently using 4k crypto blocks, but we may want to allow this to > be tunable eventually (though it will need to be smaller than and align > with the OSD object size). For simplicity's sake, I'm planning to > disallow custom layouts on encrypted inodes. We could consider adding > that later (but it doesn't sound likely to be worthwhile). > > Normally, when a file is truncated (usually via a SETATTR MDS call), the > MDS handles truncating or deleting objects on the OSDs. This is done > somewhat lazily in that the MDS replies to the client before this > process is complete (AFAICT). So I've done some more research on this and it's not that simplistic. Broadly, a truncate causes the following to happen: - Revoke all write caps (but not Fcb) from clients. - Journal the truncate operation. - Respond with unsafe reply. - After setattr is journalled, regrant Fs with new file size, truncate_seq, truncate_size - issue trunc cap update with new file size, truncate_seq, truncate_size (looks redundant with prior step) - actually start truncating objects above file size; concurrently grant all wanted Fwb... caps wanted by client - reply safe >From what I can tell, the clients use the truncate_seq/truncate_size to avoid writing to data what the MDS plans to truncate. I haven't really dug into how that works. Maybe someone more familiar with that code can chime in. So the MDS seems to truncate/delete objects lazily in the background but it does so safely and consistently. > Once we add fscrypt support, the MDS handling truncation becomes a > problem, in that we need to be able to deal with complete crypto blocks. > Letting the MDS truncate away part of a block will leave us with a block > that can't be decrypted. > > There are a number of possible approaches to fixing this, but ultimately > the client will have to zero-pad, encrypt and write the blocks at the > edges since the MDS doesn't have access to the keys. > > There are several possible approaches that I've identified: > > 1/ We could teach the MDS the crypto blocksize, and ensure that it > doesn't truncate away partial blocks. The client could tell the MDS what > blocksize it's using on the inode and the MDS could ensure that > truncates align to the blocks. The client will still need to write > partial blocks at the edges of holes or at the EOF, and it probably > shouldn't do that until it gets the unstable reply from the MDS. We > could handle this by adding a new truncate op or extending the existing > one. > > 2/ We could cede the object truncate/delete to the client altogether. > The MDS is aware when an inode is encrypted so it could just not do it > for those inodes. We also already handle hole punching completely on the > client (though the size doesn't change there). Truncate could be a > special case of that. Probably, the client would issue the truncate and > then be responsible for deleting/rewriting blocks after that reply comes > in. We'd have to consider how to handle delinquent clients that don't > clean up correctly. We can't really do this I think. The MDS necessarily mediates between clients when files are truncated. > 3/ We could maintain a separate field in the inode for the real > inode->i_size that crypto-enabled clients would use. The client would > always communicate a size to the MDS that is rounded up to the end of > the last crypto block, such that the "true" size of the inode on disk > would always be represented in the rstats. Only crypto-enabled clients > would care about the "realsize" field. In fact, this value could > _itself_ be encrypted too, so that the i_size of the file is masked from > clients that don't have keys. > > Ceph's truncation machinery is pretty complex in general, so I could > have missed other approaches or something that makes these ideas > impossible. I'm leaning toward #3 here since I think it has the most > benefit and keeps the MDS out of the whole business. "realsize" could be mediated by the same locks as the inode size, so it should not be a complicated addition. Informing the MDS about a blocksize may be worse in the long run as it complicates all the truncate code paths, I think. From our past conversations, I think we posed (1) to generalize the (3) option? I don't have a strong opinion now on which is better in the long run (either for encryption or the maintainability of CephFS). If you're going to encrypt the realsize I wonder what other metadata you might encrypt? -- Patrick Donnelly, Ph.D. He / Him / His Principal Software Engineer Red Hat Sunnyvale, CA GPG: 19F28A586F808C2402351B93C3301A3E258DD79D