On Fri, 2021-03-12 at 00:43 -0800, Gregory Farnum wrote:
> On Thu, Mar 11, 2021 at 8:18 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
> >
> > On Thu, Mar 11, 2021 at 8:15 AM Jeff Layton <jlayton@xxxxxxxxxx> wrote:
> > >
> > > tl;dr version: in cephfs, the MDS handles truncating object data when
> > > inodes are truncated. This is problematic with fscrypt.
> > >
> > > Longer version:
> > >
> > > I've been working on a patchset to add fscrypt support to kcephfs, and
> > > have hit a problem with the way that truncation is handled. The main
> > > issue is that fscrypt uses block-based ciphers, so we must ensure that
> > > we read and write complete crypto blocks on the OSDs.
> > >
> > > I'm currently using 4k crypto blocks, but we may want to allow this to
> > > be tunable eventually (though it will need to be smaller than and align
> > > with the OSD object size). For simplicity's sake, I'm planning to
> > > disallow custom layouts on encrypted inodes. We could consider adding
> > > that later (but it doesn't sound likely to be worthwhile).
> > >
> > > Normally, when a file is truncated (usually via a SETATTR MDS call), the
> > > MDS handles truncating or deleting objects on the OSDs. This is done
> > > somewhat lazily in that the MDS replies to the client before this
> > > process is complete (AFAICT).
> >
> > So I've done some more research on this and it's not that simplistic.
> > Broadly, a truncate causes the following to happen:
> >
> > - Revoke all write caps (but not Fcb) from clients.
> >
> > - Journal the truncate operation.
> >
> > - Respond with unsafe reply.
> >
> > - After setattr is journalled, regrant Fs with new file size,
> >   truncate_seq, truncate_size
> >
> > - issue trunc cap update with new file size, truncate_seq,
> >   truncate_size (looks redundant with prior step)
> >
> > - actually start truncating objects above file size; concurrently
> >   grant all wanted Fwb... caps wanted by client
> >
> > - reply safe
> >
> > From what I can tell, the clients use the truncate_seq/truncate_size
> > to avoid writing to data that the MDS plans to truncate. I haven't
> > really dug into how that works. Maybe someone more familiar with that
> > code can chime in.
> >
> > So the MDS seems to truncate/delete objects lazily in the background
> > but it does so safely and consistently.
>
> Right; it's lazy in that it's not done immediately in a blocking
> manner, but it's absolutely safe. Truncate seq and size are also
> fields you can send to the OSD on read or write operations, and the
> client includes them on every op. It just has to do a (reasonably)
> simple conversion from the total truncate size the MDS gives it to
> what that means for the object being accessed (based on the striping
> pattern and object number).
>
> I'll try and think a bit more on how to handle the special extra size
> for encryption.
>
> ...although in my current sleep-addled state, I'm actually not sure we
> need to add any permanent storage to the MDS to handle this case! We
> can probably just extend the front-end truncate op so that it can take
> a separate "real-truncate-size" and the logical file size, can't we?

That would be one nice thing about the approach of #1. Truncating the
size downward is always done via an explicit SETATTR op (I think), so we
could just extend that with a new field that tells the MDS where to stop
truncating.

Note that regardless of the approach we choose, the client will still
need to do a read/modify/write on the edge block before we can really
treat the truncation as "done". I'm not yet sure whether that has any
bearing on the consistency/safety of the truncation process.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
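
To make the size arithmetic in this thread concrete, here is a rough Python sketch, not the actual kernel or MDS code. It assumes 4k crypto blocks (as in the current patchset) and a simple layout with a 4 MiB object size and a stripe count of 1. The function names are hypothetical, and reading "real-truncate-size" as the logical size rounded up to a crypto block boundary is my interpretation of the proposal, not something settled above.

```python
CRYPTO_BLOCK_SIZE = 4096        # 4k crypto blocks, per the current patchset
OBJECT_SIZE = 4 * 1024 * 1024   # assumption: 4 MiB objects, stripe_count == 1


def real_truncate_size(logical_size, block_size=CRYPTO_BLOCK_SIZE):
    """Round the logical size up to the next crypto block boundary.

    Hypothetical reading of "real-truncate-size": the point past which
    object data could be dropped without cutting a crypto block in half.
    """
    return -(-logical_size // block_size) * block_size


def edge_block(logical_size, block_size=CRYPTO_BLOCK_SIZE):
    """Return (block_offset, valid_bytes) for the partial block at the
    new end of file, or None if the size is already block-aligned.

    This is the block the client would need to read/modify/write.
    """
    valid = logical_size % block_size
    if valid == 0:
        return None
    return (logical_size - valid, valid)


def object_truncate_size(truncate_size, object_no, object_size=OBJECT_SIZE):
    """Convert a file-level truncate size into a per-object one.

    Only valid for the simple layout assumed here (one object per stripe);
    real striping patterns need a more involved conversion.
    """
    start = object_no * object_size
    return min(max(truncate_size - start, 0), object_size)
```

For example, with a logical size of 5000 bytes this gives a real truncate size of 8192 and an edge block at offset 4096 holding 904 valid bytes. A truncate to 5 MiB maps to a full 4 MiB for object 0, 1 MiB for object 1, and 0 for any later object.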