On Wed, Oct 27, 2021 at 08:17:55AM -0400, Jeff Layton wrote: > On Wed, 2021-10-27 at 13:12 +0800, Xiubo Li wrote: > > On 10/26/21 4:01 AM, Jeff Layton wrote: > > > On Wed, 2021-10-20 at 21:28 +0800, xiubli@xxxxxxxxxx wrote: > > > > From: Xiubo Li <xiubli@xxxxxxxxxx> > > > > > > > > This will transfer the encrypted last block contents to the MDS > > > > along with the truncate request only when new size is smaller and > > > > not aligned to the fscrypt BLOCK size. > > > > > > > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx> > > > > --- > > > > fs/ceph/caps.c | 9 +-- > > > > fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------ > > > > 2 files changed, 190 insertions(+), 29 deletions(-) > > > > > > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c > > > > index 4e2a588465c5..1a36f0870d89 100644 > > > > --- a/fs/ceph/caps.c > > > > +++ b/fs/ceph/caps.c > > ... > > > > +fill_last_block: > > > > + pagelist = ceph_pagelist_alloc(GFP_KERNEL); > > > > + if (!pagelist) > > > > + return -ENOMEM; > > > > + > > > > + /* Insert the header first */ > > > > + header.ver = 1; > > > > + header.compat = 1; > > > > + /* sizeof(file_offset) + sizeof(block_size) + blen */ > > > > + header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE); > > > > + header.file_offset = cpu_to_le64(orig_pos); > > > > + if (fill_header_only) { > > > > + header.file_offset = cpu_to_le64(0); > > > > + header.block_size = cpu_to_le64(0); > > > > + } else { > > > > + header.file_offset = cpu_to_le64(orig_pos); > > > > + header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE); > > > > + } > > > > + ret = ceph_pagelist_append(pagelist, &header, sizeof(header)); > > > > + if (ret) > > > > + goto out; > > > > > > > > > > > Note that you're doing a read/modify/write cycle, and you must ensure > > > that the object remains consistent between the read and write or you may > > > end up with data corruption. This means that you probably need to > > > transmit an object version as part of the write. See this patch in the > > > stack: > > > > > > libceph: add CEPH_OSD_OP_ASSERT_VER support > > > > > > That op tells the OSD to stop processing the request if the version is > > > wrong. > > > > > > You'll want to grab the "ver" from the __ceph_sync_read call, and then > > > send that along with the updated last block. Then, when the MDS is > > > truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to > > > ensure that the object hasn't changed when doing it. If the assertion > > > trips, then the MDS should send back EAGAIN or something similar to the > > > client to tell it to retry. > > > > > > It's also possible (though I've never used it) to make an OSD request > > > assert that the contents of an extent haven't changed, so you could > > > instead send along the old contents along with the new ones, etc. > > > > > > That may end up being more efficient if the object is getting hammered > > > with I/O in other fscrypt blocks within the same object. It may be worth > > > exploring that avenue as well. > > > > Hi Jeff, > > > > One questions about this: > > > > Should we consider that the FSCRYPT BLOCK will cross two different Rados > > objects ? As default the Rados object size is 4MB. > > > > In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ? > > > > Or the object size should always be multiple of FSCRYPT BLOCK size ? > > > > The danger here is that it's very hard to ensure atomicity in RADOS > across two different objects. If your crypto blocks can span objects, > then you can end up with torn writes, and a torn write on a crypto block > turns it into garbage. > > So, I think we want to forbid: > > 1/ custom file layouts on encrypted files, to ensure that we don't end > up with weird object sizes. Luis' patch from August does this, but I > think we might want the MDS to also vet this. > > 2/ block sizes larger than the object size I believe that object sizes have a minimum of 65k, defined by CEPH_MIN_STRIPE_UNIT. So, maybe I'm oversimplifying, but I think we just need to ensure (a build-time check?) that CEPH_FSCRYPT_BLOCK_SIZE <= CEPH_MIN_STRIPE_UNIT Cheers, -- Luís > 3/ non-power-of-two crypto block sizes (so no 3k or 5k blocks, but you > could do 1k, 2k, 4k, 8k, etc...) > > ...with that we should be able to ensure that they never span objects. > Eventually we may be able to relax some of these constraints, but I > don't think most users will have a problem with these constraints. > > -- > Jeff Layton <jlayton@xxxxxxxxxx> >