Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Oct 27, 2021 at 08:17:55AM -0400, Jeff Layton wrote:
> On Wed, 2021-10-27 at 13:12 +0800, Xiubo Li wrote:
> > On 10/26/21 4:01 AM, Jeff Layton wrote:
> > > On Wed, 2021-10-20 at 21:28 +0800, xiubli@xxxxxxxxxx wrote:
> > > > From: Xiubo Li <xiubli@xxxxxxxxxx>
> > > > 
> > > > This will transfer the encrypted last block contents to the MDS
> > > > along with the truncate request only when new size is smaller and
> > > > not aligned to the fscrypt BLOCK size.
> > > > 
> > > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > > > ---
> > > >   fs/ceph/caps.c  |   9 +--
> > > >   fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
> > > >   2 files changed, 190 insertions(+), 29 deletions(-)
> > > > 
> > > > diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
> > > > index 4e2a588465c5..1a36f0870d89 100644
> > > > --- a/fs/ceph/caps.c
> > > > +++ b/fs/ceph/caps.c
> > ...
> > > > +fill_last_block:
> > > > +	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
> > > > +	if (!pagelist)
> > > > +		return -ENOMEM;
> > > > +
> > > > +	/* Insert the header first */
> > > > +	header.ver = 1;
> > > > +	header.compat = 1;
> > > > +	/* sizeof(file_offset) + sizeof(block_size) + blen */
> > > > +	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
> > > > +	header.file_offset = cpu_to_le64(orig_pos);
> > > > +	if (fill_header_only) {
> > > > +		header.file_offset = cpu_to_le64(0);
> > > > +		header.block_size = cpu_to_le64(0);
> > > > +	} else {
> > > > +		header.file_offset = cpu_to_le64(orig_pos);
> > > > +		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
> > > > +	}
> > > > +	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
> > > > +	if (ret)
> > > > +		goto out;
> > > > 
> > > > 
> > > Note that you're doing a read/modify/write cycle, and you must ensure
> > > that the object remains consistent between the read and write or you may
> > > end up with data corruption. This means that you probably need to
> > > transmit an object version as part of the write. See this patch in the
> > > stack:
> > > 
> > >      libceph: add CEPH_OSD_OP_ASSERT_VER support
> > > 
> > > That op tells the OSD to stop processing the request if the version is
> > > wrong.
> > > 
> > > You'll want to grab the "ver" from the __ceph_sync_read call, and then
> > > send that along with the updated last block. Then, when the MDS is
> > > truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
> > > ensure that the object hasn't changed when doing it. If the assertion
> > > trips, then the MDS should send back EAGAIN or something similar to the
> > > client to tell it to retry.
> > > 
> > > It's also possible (though I've never used it) to make an OSD request
> > > assert that the contents of an extent haven't changed, so you could
> > > instead send along the old contents along with the new ones, etc.
> > > 
> > > That may end up being more efficient if the object is getting hammered
> > > with I/O in other fscrypt blocks within the same object. It may be worth
> > > exploring that avenue as well.
> > 
> > Hi Jeff,
> > 
> > One questions about this:
> > 
> > Should we consider that the FSCRYPT BLOCK will cross two different Rados 
> > objects ? As default the Rados object size is 4MB.
> > 
> > In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?
> > 
> > Or the object size should always be multiple of FSCRYPT BLOCK size ?
> > 
> 
> The danger here is that it's very hard to ensure atomicity in RADOS
> across two different objects. If your crypto blocks can span objects,
> then you can end up with torn writes, and a torn write on a crypto block
> turns it into garbage.
> 
> So, I think we want to forbid:
> 
> 1/ custom file layouts on encrypted files, to ensure that we don't end
> up with weird object sizes. Luis' patch from August does this, but I
> think we might want the MDS to also vet this.
> 
> 2/ block sizes larger than the object size

I believe that object sizes have a minimum of 65k, defined by
CEPH_MIN_STRIPE_UNIT.  So, maybe I'm oversimplifying, but I think we just
need to ensure (a build-time check?) that

  CEPH_FSCRYPT_BLOCK_SIZE <= CEPH_MIN_STRIPE_UNIT

Cheers,
--
Luís

> 3/ non-power-of-two crypto block sizes (so no 3k or 5k blocks, but you
> could do 1k, 2k, 4k, 8k, etc...)
> 
> ...with that we should be able to ensure that they never span objects.
> Eventually we may be able to relax some of these constraints, but I
> don't think most users will have a problem with these constraints.
> 
> -- 
> Jeff Layton <jlayton@xxxxxxxxxx>
> 



[Index of Archives]     [CEPH Users]     [Ceph Large]     [Ceph Dev]     [Information on CEPH]     [Linux BTRFS]     [Linux USB Devel]     [Video for Linux]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux