Re: [PATCH v2 4/4] ceph: add truncate size handling support for fscrypt

Xiubo Li <xiubli@xxxxxxxxxx> · Wed, 27 Oct 2021 13:12:23 +0800

On 10/26/21 4:01 AM, Jeff Layton wrote:
On Wed, 2021-10-20 at 21:28 +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

This will transfer the encrypted last block contents to the MDS
along with the truncate request only when new size is smaller and
not aligned to the fscrypt BLOCK size.

Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---
  fs/ceph/caps.c  |   9 +--
  fs/ceph/inode.c | 210 ++++++++++++++++++++++++++++++++++++++++++------
  2 files changed, 190 insertions(+), 29 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 4e2a588465c5..1a36f0870d89 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
...
+fill_last_block:
+	pagelist = ceph_pagelist_alloc(GFP_KERNEL);
+	if (!pagelist)
+		return -ENOMEM;
+
+	/* Insert the header first */
+	header.ver = 1;
+	header.compat = 1;
+	/* sizeof(file_offset) + sizeof(block_size) + blen */
+	header.data_len = cpu_to_le32(8 + 8 + CEPH_FSCRYPT_BLOCK_SIZE);
+	header.file_offset = cpu_to_le64(orig_pos);
+	if (fill_header_only) {
+		header.file_offset = cpu_to_le64(0);
+		header.block_size = cpu_to_le64(0);
+	} else {
+		header.file_offset = cpu_to_le64(orig_pos);
+		header.block_size = cpu_to_le64(CEPH_FSCRYPT_BLOCK_SIZE);
+	}
+	ret = ceph_pagelist_append(pagelist, &header, sizeof(header));
+	if (ret)
+		goto out;

Note that you're doing a read/modify/write cycle, and you must ensure
that the object remains consistent between the read and write or you may
end up with data corruption. This means that you probably need to
transmit an object version as part of the write. See this patch in the
stack:

     libceph: add CEPH_OSD_OP_ASSERT_VER support

That op tells the OSD to stop processing the request if the version is
wrong.

You'll want to grab the "ver" from the __ceph_sync_read call, and then
send that along with the updated last block. Then, when the MDS is
truncating, it can use a CEPH_OSD_OP_ASSERT_VER op with that version to
ensure that the object hasn't changed when doing it. If the assertion
trips, then the MDS should send back EAGAIN or something similar to the
client to tell it to retry.

It's also possible (though I've never used it) to make an OSD request
assert that the contents of an extent haven't changed, so you could
instead send along the old contents along with the new ones, etc.

That may end up being more efficient if the object is getting hammered
with I/O in other fscrypt blocks within the same object. It may be worth
exploring that avenue as well.

Hi Jeff,

One questions about this:

Should we consider that the FSCRYPT BLOCK will cross two different Rados 
objects ? As default the Rados object size is 4MB.

In case the FSCRYPT BLOCK size is 4K, when the object size is 3K or 5K ?

Or the object size should always be multiple of FSCRYPT BLOCK size ?

Thanks