Re: [PATCH] ceph: do not truncate pagecache if truncate size doesn't change

Xiubo Li <xiubli@xxxxxxxxxx> · Wed, 17 Nov 2021 09:21:32 +0800

On 11/17/21 4:06 AM, Jeff Layton wrote:
On Tue, 2021-11-16 at 17:20 +0800, xiubli@xxxxxxxxxx wrote:
From: Xiubo Li <xiubli@xxxxxxxxxx>

In case truncating a file to a smaller sizeA, the sizeA will be kept
in truncate_size. And if truncate the file to a bigger sizeB, the
MDS will only increase the truncate_seq, but still using the sizeA as
the truncate_size.

Do you mean "kept in ci->i_truncate_size" ?

Sorry for confusing. It mainly will be kept in the MDS side's 
CInode->inode.truncate_size. And also will be propagated to all the 
clients' ci->i_truncate_size member.

The MDS will only change CInode->inode.truncate_size when truncating a 
smaller size.


If so, is this really the
correct fix? I'll note this in the sources:

         u32 i_truncate_seq;        /* last truncate to smaller size */
         u64 i_truncate_size;       /*  and the size we last truncated down to */

Maybe the MDS ought not bump the truncate_seq unless it was truncating
to a smaller size? If not, then that comment seems wrong at least.

Yeah, the above comments are inconsistent with what the MDS is doing.

Okay, I missed reading the code, I found in MDS that is introduced by 
commit :

     bf39d32d936 mds: bump truncate seq when fscrypt_file changes

With the size handling feature support, I think this commit will make no 
sense any more since we will calculate the 'truncating_smaller' by not 
only comparing the new_size and old_size, which both are rounded up to 
FSCRYPT BLOCK SIZE, will also check the 'req->get_data().length()' if 
the new_size and old_size are the same.



So when filling the inode it will truncate the pagecache by using
truncate_sizeA again, which makes no sense and will trim the inocent
pages.

Is there a reproducer for this? It would be nice to put something in
xfstests for it if so.

In xfstests' generic/075 has already testing this, but i didn't see any 
issue it reproduce. I just found this strange logs when it's doing 
something like:

truncateA 0x10000 --> 0x2000

truncateB 0x2000   --> 0x8000

truncateC 0x8000   --> 0x6000

For the truncateC, the log says:

ceph:  truncate_size 0x2000 -> 0x6000


The problem is that the truncateB will also do the vmtruncate by using 
the 0x2000 instead, the vmtruncate will not flush the dirty pages to the 
OSD and will just discard them from the pagecaches. Then we may lost 
some new updated data in case there has any write before the truncateB 
in range [0x2000, 0x8000).


Thanks

BRs

-- Xiubo



Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
---
  fs/ceph/inode.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 1b4ce453d397..b4f784684e64 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -738,10 +738,11 @@ int ceph_fill_file_size(struct inode *inode, int issued,
  			 * don't hold those caps, then we need to check whether
  			 * the file is either opened or mmaped
  			 */
-			if ((issued & (CEPH_CAP_FILE_CACHE|
+			if (ci->i_truncate_size != truncate_size &&
+			    ((issued & (CEPH_CAP_FILE_CACHE|
  				       CEPH_CAP_FILE_BUFFER)) ||
  			    mapping_mapped(inode->i_mapping) ||
-			    __ceph_is_file_opened(ci)) {
+			    __ceph_is_file_opened(ci))) {
  				ci->i_truncate_pending++;
  				queue_trunc = 1;
  			}