Re: [PATCH] ceph: do not truncate pagecache if truncate size doesn't change

Xiubo Li <xiubli@xxxxxxxxxx> · Tue, 23 Nov 2021 16:06:13 +0800

On 11/23/21 9:00 AM, Xiubo Li wrote:

On 11/23/21 3:10 AM, Jeff Layton wrote:
[...]
One thing I'm finding today is that this patch reliably makes
generic/445 hang at umount time with -o test_dummy_encryption
enabled...which is a bit strange as the test doesn't actually run:

     [jlayton@client1 xfstests-dev]$ sudo ./tests/generic/445
     QA output created by 445
     445 not run: xfs_io falloc  failed (old kernel/wrong fs?)
     [jlayton@client1 xfstests-dev]$ sudo umount /mnt/test

...and the umount hangs waiting for writeback to complete. When I back
this patch out, the problem goes away. Are you able to reproduce this?

There are no mds or osd calls in flight, and no caps (according to
debugfs). This is using -o test_dummy_encryption to force encryption.
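
For anyone wanting to repeat the debugfs check, a minimal sketch follows. It assumes debugfs is mounted at /sys/kernel/debug and that the kernel client exposes per-mount mdsc, osdc and caps files there; the function name dump_ceph_debugfs is made up for illustration and is not part of any tool mentioned in this thread.

```shell
#!/bin/bash
# Print the in-flight MDS requests (mdsc), in-flight OSD requests (osdc)
# and cap information (caps) for every kernel ceph mount.
# Assumption: debugfs is mounted at /sys/kernel/debug; reading it needs root.
# The optional first argument overrides the base directory.
dump_ceph_debugfs() {
    local base="${1:-/sys/kernel/debug/ceph}"
    local dir f
    for dir in "$base"/*/; do
        # If the glob matched nothing it stays literal, so skip non-directories.
        [ -d "$dir" ] || continue
        for f in mdsc osdc caps; do
            echo "== ${dir}${f}"
            cat "${dir}${f}" 2>/dev/null
        done
    done
}

# Usage (as root): dump_ceph_debugfs
```

Empty mdsc and osdc sections would be consistent with "no mds or osd calls in flight".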

I hit the same issue without "test_dummy_encryption": the test got
stuck, but I didn't see any calls to ceph. It wasn't 445, though; I
can't remember which test it was. I thought something was wrong with
my OS, so I just rebooted my VM.

# ps -aux | grep generic

root      564385  0.0  0.0  11804  4700 pts/1    S+   09:41 0:00 /bin/bash ./tests/generic/318

# cat /proc/564385/stack

[<0>] do_wait+0x2cc/0x4e0
[<0>] kernel_wait4+0xec/0x1b0
[<0>] __do_sys_wait4+0xe0/0xf0
[<0>] do_syscall_64+0x37/0x80
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
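
The same inspection can be scripted for several processes at once. This is only a sketch: dump_stacks is a hypothetical helper name, and reading /proc/<pid>/stack normally requires root.

```shell
#!/bin/bash
# Print the command line and kernel stack of each PID given as an argument.
# Assumption: Linux procfs at /proc; /proc/<pid>/stack is usually root-only.
dump_stacks() {
    local pid
    for pid in "$@"; do
        # Skip PIDs that have already exited.
        [ -d "/proc/$pid" ] || continue
        # cmdline is NUL-separated, so turn the separators into spaces.
        echo "== pid $pid: $(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")"
        cat "/proc/$pid/stack" 2>/dev/null \
            || echo "(stack not readable, try as root)"
    done
}

# Usage (as root): dump_stacks $(pgrep -f tests/generic)
```

A stack that ends in do_wait, as above, only shows the test script waiting on a child, so the children of that PID are the more interesting ones to dump.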

I hit this again today and found that the MDS daemon had crashed, and
when the standby MDSes replayed the journal log they crashed too.

I think this is why they got stuck. I will check it.

-- Xiubo

I ran the ceph.exlude tests for two days and saw this only once.

I have attached the test results; are they the same as yours? Many
test cases didn't run.

There are 4 failures. generic/020 reproduces about 30% of the time;
the other 3 fail every time, but none of them seems related to
fscrypt.

I narrowed it down to the call to _require_seek_data_hole. That calls
the seek_sanity_test binary and after that point, umounting the fs
hangs. I've not yet been successful at reproducing this while running
the binary by hand, so there may be some other preliminary ops that are
a factor too.
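
When trying to reproduce this by hand, it can help to run the umount in the background so the shell itself does not block on it. A rough sketch, assuming Linux procfs; still_running_after is a hypothetical helper name, and the 60-second grace period in the usage line is arbitrary:

```shell
#!/bin/bash
# Start a command in the background, wait a grace period, then report
# whether it is still running. Returns 0 if it is (a possible hang),
# 1 if it already finished. The child is not waited on while it runs, so
# this returns even if the child sits in uninterruptible sleep (state D).
still_running_after() {
    local grace="$1"; shift
    "$@" &
    local pid=$!
    sleep "$grace"
    local state
    # Empty if the process is gone; "Z" if it exited but is not yet reaped.
    state=$(awk '/^State:/ {print $2}' "/proc/$pid/status" 2>/dev/null)
    case "$state" in
        ""|Z)
            local rc=0
            wait "$pid" || rc=$?
            echo "pid $pid finished with status $rc"
            return 1
            ;;
    esac
    echo "pid $pid still running after ${grace}s (state $state)"
    # Kernel stack of the stuck task (usually needs root).
    cat "/proc/$pid/stack" 2>/dev/null
    return 0
}

# Usage (as root): still_running_after 60 umount /mnt/test
```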

In any case, this looks like a regression, so I'm going to drop this
patch for now. I'll keep poking at the problem too however.