On Fri, 2021-11-19 at 12:29 +0800, Xiubo Li wrote:
> On 11/19/21 2:13 AM, Jeff Layton wrote:
> > On Tue, 2021-11-16 at 17:20 +0800, xiubli@xxxxxxxxxx wrote:
> > > From: Xiubo Li <xiubli@xxxxxxxxxx>
> > >
> > > When truncating a file to a smaller sizeA, sizeA will be kept
> > > in truncate_size. And if the file is then truncated to a bigger
> > > sizeB, the MDS will only increase the truncate_seq, but will still
> > > keep sizeA as the truncate_size.
> > >
> > > So when filling the inode it will truncate the pagecache by using
> > > the stale sizeA again, which makes no sense and will trim innocent
> > > pages.
> > >
> > > Signed-off-by: Xiubo Li <xiubli@xxxxxxxxxx>
> > > ---
> > >  fs/ceph/inode.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> > > index 1b4ce453d397..b4f784684e64 100644
> > > --- a/fs/ceph/inode.c
> > > +++ b/fs/ceph/inode.c
> > > @@ -738,10 +738,11 @@ int ceph_fill_file_size(struct inode *inode, int issued,
> > >  		 * don't hold those caps, then we need to check whether
> > >  		 * the file is either opened or mmaped
> > >  		 */
> > > -		if ((issued & (CEPH_CAP_FILE_CACHE|
> > > +		if (ci->i_truncate_size != truncate_size &&
> > > +		    ((issued & (CEPH_CAP_FILE_CACHE|
> > >  			       CEPH_CAP_FILE_BUFFER)) ||
> > >  		    mapping_mapped(inode->i_mapping) ||
> > > -		    __ceph_is_file_opened(ci)) {
> > > +		    __ceph_is_file_opened(ci))) {
> > >  			ci->i_truncate_pending++;
> > >  			queue_trunc = 1;
> > >  		}
> >
> > Starting a new thread since this one is getting cluttered. I did a
> > kernel build today with the wip-fscrypt-size branch with this patch on
> > top and installed it on my client.
> >
> > I also rebuilt my cluster machine to use a container image based on your
> > current fsize_support branch (commit 34eafd9db1ae). That host runs 3 VMs
> > that each act as cephadm hosts. Each is assigned a physical ssd. I
> > create two filesystems named "test" and "scratch", both with max_mds set
> > to 3. Under each of those, I create a directory /client1 and set each up
> > to use ceph.dir.pin.random at 0.1.
> >
> > I saw 2 test failures and then it hung on generic/444:
> >
> > generic/029 10s ... - output mismatch (see /home/jlayton/git/xfstests-dev/results//generic/029.out.bad)
> >     --- tests/generic/029.out	2020-11-09 14:47:52.488429897 -0500
> >     +++ /home/jlayton/git/xfstests-dev/results//generic/029.out.bad	2021-11-18 11:04:18.494800609 -0500
> >     @@ -49,17 +49,3 @@
> >      00001400 00                                               |.|
> >      00001401
> >      ==== Post-Remount ==
> >     -00000000 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 |XXXXXXXXXXXXXXXX|
> >     -*
> >     -00000200 58 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 |XWWWWWWWWWWWWWWW|
> >     -00000210 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 57 |WWWWWWWWWWWWWWWW|
> >     ...
> >     (Run 'diff -u /home/jlayton/git/xfstests-dev/tests/generic/029.out /home/jlayton/git/xfstests-dev/results//generic/029.out.bad' to see the entire diff)
> >
> > generic/112 75s ... [failed, exit status 1]- output mismatch (see /home/jlayton/git/xfstests-dev/results//generic/112.out.bad)
> >     --- tests/generic/112.out	2020-11-09 14:47:52.694435287 -0500
> >     +++ /home/jlayton/git/xfstests-dev/results//generic/112.out.bad	2021-11-18 11:24:03.519631823 -0500
> >     @@ -4,15 +4,4 @@
> >      -----------------------------------------------
> >      fsx.0 : -A -d -N numops -S 0
> >      -----------------------------------------------
> >     -
> >     ------------------------------------------------
> >     -fsx.1 : -A -d -N numops -S 0 -x
> >     ------------------------------------------------
> >     ...
> >     (Run 'diff -u /home/jlayton/git/xfstests-dev/tests/generic/112.out /home/jlayton/git/xfstests-dev/results//generic/112.out.bad' to see the entire diff)
>
> Locally I am using the vstart_runner to set up the cluster with only one
> filesystem, and couldn't reproduce any issue you mentioned this time.
>
> From the "112.2.full" you attached, I can see that the mapread will
> read the file range of [0x686907, 0x693eb9), which should be zero but
> it's not:
>
> 3540 trunc    from 0x789289 to 0xd8b1
> 3542 write    0x8571af thru 0x85e2af  (0x7101 bytes)
> 3543 trunc    from 0x85e2b0 to 0x7dc986
> 3550 read     0x3b37bd thru 0x3b50a8  (0x18ec bytes)
> 3551 mapread  0x686907 thru 0x693eb9  (0xd5b3 bytes)
> READ BAD DATA: offset = 0x686907, size = 0xd5b3, fname = 112.2
> OFFSET      GOOD    BAD     RANGE
> 0x686907    0x0000  0x198f  0x00000
> operation# (mod 256) for the bad data may be 143
> 0x686908    0x0000  0x8fec  0x00001
> operation# (mod 256) for the bad data may be 143
>
> In theory that range should be truncated and zeroed by the first "trunc"
> from 0x789289 to 0xd8b1 above. From the failure logs, this is the same
> issue as seen before in the generic/057 test case. But locally I couldn't
> reproduce it after fixing the bug in the ceph fsize_support branch.
>
> BTW, could you reproduce this every time ? Or just occasionally ?
>

I don't see the tests fail every time. 029 fails about 50% of the time
though. The fsx tests don't seem to fail quite so often. It seems to be
hard to reproduce these when running a single test in a loop, but when I
run the "quick" group of tests, I usually get at least one failure. (For
reference, I've put a small simulation of that fsx op sequence, showing
why the range should read back as zeroes, at the end of this mail.)

FWIW, my usual xfstests run is:

    sudo ./check -g quick -E ./ceph.exclude

...with ceph.exclude having:

    ceph/001
    generic/003
    generic/531
    generic/538

Most of the excluded tests actually work, but take a really long time on
ceph. generic/003 always fails though (atime handling).

> >
> > I attached a tarball with the relevant output from the above two tests.
> >
> > The hang is also interesting, though it may be unrelated to size
> > handling. It's just a touch command, and it's hung doing this:
> >
> > [jlayton@client1 xfstests-dev]$ sudo !!
> > sudo cat /proc/100726/stack
> > [<0>] ceph_mdsc_do_request+0x169/0x270 [ceph]
> > [<0>] ceph_atomic_open+0x3eb/0x10f0 [ceph]
> > [<0>] lookup_open.isra.0+0x59c/0x8f0
> > [<0>] path_openat+0x4d2/0x10f0
> > [<0>] do_filp_open+0x131/0x230
> > [<0>] do_sys_openat2+0xe4/0x250
> > [<0>] __x64_sys_openat+0xbd/0x110
> > [<0>] do_syscall_64+0x3b/0x90
> > [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >
> > This was in dmesg:
> >
> > [ 5338.083673] run fstests generic/444 at 2021-11-18 12:07:44
> > [ 5405.607913] ceph: mds1 caps stale
> > [ 5642.657735] ceph: mds1 hung
> > [ 5644.964139] libceph: mds2 (2)192.168.1.81:6804 socket closed (con state OPEN)
> > [ 5644.977876] libceph: mds0 (2)192.168.1.81:6814 socket closed (con state OPEN)
> > [ 5645.159578] libceph: mds0 (2)192.168.1.81:6814 session reset
> > [ 5645.168000] ceph: mds0 closed our session
> > [ 5645.175634] ceph: mds0 reconnect start
> > [ 5645.196569] ceph: mds0 reconnect denied
> > [ 5648.295458] libceph: mds2 (2)192.168.1.81:6804 session reset
> > [ 5648.303578] ceph: mds2 closed our session
> > [ 5648.311251] ceph: mds2 reconnect start
> > [ 5648.330261] ceph: mds2 reconnect denied
> >
> > The logs on mds1 show this:
> >
> > Nov 18 11:52:56 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm Updating MDS map to version 106 from mon.1
> > Nov 18 11:53:42 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm Updating MDS map to version 107 from mon.1
> > Nov 18 11:56:48 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 11:57:38 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 11:58:23 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 12:00:08 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 12:01:58 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 12:03:28 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 12:03:43 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm asok_command: client ls {prefix=client ls} (starting...)
> > Nov 18 12:05:16 cephadm2 ceph-mds[2928]: mds.test.cephadm2.kfuenm Updating MDS map to version 108 from mon.1
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x2000000107f err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000085 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000088 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x2000000008b err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x2000000008e err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x2000000008e err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x2000000008f err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x2000000008f err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000090 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000090 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000091 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000092 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000092 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000093 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000094 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000094 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000095 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000096 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000096 err -116/0
> > Nov 18 12:05:34 cephadm2 ceph-mds[2928]: mds.1.cache failed to open ino 0x20000000097 err -116/0
> >
> > -116 is -ESTALE.
> >
> > That repeats for a while and eventually the client is evicted.
> >
> > I'm not sure what to make of this hang, but I've seen it happen twice
> > now. I'll see if it's reliably reproducible in a bit.
> >
> > As a side note, we probably do need to ensure that we don't continue to
> > wait on MDS ops on blacklisted sessions. If we get blacklisted, then we
> > should wake up any waiters and fail any reqs that are in flight. I'll
> > open a tracker for that soon.
>
> I tested this many times too:
>
> [root@lxbceph1 xfstests]# ./check generic/444
> FSTYP         -- ceph
> PLATFORM      -- Linux/x86_64 lxbceph1 5.15.0+
> MKFS_OPTIONS  -- 10.72.47.117:40742:/testB
> MOUNT_OPTIONS -- -o test_dummy_encryption,name=admin,nowsync,copyfrom,rasize=4096 -o context=system_u:object_r:root_t:s0 10.72.47.117:40742:/testB /mnt/kcephfs.B
>
> generic/444 2s ... 1s
> Ran: generic/444
> Passed all 1 tests
>
> [root@lxbceph1 xfstests]#
>
> I am using almost the same mount options except the "ms_mode=crc". And
> I am using the vstart_runner to set up the cluster, with only the one
> default filesystem "a".
>

The msgr version probably doesn't matter much, but you never know.
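
Coming back to the side note above about failing requests on blacklisted
sessions, the rough idea would be something along the lines of the sketch
below: once we learn the session has been rejected, walk the outstanding
MDS requests and complete them with an error so that waiters in
ceph_mdsc_do_request() stop blocking. This is an untested, illustrative
sketch only; the helper name ceph_mdsc_abort_requests() and the exact
fields and locking are approximations, not the current fs/ceph code:

/*
 * Illustrative sketch only -- not actual fs/ceph code. Abort all
 * in-flight MDS requests with the given error so that anyone blocked
 * in ceph_mdsc_do_request() wakes up instead of waiting forever.
 */
static void ceph_mdsc_abort_requests(struct ceph_mds_client *mdsc, int err)
{
	struct ceph_mds_request *req;
	struct rb_node *p;

	mutex_lock(&mdsc->mutex);
	for (p = rb_first(&mdsc->request_tree); p; p = rb_next(p)) {
		req = rb_entry(p, struct ceph_mds_request, r_node);
		req->r_err = err;                 /* record the failure */
		complete_all(&req->r_completion); /* wake up the waiter */
	}
	mutex_unlock(&mdsc->mutex);
}

...and then call it with something like -EACCES from wherever we detect
the eviction (e.g. the "reconnect denied" handling). Whether new requests
should also be rejected outright while blacklisted is a separate question
for that tracker.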
> The following are the kernel patches I am using:
>
> $ git log --oneline
>
> 33adb35fb859 (HEAD -> fscrypt_size_handling4) ceph: do not truncate pagecache if truncate size doesn't change
> 42cb4762205a (origin/wip-fscrypt-size, lxb-testing13) ceph: add truncate size handling support for fscrypt
> 428fe6185c09 ceph: add object version support for sync read
> b0e5dcc4ad03 ceph: add __ceph_sync_read helper support
> c316473f6c57 ceph: add __ceph_get_caps helper support
> 50709509eef9 ceph: handle fscrypt fields in cap messages from MDS
> 071d5fc5b21e ceph: get file size from fscrypt_file when present in inode traces
> 48c6a21c000c ceph: fscrypt_file field handling in MClientRequest messages
> f101a36542a7 ceph: size handling for encrypted inodes in cap updates
> 67f017e4edf9 libceph: add CEPH_OSD_OP_ASSERT_VER support
> 98d487ede749 (origin/wip-fscrypt-fnames) ceph: don't allow changing layout on encrypted files/directories
> ...
>

The kernel branch you're using is old. I rebased it earlier in the week
so that it sits on top of v5.16-rc1. There are fscrypt changes in there
that this set needed to be adapted for.

> For ceph:
>
> $ git log --oneline
>
> 34eafd9db1a (HEAD -> jf_fscrypt2, lxb/fsize_support) mds: add truncate size handling support for fscrypt
> 4f9c1add5ff mds: don't allow changing layout on encrypted files/directories
> 993189035ab mds: encode fscrypt_auth and fscrypt_file in appropriate mds locks
> 573c5bb1143 test: add fscrypt attribute testcases
> 6e60ab066d9 client: add support for fscrypt_auth and fscrypt_file fields
> be447f7911d mds: allow setattr to change fscrypt_auth/file
> 40bb9bb5a26 mds: populate fscrypt_auth/_file on inode creation
> b2838a2def2 client: send fscrypt_auth and fscrypt_file in MClientRequest
> 6ae96259947 mds: add fscrypt_auth and fscrypt_file fields to MClientCaps
> fd8f004f7a9 mds: encode fscrypt_auth and fscrypt_file in InodeStat
> c8abd5f34bd mds: add fscrypt opaque field to inode_t encoding
> 13619a8b96a mds: convert fscrypt flag to two opaque fields
> 0471f922a9b common: add encode and decode routines for opaque vectors of bytes
> 4035ba419de (origin/master, origin/HEAD) Merge pull request #42906 from sebastian-philipp/rm-ceph-orch-osd-i
> ...
>

I am using the same ceph branch as you are.

> Not sure why my tests are different from yours ?
>

It could be all sorts of reasons. I have 3 MDS's per fs and I suspect
vstart_runner only gives you 1. I have (consumer-grade) SSDs assigned to
each OSD too. I'm also using random pinning, so that could be a factor as
well.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>
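
For reference, here is the small, self-contained userspace sketch I
mentioned above. It replays the fsx op sequence quoted earlier in the
thread (the trunc down to 0xd8b1, the extending write at 0x8571af, and
the trunc down to 0x7dc986), assuming plain POSIX truncate and
write-past-EOF semantics where any extended region reads back as zeroes.
It only illustrates why the mapread range [0x686907, 0x693eb9] is
expected to be all zeroes; the in-memory model is an assumption for
illustration and is not part of fsx or the test suite.

#include <stdio.h>
#include <string.h>

#define MODEL_SZ 0x900000		/* big enough for every offset in the trace */

static unsigned char model[MODEL_SZ];	/* expected file contents */
static size_t model_len;		/* expected file size */

/* truncate(): data past the new EOF is dropped; extending exposes zeroes */
static void do_trunc(size_t newlen)
{
	if (newlen < model_len)
		memset(model + newlen, 0, model_len - newlen);
	model_len = newlen;
}

/* write(): writing past EOF leaves a hole that must read back as zeroes */
static void do_write(size_t off, size_t len)
{
	if (off > model_len)
		memset(model + model_len, 0, off - model_len);
	memset(model + off, 'X', len);	/* stand-in for fsx's real payload */
	if (off + len > model_len)
		model_len = off + len;
}

int main(void)
{
	size_t off;

	model_len = 0x789289;			/* file size before op 3540 */
	memset(model, 'X', model_len);		/* pretend it held old data */

	do_trunc(0xd8b1);			/* 3540: trunc 0x789289 -> 0xd8b1 */
	do_write(0x8571af, 0x7101);		/* 3542: write 0x8571af thru 0x85e2af */
	do_trunc(0x7dc986);			/* 3543: trunc 0x85e2b0 -> 0x7dc986 */

	/* 3551: mapread 0x686907 thru 0x693eb9 -- expect all zeroes */
	for (off = 0x686907; off <= 0x693eb9; off++) {
		if (model[off] != 0) {
			printf("unexpected non-zero byte at 0x%zx\n", off);
			return 1;
		}
	}
	printf("range [0x686907, 0x693eb9] is all zeroes, as expected\n");
	return 0;
}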