Re: Ceph Multi Mds Trim Log Slow

Hi,

I'm still new to Ceph. We are seeing similar problems with CephFS here.

ceph version 14.2.0 (3a54b2b6d167d4a2a19e003a705696d4fe619afc) nautilus (stable)
on Debian GNU/Linux buster/sid

# ceph health detail
HEALTH_WARN 1 MDSs report slow requests; 1 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsmds3(mds.0): 13 slow requests are blocked > 30 secs
MDS_TRIM 1 MDSs behind on trimming
    mdsmds3(mds.0): Behind on trimming (33924/125) max_segments: 125, num_segments: 33924
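
To get more detail on the blocked requests I have been looking at the op tracker on the active MDS via its admin socket; I'm not sure these are the most useful commands here, so corrections are welcome:

# ceph daemon mds.mds3 dump_ops_in_flight
# ceph daemon mds.mds3 dump_blocked_ops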



The workload is "doveadm backup" of more than 500 mail folders from a local ext4 filesystem to CephFS.
* There are ~180'000 files with a strange file size distribution:

# NumSamples = 181056; MIN_SEEN = 377; MAX_SEEN = 584835624
# Mean = 4477785.646005; Variance = 31526763457775.421875; SD = 5614869.852256
        377 -     262502 [ 56652]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 31.29%
     262502 -     524627 [  4891]: ∎∎∎∎  2.70%
     524627 -     786752 [  3498]: ∎∎∎  1.93%
     786752 -    1048878 [  2770]: ∎∎∎  1.53%
    1048878 -    1311003 [  2460]: ∎∎  1.36%
    1311003 -    1573128 [  2197]: ∎∎  1.21%
    1573128 -    1835253 [  2014]: ∎∎  1.11%
    1835253 -    2097378 [  1961]: ∎∎  1.08%
    2097378 -    2359503 [  2244]: ∎∎  1.24%
    2359503 -    2621628 [  1890]: ∎∎  1.04%
    2621628 -    2883754 [  1897]: ∎∎  1.05%
    2883754 -    3145879 [  2188]: ∎∎  1.21%
    3145879 -    3408004 [  2579]: ∎∎  1.42%
    3408004 -    3670129 [  3396]: ∎∎∎  1.88%
    3670129 -    3932254 [  5173]: ∎∎∎∎  2.86%
    3932254 -    4194379 [ 24847]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 13.72%
    4194379 -    4456505 [  1512]: ∎∎  0.84%
    4456505 -    4718630 [  1394]: ∎∎  0.77%
    4718630 -    4980755 [  1412]: ∎∎  0.78%
    4980755 -  584835624 [ 56081]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ 30.97%

* There are two snapshots of the main directory the mails are backed up to.
* There are three subdirectories in which a simple ls never returns.
* The CephFS is mounted with the kernel driver on Ubuntu 18.04.2 LTS, kernel 4.15.0-48-generic.
* The behaviour is the same with ceph-fuse (FUSE library version 2.9.7), except that there I can't even interrupt the ls (client-side check below).
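
On the kernel client I also meant to check which requests of the hung ls are still outstanding against the MDS, via debugfs (assuming debugfs is mounted and that this is the right place to look):

# cat /sys/kernel/debug/ceph/*/mdsc
# cat /sys/kernel/debug/ceph/*/osdc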

Reducing the number of active MDS daemons for our CephFS to 1 made no difference;
the number of segments is still rising (see the ceph -w output below).
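
For reference, this is how I reduced the number of active MDS daemons (the filesystem is named cephfs_1):

# ceph fs set cephfs_1 max_mds 1

I was also wondering whether temporarily raising mds_log_max_segments, e.g. to 256 (the value is only a guess on my part), would be a sensible workaround or would just hide the problem:

# ceph config set mds mds_log_max_segments 256
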
# ceph -w
  cluster:
    id:     6cba13d1-b814-489c-9aac-9c04aaf78720
    health: HEALTH_WARN
            1 MDSs report slow requests
            1 MDSs behind on trimming
 
  services:
    mon: 3 daemons, quorum mon1,mon2,mon3 (age 3d)
    mgr: cephsible(active, since 27h), standbys: mon3, mon1
    mds: cephfs_1:2 {0=mds3=up:active,1=mds2=up:stopping} 1 up:standby
    osd: 30 osds: 30 up (since 4w), 30 in (since 5w)
 
  data:
    pools:   5 pools, 393 pgs
    objects: 607.74k objects, 1.5 TiB
    usage:   6.9 TiB used, 160 TiB / 167 TiB avail
    pgs:     393 active+clean
 

2019-05-03 11:40:17.916193 mds.mds3 [WRN] 15 slow requests, 0 included below; oldest blocked for > 342610.193367 secs

It seems that stopping one of the two MDS daemons (rank 1, shown as up:stopping above) never completes.
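
To see whether rank 1 is actually making progress with its shutdown I was going to watch it with something like the following (again, I'm not sure these are the right commands):

# ceph fs status cephfs_1
# ceph daemon mds.mds2 ops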

How can I debug this further?
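
Would raising the MDS debug level be the next step? I was thinking of something like the following on the active MDS, turning it back down afterwards, but I don't know which level is actually useful here:

# ceph daemon mds.mds3 config set debug_mds 10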

Thanks in advance.
Lars
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



