Re: MDS uses up to 150 GByte of memory during journal replay

This is likely caused by http://tracker.ceph.com/issues/37399.

Regards
Yan, Zheng



On Sat, Jan 5, 2019 at 5:44 PM Matthias Aebi <maebi@xxxxxxxxx> wrote:
>
> Hello everyone,
>
> We are running a small cluster on 5 machines with 48 OSDs / 5 MDSs / 5 MONs, based on Luminous 12.2.10 and Debian Stretch 9.6. With a single-MDS configuration everything works fine; looking at the active MDS's memory, it uses ~1 GByte for cache, as configured:
>
> $ watch ceph tell mds.$(hostname) heap stats
>
> mds.e tcmalloc heap stats:------------------------------------------------
> MALLOC:     1172867096 ( 1118.5 MiB) Bytes in use by application
> MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> MALLOC: +     39289912 (   37.5 MiB) Bytes in central cache freelist
> MALLOC: +     17245344 (   16.4 MiB) Bytes in transfer cache freelist
> MALLOC: +     34303760 (   32.7 MiB) Bytes in thread cache freelists
> MALLOC: +      5796032 (    5.5 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   1269502144 ( 1210.7 MiB) Actual memory used (physical + swap)
> MALLOC: +     19775488 (   18.9 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   1289277632 ( 1229.6 MiB) Virtual address space used
> MALLOC:
> MALLOC:          70430              Spans in use
> MALLOC:             17              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> -------------
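> For reference, the ~1 GiB cache footprint above is governed by mds_cache_memory_limit, which defaults to 1 GiB on Luminous (the limit covers the cache, not total RSS, so the process ends up slightly above it). A minimal sketch of how that would be set, either in ceph.conf or at runtime; the explicit value here is just the default spelled out:
>
> [mds]
>     mds cache memory limit = 1073741824    # 1 GiB, the Luminous default
>
> $ ceph tell mds.$(hostname) injectargs '--mds_cache_memory_limit=1073741824'
> -------------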
> $ ceph versions
>
> {
>  "mon": {
>      "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
>  },
>  "mgr": {
>      "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 3
>  },
>  "osd": {
>      "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 48
>  },
>  "mds": {
>      "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
>  },
>  "overall": {
>      "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 61
>  }
> }
> -------------
> $ ceph -s
>
> cluster:
>  id:     .... c9024
>  health: HEALTH_OK
>
> services:
>  mon: 5 daemons, quorum a,b,c,d,e
>  mgr: libra(active), standbys: b, a
>  mds: cephfs-1/1/1 up  {0=e=up:active}, 1 up:standby-replay, 3 up:standby
>  osd: 48 osds: 48 up, 48 in
>
> data:
>  pools:   2 pools, 2052 pgs
>  objects: 44.44M objects, 52.3TiB
>  usage:   107TiB used, 108TiB / 216TiB avail
>  pgs:     2051 active+clean
>           1    active+clean+scrubbing+deep
>
> io:
>  client:   85.3KiB/s rd, 3.17MiB/s wr, 45op/s rd, 26op/s wr
> -------------
>
> However, as soon as we use "ceph fs set cephfs max_mds 2" to add a second MDS, things get out of hand within seconds, although in a rather unexpected way: the standby MDS that is brought in works fine and shows normal memory consumption. However, the two machines that then start to replay the journal in order to become standby servers immediately accumulate dozens of GByte of memory, growing to about 150 GByte, and almost immediately start using swap space, which drives the load up to about 80 within seconds and makes all other processes (mainly OSDs) unreachable.
>
> As the machine becomes essentially unreachable when this happens, it is only possible to capture memory statistics right when things start to go wrong. After that it is no longer possible to get a memory dump, as the OS as a whole is blocked by swapping.
>
> $ watch ceph tell mds.$(hostname) heap stats
>
> mds.a tcmalloc heap stats:------------------------------------------------
> MALLOC:    36113137024 (34440.2 MiB) Bytes in use by application
> MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
> MALLOC: +      7723144 (    7.4 MiB) Bytes in central cache freelist
> MALLOC: +      2523264 (    2.4 MiB) Bytes in transfer cache freelist
> MALLOC: +      2460024 (    2.3 MiB) Bytes in thread cache freelists
> MALLOC: +     41185472 (   39.3 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =  36167028928 (34491.6 MiB) Actual memory used (physical + swap)
> MALLOC: +      1417216 (    1.4 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =  36168446144 (34492.9 MiB) Virtual address space used
> MALLOC:
> MALLOC:          38476              Spans in use
> MALLOC:             13              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> -------------
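> Since the host becomes unusable so quickly, one crude way to still capture numbers (a sketch only, using nothing beyond the heap stats command above) is to log them in a loop from the moment max_mds is raised, so the last readings before the swap storm survive on disk:
>
> $ while sleep 5; do date; ceph tell mds.$(hostname) heap stats; done >> /var/tmp/mds-heap.log 2>&1 &
> -------------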
>
> Please also find attached the zipped log file of one of the two new standby MDSs while it is trying to replay the fs journal.
>
> As soon as the number of MDSs is set back to 1 (using "ceph fs set cephfs max_mds 1" and "ceph mds deactivate 1"), things calm down and the cluster returns to normal. Is this a known problem with Luminous, and is there anything that can be done about it so that the multi-MDS feature can be used?
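>
> Spelled out, the rollback is roughly the following; the "cephfs:1" role syntax and the final status check are just added here for completeness:
>
> $ ceph fs set cephfs max_mds 1
> $ ceph mds deactivate cephfs:1    # or simply "ceph mds deactivate 1" with a single filesystem
> $ ceph mds stat                   # wait until only rank 0 is shown as up:active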
>
> As all servers used here run Debian, it is unfortunately not possible to upgrade to Mimic, as it seems this cannot / will not be made available for Debian Stretch due to the toolchain issue described elsewhere.
>
> Thank you for any help and pointers in the right direction!
>
> Best,
> Matthias
>
> ----------------------------------------------------------------------------------------------------
> dizmo - The Interface of Things
> http://www.dizmo.com, Phone +41 52 267 88 50, Twitter @dizmos
> dizmo inc, Universitätsstrasse 53, CH-8006 Zurich, Switzerland
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



