MDS uses up to 150 GByte of memory during journal replay

Hello everyone,

We are running a small cluster of 5 machines with 48 OSDs / 5 MDSs / 5 MONs on Luminous 12.2.10 and Debian Stretch 9.6. With a single-MDS configuration everything works fine; looking at the active MDS's memory, it uses ~1 GByte for its cache, as configured:

$ watch ceph tell mds.$(hostname) heap stats

mds.e tcmalloc heap stats:------------------------------------------------
MALLOC:     1172867096 ( 1118.5 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +     39289912 (   37.5 MiB) Bytes in central cache freelist
MALLOC: +     17245344 (   16.4 MiB) Bytes in transfer cache freelist
MALLOC: +     34303760 (   32.7 MiB) Bytes in thread cache freelists
MALLOC: +      5796032 (    5.5 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   1269502144 ( 1210.7 MiB) Actual memory used (physical + swap)
MALLOC: +     19775488 (   18.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   1289277632 ( 1229.6 MiB) Virtual address space used
MALLOC:
MALLOC:          70430              Spans in use
MALLOC:             17              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------- 
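For reference, the ~1 GByte cache figure corresponds to the mds_cache_memory_limit setting; a minimal ceph.conf snippet on the MDS hosts would look like the following (the value shown is simply the Luminous default of 1 GiB and is meant as an illustration rather than a copy of our actual configuration):

[mds]
# limit the MDS cache to ~1 GiB (the Luminous default)
mds_cache_memory_limit = 1073741824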
$ ceph versions

{
 "mon": {
     "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
 },
 "mgr": {
     "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 3
 },
 "osd": {
     "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 48
 },
 "mds": {
     "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 5
 },
 "overall": {
     "ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94) luminous (stable)": 61
 }
}

------------- 
$ ceph -s

cluster:
 id:     .... c9024
 health: HEALTH_OK

services:
 mon: 5 daemons, quorum a,b,c,d,e
 mgr: libra(active), standbys: b, a
 mds: cephfs-1/1/1 up  {0=e=up:active}, 1 up:standby-replay, 3 up:standby
 osd: 48 osds: 48 up, 48 in

data:
 pools:   2 pools, 2052 pgs
 objects: 44.44M objects, 52.3TiB
 usage:   107TiB used, 108TiB / 216TiB avail
 pgs:     2051 active+clean
          1    active+clean+scrubbing+deep

io:
 client:   85.3KiB/s rd, 3.17MiB/s wr, 45op/s rd, 26op/s wr
------------- 

However, as soon as we use "ceph fs set cephfs max_mds 2" to add a second MDS to the picture, things get out of hand within seconds, although in a rather unexpected way: the standby MDS that is brought in works fine and shows normal memory consumption. The two machines that start replaying the journal in order to become standby servers, however, immediately accumulate dozens of GByte of memory and climb to about 150 GByte. They begin to use swap space almost immediately, which brings the load up to about 80 within seconds and makes all other processes on those hosts (mainly OSDs) unreachable.
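
For anyone who wants to reproduce this, the trigger is nothing more than the following, run from a monitor host; the second command is only there to follow the MDS states while the two daemons replay, and the interval is arbitrary:

$ ceph fs set cephfs max_mds 2
$ watch -n 2 ceph mds stat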

As the machine becomes basically unreachable once this happens, it is only possible to get memory statistics right when things start to go wrong. After that it is no longer possible to get a memory dump, as the OS as a whole is blocked by swapping.

$ watch ceph tell mds.$(hostname) heap stats

mds.a tcmalloc heap stats:------------------------------------------------
MALLOC:    36113137024 (34440.2 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +      7723144 (    7.4 MiB) Bytes in central cache freelist
MALLOC: +      2523264 (    2.4 MiB) Bytes in transfer cache freelist
MALLOC: +      2460024 (    2.3 MiB) Bytes in thread cache freelists
MALLOC: +     41185472 (   39.3 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  36167028928 (34491.6 MiB) Actual memory used (physical + swap)
MALLOC: +      1417216 (    1.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  36168446144 (34492.9 MiB) Virtual address space used
MALLOC:
MALLOC:          38476              Spans in use
MALLOC:             13              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------- 
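
If it helps with the analysis, we could try to capture a tcmalloc heap profile during the first seconds of the replay, before the host starts swapping, along these lines (these are the standard heap-profiler hooks exposed via "ceph tell"; we have not managed to complete such a run yet):

$ ceph tell mds.a heap start_profiler
# let the journal replay run for a few seconds
$ ceph tell mds.a heap dump
$ ceph tell mds.a heap stop_profiler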

Please also find attached the compressed log file of one of the two new standby MDSs while it is trying to replay the fs journal.

As soon as the number of MDSs is set back to 1 (using "ceph fs set cephfs max_mds 1" followed by "ceph mds deactivate 1"), things calm down and the cluster goes back to normal. Is this a known problem with Luminous, and what can be done about it so that the multi-MDS feature can be used?
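
For completeness, the rollback that restores normal behaviour is just the two commands quoted above, optionally followed by a status check to confirm that rank 1 is gone again:

$ ceph fs set cephfs max_mds 1
$ ceph mds deactivate 1
$ ceph mds stat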

As all servers used here run Debian, it is unfortunately not possible to upgrade to Mimic, since it seems Mimic cannot / will not be made available for Debian Stretch due to the toolchain issue described elsewhere.

Thank you for any help and pointers in the right direction!

Best,
Matthias

----------------------------------------------------------------------------------------------------
dizmo - The Interface of Things
http://www.dizmo.com, Phone +41 52 267 88 50, Twitter @dizmos
dizmo inc, Universitätsstrasse 53, CH-8006 Zurich, Switzerland

Attachment: Log of mds.b replaying fs journal.tbz

