Dear all,

I collected memory allocation data over a period of 2 months; see the graphs here: <https://imgur.com/a/R0q6nzP>.

I need to revise my earlier statement about accelerated growth. The new graphs indicate that we are looking at linear growth, that is, probably a small memory leak in a regularly called function. I think the snippets of the heap stats and memory profiling output below should give a clue about where to look.

osd.195 is using about 2.1 GB more than it should; the memory limit is 2 GB:

osd.195 tcmalloc heap stats:
------------------------------------------------
MALLOC:     4555926984 ( 4344.9 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    288132120 (  274.8 MiB) Bytes in central cache freelist
MALLOC: +     12879104 (   12.3 MiB) Bytes in transfer cache freelist
MALLOC: +     20619552 (   19.7 MiB) Bytes in thread cache freelists
MALLOC: +     33292288 (   31.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   4910850048 ( 4683.4 MiB) Actual memory used (physical + swap)
MALLOC: +    865198080 (  825.1 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   5776048128 ( 5508.5 MiB) Virtual address space used
MALLOC:
MALLOC:         470779              Spans in use
MALLOC:             35              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
{
    "error": "(0) Success",
    "success": true
}

It looks like the vast majority of this leak occurs in "ceph::decode"; see the top of the heap profiler allocation stats:

Total: 4011.6 MB
  1567.9  39.1%  39.1%   1815.3  45.3% ceph::decode
   457.7  11.4%  50.5%    457.7  11.4% rocksdb::BlockFetcher::ReadBlockContents
   269.7   6.7%  57.2%    269.7   6.7% std::vector::_M_default_append
   256.0   6.4%  63.6%    256.0   6.4% rocksdb::Arena::AllocateNewBlock
   243.9   6.1%  69.7%    243.9   6.1% std::_Rb_tree::_M_emplace_hint_unique
   184.1   4.6%  74.3%    184.1   4.6% CrushWrapper::get_leaves
   174.6   4.4%  78.6%    174.6   4.4% ceph::buffer::create_aligned_in_mempool
   170.3   4.2%  82.9%    170.3   4.2% ceph::buffer::malformed_input::what
   125.2   3.1%  86.0%    191.4   4.8% PGLog::IndexedLog::add
   101.1   2.5%  88.5%    101.1   2.5% CrushWrapper::decode_crush_bucket

Does this already help? If not, I have collected 126 GB of data from the heap profiler.

It would be great if this leak could be fixed. It would already be enough to extend the uptime of an OSD so that it covers the usual maintenance windows.

By the way, increasing the cache_min value helped a lot. The OSD kept a healthy amount of ONODE items in cache despite the leak, and users noticed the improvement.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
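
P.S.: In case it is useful, here is a minimal sketch of how the growth rate and the remaining headroom could be estimated from the hourly heap stats dumps (the archive layout is described further down in this thread). The file pattern, the hourly spacing of the samples and the 2 GiB target are assumptions and may need adjusting:

#!/usr/bin/env python3
"""Rough estimate of the leak rate from the hourly 'heap stats' dumps.

Sketch only. Assumptions: the files match osd.195.*.heap_stats and sort
chronologically by name, there is one dump per hour (so the sample index
is used as the time axis), and 2 GiB is the memory target to compare with.
"""
import glob
import re

TARGET_MIB = 2 * 1024      # assumed memory target of this OSD, in MiB
PATTERN = "osd.195.*.heap_stats"
RE_IN_USE = re.compile(
    r"MALLOC:\s+(\d+)\s+\(\s*[\d.]+ MiB\)\s+Bytes in use by application")

# Collect "Bytes in use by application" from every hourly dump, in MiB.
samples = []
for path in sorted(glob.glob(PATTERN)):
    with open(path) as f:
        match = RE_IN_USE.search(f.read())
    if match:
        samples.append(int(match.group(1)) / 2**20)

# Ordinary least-squares fit: usage[MiB] ~ a + b * hours.
n = len(samples)
hours = range(n)
mean_x = sum(hours) / n
mean_y = sum(samples) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, samples)) \
    / sum((x - mean_x) ** 2 for x in hours)

print(f"samples     : {n}")
print(f"growth rate : {b * 24:.1f} MiB/day")

headroom = TARGET_MIB - samples[-1]
if b > 0 and headroom > 0:
    print(f"target passed in roughly {headroom / b / 24:.1f} days")
else:
    print("target already exceeded (or no growth measured)")

For example, at a hypothetical 35 MiB per day, an OSD with 2 GiB of headroom above its target would overrun it after roughly two months.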
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 31 August 2020 19:50:57
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: Re: OSD memory leak?

Looks like the image attachment got removed. Please find it here: https://imgur.com/a/3tabzCN

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 31 August 2020 14:42
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: Re: OSD memory leak?

Hi Dan and Mark,

sorry, took a bit longer. I uploaded a new archive containing files with the following format (https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - valid 60 days):

- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of the conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of the conversion with --text against the first dump as base
- osd.195.*.heap_stats - output of "ceph daemon osd.195 heap stats", taken every hour
- osd.195.*.mempools - output of "ceph daemon osd.195 dump_mempools", taken every hour
- osd.195.*.perf - output of "ceph daemon osd.195 perf dump", taken every hour; the counters are reset each time

Converted files are included only for the last couple of days; converting everything simply takes too long (a sketch for parallelising the conversion is at the end of this mail).

Please also find attached a recording of memory usage on one of the relevant OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What is worrying is the self-amplifying nature of the leak: it is not a linear process, it looks at least quadratic, if not exponential. Given the comparably short uptime, what we are looking at is probably still in the lower percentages, but with an increasing rate. The OSDs have just started to overrun their limit:

top - 14:38:49 up 155 days, 19:17,  1 user,  load average: 5.99, 4.59, 4.59
Tasks: 684 total,   1 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.9 sy,  0.0 ni, 89.6 id,  7.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 65727628 total,  6937548 free, 41921260 used, 16868820 buff/cache
KiB Swap: 93532160 total, 90199040 free,  3333120 used.  6740136 avail Mem

    PID USER  PR  NI    VIRT  RES   SHR S  %CPU %MEM     TIME+ COMMAND
4099023 ceph  20   0 5918704 3.8g  9700 S   1.7  6.1 378:37.01 /usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+
4097639 ceph  20   0 5340924 3.0g 11428 S  87.1  4.7  14636:30 /usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+
4097974 ceph  20   0 3648188 2.3g  9628 S   8.3  3.6   1375:58 /usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+
4098322 ceph  20   0 3478980 2.2g  9688 S   5.3  3.6   1426:05 /usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+
4099374 ceph  20   0 3446784 2.2g  9252 S   4.6  3.5   1142:14 /usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+
4098679 ceph  20   0 3832140 2.2g  9796 S   6.6  3.5   1248:26 /usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+
4100782 ceph  20   0 3641608 2.2g  9652 S   7.9  3.5   1278:10 /usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+
4095944 ceph  20   0 3375672 2.2g  8968 S   7.3  3.5   1250:02 /usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+
4096956 ceph  20   0 3509376 2.2g  9456 S   7.9  3.5   1157:27 /usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+
4099731 ceph  20   0 3563652 2.2g  8972 S   3.6  3.5   1421:48 /usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+
4096262 ceph  20   0 3531988 2.2g  9040 S   9.9  3.5   1600:15 /usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+
4100442 ceph  20   0 3359736 2.1g  9804 S   4.3  3.4   1185:53 /usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+
4096617 ceph  20   0 3443060 2.1g  9432 S   5.0  3.4   1449:29 /usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+
4097298 ceph  20   0 3483532 2.1g  9600 S   5.6  3.3   1265:28 /usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+
4100093 ceph  20   0 3428348 2.0g  9568 S   3.3  3.2   1298:53 /usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+
4095630 ceph  20   0 3440160 2.0g  8976 S   3.6  3.2   1451:35 /usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+

Generally speaking, increasing the cache minimum seems to help with keeping important information in RAM. Unfortunately, it also means that swap usage starts much earlier.
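
Since converting every raw dump serially is what makes this so slow, here is a minimal sketch of how the conversion to text could be parallelised. It assumes google-pprof from gperftools is installed (on some distributions the command is just pprof), that /usr/bin/ceph-osd is the same binary that produced the dumps, and that the output naming should follow the archive layout above:

#!/usr/bin/env python3
"""Convert the raw tcmalloc heap dumps to text reports in parallel.

Sketch only; see the assumptions in the accompanying mail.
"""
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

BINARY = "/usr/bin/ceph-osd"
DUMPS = sorted(glob.glob("osd.195.profile.*.heap"))
BASE = DUMPS[0]            # first dump, used as the common baseline

def convert(dump):
    # Plain text report of the dump itself (-> *.heap.txt).
    with open(dump + ".txt", "w") as out:
        subprocess.run(["google-pprof", "--text", BINARY, dump],
                       stdout=out, check=True)
    # Same report relative to the first dump, i.e. growth only
    # (-> *.heap-base0001.txt).
    if dump != BASE:
        with open(dump + "-base0001.txt", "w") as out:
            subprocess.run(["google-pprof", "--text", "--base=" + BASE,
                            BINARY, dump], stdout=out, check=True)
    return dump

# 8 workers keeps the load bounded; raise this if the host has idle cores.
with ProcessPoolExecutor(max_workers=8) as pool:
    for done in pool.map(convert, DUMPS):
        print("converted", done)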
Best regards and thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx