Dear all,

I collected memory allocation data over a period of 2 months; see the graphs here: <https://imgur.com/a/R0q6nzP>.

I need to revise my earlier statement about accelerated growth. The new graphs indicate that we are looking at linear growth, that is, probably a small memory leak in a regularly called function. I think the snippets of the heap stats and memory profiling output below should give a clue about where to look.

osd.195 is using about 2.1 GB more than it should; the memory limit is 2 GB:

osd.195 tcmalloc heap stats:
------------------------------------------------
MALLOC:     4555926984 ( 4344.9 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    288132120 (  274.8 MiB) Bytes in central cache freelist
MALLOC: +     12879104 (   12.3 MiB) Bytes in transfer cache freelist
MALLOC: +     20619552 (   19.7 MiB) Bytes in thread cache freelists
MALLOC: +     33292288 (   31.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   4910850048 ( 4683.4 MiB) Actual memory used (physical + swap)
MALLOC: +    865198080 (  825.1 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   5776048128 ( 5508.5 MiB) Virtual address space used
MALLOC:
MALLOC:         470779              Spans in use
MALLOC:             35              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.
{
    "error": "(0) Success",
    "success": true
}

It looks like the vast majority of this leak occurs in "ceph::decode"; see the top of the heap profiler allocation stats:

Total: 4011.6 MB
  1567.9  39.1%  39.1%   1815.3  45.3% ceph::decode
   457.7  11.4%  50.5%    457.7  11.4% rocksdb::BlockFetcher::ReadBlockContents
   269.7   6.7%  57.2%    269.7   6.7% std::vector::_M_default_append
   256.0   6.4%  63.6%    256.0   6.4% rocksdb::Arena::AllocateNewBlock
   243.9   6.1%  69.7%    243.9   6.1% std::_Rb_tree::_M_emplace_hint_unique
   184.1   4.6%  74.3%    184.1   4.6% CrushWrapper::get_leaves
   174.6   4.4%  78.6%    174.6   4.4% ceph::buffer::create_aligned_in_mempool
   170.3   4.2%  82.9%    170.3   4.2% ceph::buffer::malformed_input::what
   125.2   3.1%  86.0%    191.4   4.8% PGLog::IndexedLog::add
   101.1   2.5%  88.5%    101.1   2.5% CrushWrapper::decode_crush_bucket

Does this already help? If not, I have collected 126 GB of data from the heap profiler.

It would be great if this leak could be fixed. It would already be enough to extend the uptime of an OSD so that it covers the usual maintenance windows.

By the way, increasing the cache_min value helped a lot. The OSD kept a healthy amount of ONODE items in cache despite the leak, and users noticed the improvement.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
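
P.S.: In case it is useful, here is a minimal sketch of how the growth rate and the remaining headroom could be estimated from the hourly heap stats dumps (the archive layout is described further down in this thread). The file pattern, the hourly spacing of the samples and the 2 GiB target are assumptions and may need adjusting:

#!/usr/bin/env python3
"""Rough estimate of the leak rate from the hourly 'heap stats' dumps.

Sketch only. Assumptions: the files match osd.195.*.heap_stats and sort
chronologically by name, there is one dump per hour (so the sample index
is used as the time axis), and 2 GiB is the memory target to compare with.
"""
import glob
import re

TARGET_MIB = 2 * 1024      # assumed memory target of this OSD, in MiB
PATTERN = "osd.195.*.heap_stats"
RE_IN_USE = re.compile(
    r"MALLOC:\s+(\d+)\s+\(\s*[\d.]+ MiB\)\s+Bytes in use by application")

# Collect "Bytes in use by application" from every hourly dump, in MiB.
samples = []
for path in sorted(glob.glob(PATTERN)):
    with open(path) as f:
        match = RE_IN_USE.search(f.read())
    if match:
        samples.append(int(match.group(1)) / 2**20)

# Ordinary least-squares fit: usage[MiB] ~ a + b * hours.
n = len(samples)
hours = range(n)
mean_x = sum(hours) / n
mean_y = sum(samples) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, samples)) \
    / sum((x - mean_x) ** 2 for x in hours)

print(f"samples     : {n}")
print(f"growth rate : {b * 24:.1f} MiB/day")

headroom = TARGET_MIB - samples[-1]
if b > 0 and headroom > 0:
    print(f"target passed in roughly {headroom / b / 24:.1f} days")
else:
    print("target already exceeded (or no growth measured)")

For example, at a hypothetical 35 MiB per day, an OSD with 2 GiB of headroom above its target would overrun it after roughly two months.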
________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 31 August 2020 19:50:57
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: Re: OSD memory leak?

Looks like the image attachment got removed. Please find it here: https://imgur.com/a/3tabzCN

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: 31 August 2020 14:42
To: Mark Nelson; Dan van der Ster; ceph-users
Subject: Re: OSD memory leak?

Hi Dan and Mark,

sorry, took a bit longer. I uploaded a new archive containing files with the following format (https://files.dtu.dk/u/jb0uS6U9LlCfvS5L/heap_profiling-2020-08-31.tgz?l - valid 60 days):

- osd.195.profile.*.heap - raw heap dump file
- osd.195.profile.*.heap.txt - output of the conversion with --text
- osd.195.profile.*.heap-base0001.txt - output of the conversion with --text against the first dump as base
- osd.195.*.heap_stats - output of "ceph daemon osd.195 heap stats", taken every hour
- osd.195.*.mempools - output of "ceph daemon osd.195 dump_mempools", taken every hour
- osd.195.*.perf - output of "ceph daemon osd.195 perf dump", taken every hour; the counters are reset each time

Converted files are included only for the last couple of days; converting everything simply takes too long (a sketch for parallelising the conversion is at the end of this mail).

Please also find attached a recording of memory usage on one of the relevant OSD nodes. I marked restarts of all OSDs/the host with vertical red lines. What is worrying is the self-amplifying nature of the leak: it is not a linear process, it looks at least quadratic, if not exponential. Given the comparably short uptime, what we are looking at is probably still in the lower percentages, but with an increasing rate. The OSDs have just started to overrun their limit:

top - 14:38:49 up 155 days, 19:17,  1 user,  load average: 5.99, 4.59, 4.59
Tasks: 684 total,   1 running, 293 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.9 us,  0.9 sy,  0.0 ni, 89.6 id,  7.6 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 65727628 total,  6937548 free, 41921260 used, 16868820 buff/cache
KiB Swap: 93532160 total, 90199040 free,  3333120 used.  6740136 avail Mem

    PID USER  PR  NI    VIRT  RES   SHR S  %CPU %MEM     TIME+ COMMAND
4099023 ceph  20   0 5918704 3.8g  9700 S   1.7  6.1 378:37.01 /usr/bin/ceph-osd --cluster ceph -f -i 35 --setuser cep+
4097639 ceph  20   0 5340924 3.0g 11428 S  87.1  4.7  14636:30 /usr/bin/ceph-osd --cluster ceph -f -i 195 --setuser ce+
4097974 ceph  20   0 3648188 2.3g  9628 S   8.3  3.6   1375:58 /usr/bin/ceph-osd --cluster ceph -f -i 201 --setuser ce+
4098322 ceph  20   0 3478980 2.2g  9688 S   5.3  3.6   1426:05 /usr/bin/ceph-osd --cluster ceph -f -i 223 --setuser ce+
4099374 ceph  20   0 3446784 2.2g  9252 S   4.6  3.5   1142:14 /usr/bin/ceph-osd --cluster ceph -f -i 205 --setuser ce+
4098679 ceph  20   0 3832140 2.2g  9796 S   6.6  3.5   1248:26 /usr/bin/ceph-osd --cluster ceph -f -i 132 --setuser ce+
4100782 ceph  20   0 3641608 2.2g  9652 S   7.9  3.5   1278:10 /usr/bin/ceph-osd --cluster ceph -f -i 207 --setuser ce+
4095944 ceph  20   0 3375672 2.2g  8968 S   7.3  3.5   1250:02 /usr/bin/ceph-osd --cluster ceph -f -i 108 --setuser ce+
4096956 ceph  20   0 3509376 2.2g  9456 S   7.9  3.5   1157:27 /usr/bin/ceph-osd --cluster ceph -f -i 203 --setuser ce+
4099731 ceph  20   0 3563652 2.2g  8972 S   3.6  3.5   1421:48 /usr/bin/ceph-osd --cluster ceph -f -i 61 --setuser cep+
4096262 ceph  20   0 3531988 2.2g  9040 S   9.9  3.5   1600:15 /usr/bin/ceph-osd --cluster ceph -f -i 121 --setuser ce+
4100442 ceph  20   0 3359736 2.1g  9804 S   4.3  3.4   1185:53 /usr/bin/ceph-osd --cluster ceph -f -i 226 --setuser ce+
4096617 ceph  20   0 3443060 2.1g  9432 S   5.0  3.4   1449:29 /usr/bin/ceph-osd --cluster ceph -f -i 199 --setuser ce+
4097298 ceph  20   0 3483532 2.1g  9600 S   5.6  3.3   1265:28 /usr/bin/ceph-osd --cluster ceph -f -i 97 --setuser cep+
4100093 ceph  20   0 3428348 2.0g  9568 S   3.3  3.2   1298:53 /usr/bin/ceph-osd --cluster ceph -f -i 197 --setuser ce+
4095630 ceph  20   0 3440160 2.0g  8976 S   3.6  3.2   1451:35 /usr/bin/ceph-osd --cluster ceph -f -i 62 --setuser cep+

Generally speaking, increasing the cache minimum seems to help with keeping important information in RAM. Unfortunately, it also means that swap usage starts much earlier.
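
Since converting every raw dump serially is what makes this so slow, here is a minimal sketch of how the conversion to text could be parallelised. It assumes google-pprof from gperftools is installed (on some distributions the command is just pprof), that /usr/bin/ceph-osd is the same binary that produced the dumps, and that the output naming should follow the archive layout above:

#!/usr/bin/env python3
"""Convert the raw tcmalloc heap dumps to text reports in parallel.

Sketch only; see the assumptions in the accompanying mail.
"""
import glob
import subprocess
from concurrent.futures import ProcessPoolExecutor

BINARY = "/usr/bin/ceph-osd"
DUMPS = sorted(glob.glob("osd.195.profile.*.heap"))
BASE = DUMPS[0]            # first dump, used as the common baseline

def convert(dump):
    # Plain text report of the dump itself (-> *.heap.txt).
    with open(dump + ".txt", "w") as out:
        subprocess.run(["google-pprof", "--text", BINARY, dump],
                       stdout=out, check=True)
    # Same report relative to the first dump, i.e. growth only
    # (-> *.heap-base0001.txt).
    if dump != BASE:
        with open(dump + "-base0001.txt", "w") as out:
            subprocess.run(["google-pprof", "--text", "--base=" + BASE,
                            BINARY, dump], stdout=out, check=True)
    return dump

# 8 workers keeps the load bounded; raise this if the host has idle cores.
with ProcessPoolExecutor(max_workers=8) as pool:
    for done in pool.map(convert, DUMPS):
        print("converted", done)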
Best regards and thanks for your help,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx