Dear Mark,

thanks for the quick answer. I would try the memory profiler if I could find any documentation on it. In fact, I just guessed the "heap stats" command and have a hard time finding anything on the OSD daemon commands. Could you possibly point me to something?

Also, how should I interpret the mempools? Is it correct to say that, out of the memory_target, only the mempool total is actually used and the remaining memory is lost to leaks? For example, for OSD 256 I get the stats below after just 2 months of uptime. Am I looking at a 5.5GB memory leak here?

# ceph config get osd.256 osd_memory_target
8589934592

# ceph daemon osd.256 heap stats
osd.256 tcmalloc heap stats:------------------------------------------------
MALLOC:     7216067616 ( 6881.8 MiB) Bytes in use by application
MALLOC: +       229376 (    0.2 MiB) Bytes in page heap freelist
MALLOC: +   1222913888 ( 1166.3 MiB) Bytes in central cache freelist
MALLOC: +       278016 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 18446744073692937856 (17592186044400.2 MiB) Bytes in thread cache freelists
MALLOC: +     52166656 (   49.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   8475041792 ( 8082.4 MiB) Actual memory used (physical + swap)
MALLOC: +   2010464256 ( 1917.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  10485506048 ( 9999.8 MiB) Virtual address space used
MALLOC:
MALLOC:         765182 Spans in use
MALLOC:             48 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------

# ceph daemon osd.256 dump_mempools
{
    "mempool": {
        "by_pool": {
            "bloom_filter": { "items": 0, "bytes": 0 },
            "bluestore_alloc": { "items": 2300682, "bytes": 18405456 },
            "bluestore_cache_data": { "items": 52390, "bytes": 306843648 },
            "bluestore_cache_onode": { "items": 256153, "bytes": 145494904 },
            "bluestore_cache_other": { "items": 92199353, "bytes": 656620069 },
            "bluestore_fsck": { "items": 0, "bytes": 0 },
            "bluestore_txc": { "items": 4, "bytes": 2752 },
            "bluestore_writing_deferred": { "items": 122, "bytes": 1864924 },
            "bluestore_writing": { "items": 3673, "bytes": 18440192 },
            "bluefs": { "items": 11867, "bytes": 220504 },
            "buffer_anon": { "items": 353734, "bytes": 1180837372 },
            "buffer_meta": { "items": 91646, "bytes": 5865344 },
            "osd": { "items": 134, "bytes": 1557616 },
            "osd_mapbl": { "items": 84, "bytes": 8479562 },
            "osd_pglog": { "items": 487004, "bytes": 166094788 },
            "osdmap": { "items": 117697, "bytes": 2080280 },
            "osdmap_mapping": { "items": 0, "bytes": 0 },
            "pgmap": { "items": 0, "bytes": 0 },
            "mds_co": { "items": 0, "bytes": 0 },
            "unittest_1": { "items": 0, "bytes": 0 },
            "unittest_2": { "items": 0, "bytes": 0 }
        },
        "total": { "items": 95874543, "bytes": 2512807411 }
    }
}

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 13 July 2020 15:39:50
To: ceph-users@xxxxxxx
Subject: Re: OSD memory leak?

Hi Frank,

So the osd_memory_target code will basically shrink the size of the bluestore and rocksdb caches to attempt to keep the overall mapped (not RSS!) memory of the process below the target. It's "best effort" in that it can't guarantee the process will fit within a given target; assuming we are over target, it will shrink the caches down to some minimum value, and that's it.

2GB per OSD is a pretty ambitious target; it's the lowest osd_memory_target we recommend setting. I'm a little surprised the OSD is consuming this much memory with a 2GB target, though. Looking at your mempool dump, I see very little memory allocated to the caches. In fact, the majority is taken up by osdmap (it looks like you have a decent number of OSDs) and pglog. That indicates the memory autotuning is probably working but simply can't do anything more to help; something else is taking up the memory. Figure you've got a little shy of 500MB for the mempools.
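For reference, the per-pool figures can be totaled straight from dump_mempools output. A rough sketch, assuming python3 is available on the OSD host (osd.101 used as an example id; any daemon with an admin socket works the same way):

```shell
# Sum and rank the mempool byte counts for one OSD.
# osd.101 is only an example id; substitute the OSD you are inspecting.
ceph daemon osd.101 dump_mempools | python3 -c '
import json, sys
pools = json.load(sys.stdin)["mempool"]
# Largest consumers first; skip empty pools.
for name, p in sorted(pools["by_pool"].items(), key=lambda kv: -kv[1]["bytes"]):
    if p["bytes"]:
        print("%-28s %12d bytes" % (name, p["bytes"]))
print("%-28s %12d bytes" % ("total", pools["total"]["bytes"]))
'
```

Comparing that total against `ceph config get osd.N osd_memory_target` shows how much of the budget the tracked pools actually account for; the remainder is heap overhead, RocksDB, and other untracked allocations rather than necessarily a leak.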
RocksDB will take up more (potentially quite a bit more if you have memtables backing up waiting to be flushed to L0), and there are other things in the OSD itself that can take up memory as well.

If you feel comfortable experimenting, you could try changing the rocksdb WAL/memtable settings. By default we have up to 4 256MB WAL buffers. You could instead try something like 2 64MB buffers, but be aware this could cause slow performance or even temporary write stalls if you have fast storage. Still, this would only give you back up to ~0.9GB.

Since you are on mimic, you might also want to check what your kernel's transparent huge pages configuration is. I don't remember if we backported Patrick's fix to always avoid THP for ceph processes. If your kernel is set to "always", you might consider trying it with "madvise".

Alternately, have you tried the built-in tcmalloc heap profiler? You might be able to get a better sense of where memory is being used with that as well.

Mark

On 7/13/20 7:07 AM, Frank Schilder wrote:
> Hi all,
>
> On a mimic 13.2.8 cluster I observe a gradual increase of memory usage by the OSD daemons, in particular under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G in virtual size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G target. There are some overshoots, but these go down again during periods with less load.
>
> What I observe now is that the actual memory consumption slowly grows and OSDs start using more than 2G of virtual memory. I see this as slowly growing swap usage despite having more RAM available (swappiness=10). That indicates allocated but unused memory, or memory not accessed for a long time: usually a leak.
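The THP check and heap-profiler run suggested in Mark's reply above look roughly like this. This is a sketch, not verified on mimic: the profile file location and docs path are assumptions based on the Ceph memory-profiling troubleshooting docs, and osd.101 is just an example id (run as root on the OSD host):

```shell
# 1) Transparent huge pages: [always] can inflate daemon RSS; madvise is safer.
cat /sys/kernel/mm/transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled

# 2) Built-in tcmalloc heap profiler (see rados/troubleshooting/memory-profiling
#    in the Ceph docs):
ceph daemon osd.101 heap start_profiler
# ... leave it running under load for a while ...
ceph daemon osd.101 heap dump
ceph daemon osd.101 heap stop_profiler
# Profiles are written next to the OSD logs (e.g.
# /var/log/ceph/osd.101.profile.0001.heap) and can be read with pprof
# from gperftools:
pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.101.profile.0001.heap
```

`ceph daemon osd.N heap release` can also be issued to hand freed pages back to the OS, which helps distinguish tcmalloc holding onto memory from a genuine leak.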
> Here are some heap stats.
>
> Before restart:
>
> osd.101 tcmalloc heap stats:------------------------------------------------
> MALLOC:     3438940768 ( 3279.6 MiB) Bytes in use by application
> MALLOC: +      5611520 (    5.4 MiB) Bytes in page heap freelist
> MALLOC: +    257307352 (  245.4 MiB) Bytes in central cache freelist
> MALLOC: +       357376 (    0.3 MiB) Bytes in transfer cache freelist
> MALLOC: +      6727368 (    6.4 MiB) Bytes in thread cache freelists
> MALLOC: +     25559040 (   24.4 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
> MALLOC: +    575946752 (  549.3 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   4310450176 ( 4110.8 MiB) Virtual address space used
> MALLOC:
> MALLOC:         382884 Spans in use
> MALLOC:             35 Thread heaps in use
> MALLOC:           8192 Tcmalloc page size
> ------------------------------------------------
>
> # ceph daemon osd.101 dump_mempools
> {
>     "mempool": {
>         "by_pool": {
>             "bloom_filter": { "items": 0, "bytes": 0 },
>             "bluestore_alloc": { "items": 4691828, "bytes": 37534624 },
>             "bluestore_cache_data": { "items": 0, "bytes": 0 },
>             "bluestore_cache_onode": { "items": 51, "bytes": 28968 },
>             "bluestore_cache_other": { "items": 5761276, "bytes": 46292425 },
>             "bluestore_fsck": { "items": 0, "bytes": 0 },
>             "bluestore_txc": { "items": 67, "bytes": 46096 },
>             "bluestore_writing_deferred": { "items": 208, "bytes": 26037057 },
>             "bluestore_writing": { "items": 52, "bytes": 6789398 },
>             "bluefs": { "items": 9478, "bytes": 183720 },
>             "buffer_anon": { "items": 291450, "bytes": 28093473 },
>             "buffer_meta": { "items": 546, "bytes": 34944 },
>             "osd": { "items": 98, "bytes": 1139152 },
>             "osd_mapbl": { "items": 78, "bytes": 8204276 },
>             "osd_pglog": { "items": 341944, "bytes": 120607952 },
>             "osdmap": { "items": 10687217, "bytes": 186830528 },
>             "osdmap_mapping": { "items": 0, "bytes": 0 },
>             "pgmap": { "items": 0, "bytes": 0 },
>             "mds_co": { "items": 0, "bytes": 0 },
>             "unittest_1": { "items": 0, "bytes": 0 },
>             "unittest_2": { "items": 0, "bytes": 0 }
>         },
>         "total": { "items": 21784293, "bytes": 461822613 }
>     }
> }
>
> Right after restart + health_ok:
>
> osd.101 tcmalloc heap stats:------------------------------------------------
> MALLOC:     1173996280 ( 1119.6 MiB) Bytes in use by application
> MALLOC: +      3727360 (    3.6 MiB) Bytes in page heap freelist
> MALLOC: +     25493688 (   24.3 MiB) Bytes in central cache freelist
> MALLOC: +     17101824 (   16.3 MiB) Bytes in transfer cache freelist
> MALLOC: +     20301904 (   19.4 MiB) Bytes in thread cache freelists
> MALLOC: +      5242880 (    5.0 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =   1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
> MALLOC: +     20488192 (   19.5 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =   1266352128 ( 1207.7 MiB) Virtual address space used
> MALLOC:
> MALLOC:          54160 Spans in use
> MALLOC:             33 Thread heaps in use
> MALLOC:           8192 Tcmalloc page size
> ------------------------------------------------
>
> Am I looking at a memory leak here, or are these heap stats expected?
>
> I don't mind the swap usage; it has no impact. I'm just wondering whether I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx