Re: OSD memory leak?

On 7/15/20 9:58 AM, Dan van der Ster wrote:
Hi Mark,

On Mon, Jul 13, 2020 at 3:42 PM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
Hi Frank,


So the osd_memory_target code will basically shrink the size of the
bluestore and rocksdb caches to attempt to keep the overall mapped (not
rss!) memory of the process below the target.  It's sort of "best
effort" in that it can't guarantee the process will fit within a given
target; it will just (assuming we are over target) shrink the caches
down to some minimum value and that's it.  2GB per OSD is a pretty
ambitious target.  It's the lowest osd_memory_target we recommend
setting.  I'm a little surprised the OSD is consuming this much memory
with a 2GB target though.
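
To make the "best effort" point concrete, here is a toy model of the behaviour described above. It is not Ceph's actual tuning code: the function name, its parameters, and the 128 MiB cache floor are all hypothetical, and the sketch only covers the over-target case.

```python
GiB = 1024 ** 3
MiB = 1024 ** 2

def next_cache_size(mapped_bytes, target_bytes, cache_bytes, cache_min_bytes):
    """Return the cache size a best-effort tuner would pick next (toy model)."""
    overshoot = mapped_bytes - target_bytes
    if overshoot <= 0:
        # Under target: this sketch simply leaves the caches alone.
        return cache_bytes
    # Over target: give back as much cache as needed, but never go below
    # the floor (hypothetical value supplied by the caller).
    return max(cache_min_bytes, cache_bytes - overshoot)

# A process mapping 3.5 GiB against a 2 GiB target, with only 300 MiB of cache:
new_size = next_cache_size(int(3.5 * GiB), 2 * GiB, 300 * MiB, 128 * MiB)
print(new_size // MiB)  # 128
```

In this example the caches hit their floor while the process is still about 1.3 GiB over target, which is the situation where shrinking caches cannot help any further.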

Looking at your mempool dump I see very little memory allocated to the
caches.  In fact the majority is taken up by osdmap (looks like you have
a decent number of OSDs) and pglog.  That indicates that the memory

Do you know if this high osdmap usage is a known issue?
Our big block storage cluster generates a new osdmap every few seconds
(due to rbd snap trimming) and we see the osdmap mempool usage growing
over a few months until osds start getting OOM killed.

Today we proactively restarted them because the osdmap mempool was
using close to 700MB.
So it seems that whatever is supposed to be trimming is not working.
(This is observed with nautilus 14.2.8, but iirc it was the same
when we were running luminous and mimic.)

Cheers, Dan


Hrm, it hasn't been on my radar, though looking back through the mailing
list there appear to be various reports over the years of high usage
(some of which theoretically have been fixed).  Maybe submit a tracker
issue?  700MB seems quite high for osdmap, but I don't really know the
retention rules, so someone else who knows that code better will have to
chime in.



autotuning is probably working but simply can't do anything more to
help.  Something else is taking up the memory.  Figure you've got a
little shy of 500MB for the mempools.  RocksDB will take up more (and
potentially quite a bit more if you have memtables backing up waiting to
be flushed to L0), and there are potentially some other things in the
OSD itself that could take up memory.  If you feel comfortable
experimenting, you could try changing the rocksdb WAL/memtable settings.
By default we have up to four 256MB WAL buffers.  Instead you could try
something like two 64MB buffers, but be aware this could cause slow
performance or even temporary write stalls if you have fast storage.
Still, this would only give you up to ~0.9GB back.  Since you are on
mimic, you might also want
to check what your kernel's transparent huge pages configuration is.  I
don't remember if we backported Patrick's fix to always avoid THP for
ceph processes.  If your kernel is set to "always", you might consider
trying it with "madvise".
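
As a rough check of the numbers above, the sketch below works out the ceiling on write-buffer memory for both configurations, and shows how the active transparent huge pages mode can be read out of the kernel's bracketed format. The helper names are hypothetical, and the sample THP string is illustrative rather than taken from any particular host.

```python
import re

MiB = 1024 ** 2

def memtable_ceiling(buffer_count, buffer_mib):
    """Upper bound in bytes on memory held by the write buffers."""
    return buffer_count * buffer_mib * MiB

default_ceiling = memtable_ceiling(4, 256)  # defaults quoted above
reduced_ceiling = memtable_ceiling(2, 64)   # suggested experiment
saved_mib = (default_ceiling - reduced_ceiling) // MiB
print(saved_mib)  # 896, i.e. roughly 0.9GB

def active_thp_mode(text):
    """Return the bracketed (active) mode from a THP 'enabled' string."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None

# Sample contents in the format Linux uses for
# /sys/kernel/mm/transparent_hugepage/enabled
print(active_thp_mode("[always] madvise never"))  # always
```

The saving is only an upper bound: it is realized only if all buffers would otherwise have been full at the same time.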

Alternately, have you tried the built-in tcmalloc heap profiler? You
might be able to get a better sense of where memory is being used with
that as well.


Mark


On 7/13/20 7:07 AM, Frank Schilder wrote:
Hi all,

On a mimic 13.2.8 cluster I observe a gradual increase of memory usage by OSD daemons, in particular under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G in virt size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G of the target. There are some overshoots, but these go down again during periods with less load.

What I observe now is that the actual memory consumption slowly grows and OSDs start using more than 2G virtual memory. I see this as slowly growing swap usage despite having more RAM available (swappiness=10). This indicates allocated but unused memory, or memory not accessed for a long time, which usually means a leak. Here are some heap stats:

Before restart:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC:     3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: +      5611520 (    5.4 MiB) Bytes in page heap freelist
MALLOC: +    257307352 (  245.4 MiB) Bytes in central cache freelist
MALLOC: +       357376 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: +      6727368 (    6.4 MiB) Bytes in thread cache freelists
MALLOC: +     25559040 (   24.4 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
MALLOC: +    575946752 (  549.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   4310450176 ( 4110.8 MiB) Virtual address space used
MALLOC:
MALLOC:         382884              Spans in use
MALLOC:             35              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
# ceph daemon osd.101 dump_mempools
{
      "mempool": {
          "by_pool": {
              "bloom_filter": {
                  "items": 0,
                  "bytes": 0
              },
              "bluestore_alloc": {
                  "items": 4691828,
                  "bytes": 37534624
              },
              "bluestore_cache_data": {
                  "items": 0,
                  "bytes": 0
              },
              "bluestore_cache_onode": {
                  "items": 51,
                  "bytes": 28968
              },
              "bluestore_cache_other": {
                  "items": 5761276,
                  "bytes": 46292425
              },
              "bluestore_fsck": {
                  "items": 0,
                  "bytes": 0
              },
              "bluestore_txc": {
                  "items": 67,
                  "bytes": 46096
              },
              "bluestore_writing_deferred": {
                  "items": 208,
                  "bytes": 26037057
              },
              "bluestore_writing": {
                  "items": 52,
                  "bytes": 6789398
              },
              "bluefs": {
                  "items": 9478,
                  "bytes": 183720
              },
              "buffer_anon": {
                  "items": 291450,
                  "bytes": 28093473
              },
              "buffer_meta": {
                  "items": 546,
                  "bytes": 34944
              },
              "osd": {
                  "items": 98,
                  "bytes": 1139152
              },
              "osd_mapbl": {
                  "items": 78,
                  "bytes": 8204276
              },
              "osd_pglog": {
                  "items": 341944,
                  "bytes": 120607952
              },
              "osdmap": {
                  "items": 10687217,
                  "bytes": 186830528
              },
              "osdmap_mapping": {
                  "items": 0,
                  "bytes": 0
              },
              "pgmap": {
                  "items": 0,
                  "bytes": 0
              },
              "mds_co": {
                  "items": 0,
                  "bytes": 0
              },
              "unittest_1": {
                  "items": 0,
                  "bytes": 0
              },
              "unittest_2": {
                  "items": 0,
                  "bytes": 0
              }
          },
          "total": {
              "items": 21784293,
              "bytes": 461822613
          }
      }
}
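
To see at a glance which pools dominate a dump like the one above, the pools can be ranked by bytes. This is a minimal sketch: `rank_pools` is a hypothetical helper, and `SAMPLE` holds only the five largest pools copied from the dump above, together with the dump's own total.

```python
import json

def rank_pools(dump, top=3):
    """Return (name, bytes, share of total) for the largest mempools."""
    pools = dump["mempool"]["by_pool"]
    total = dump["mempool"]["total"]["bytes"]
    ranked = sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True)
    return [(name, p["bytes"], p["bytes"] / total) for name, p in ranked[:top]]

# Trimmed copy of the dump above (largest pools only, original total kept).
SAMPLE = json.loads("""
{
  "mempool": {
    "by_pool": {
      "bluestore_alloc":       {"items": 4691828,  "bytes": 37534624},
      "bluestore_cache_other": {"items": 5761276,  "bytes": 46292425},
      "buffer_anon":           {"items": 291450,   "bytes": 28093473},
      "osd_pglog":             {"items": 341944,   "bytes": 120607952},
      "osdmap":                {"items": 10687217, "bytes": 186830528}
    },
    "total": {"items": 21784293, "bytes": 461822613}
  }
}
""")

for name, nbytes, share in rank_pools(SAMPLE):
    print(f"{name:24s} {nbytes / 2**20:8.1f} MiB  {share:6.1%}")
```

On this data osdmap and osd_pglog together account for about two thirds of the mempool total, while the bluestore cache pools are small.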

Right after restart + health_ok:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC:     1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: +      3727360 (    3.6 MiB) Bytes in page heap freelist
MALLOC: +     25493688 (   24.3 MiB) Bytes in central cache freelist
MALLOC: +     17101824 (   16.3 MiB) Bytes in transfer cache freelist
MALLOC: +     20301904 (   19.4 MiB) Bytes in thread cache freelists
MALLOC: +      5242880 (    5.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: +     20488192 (   19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC:          54160              Spans in use
MALLOC:             33              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
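
The two heap-stats blocks can be compared mechanically. The sketch below is a hypothetical parser for the byte-count lines of this output format: it checks that the components add up to the reported "Actual memory used" and computes how much "Bytes in use by application" differs between the two snapshots. The sample text is copied from the stats above, trimmed to the lines needed.

```python
import re

# Matches lines such as "MALLOC: +  5611520 (  5.4 MiB) Bytes in page heap freelist"
LINE = re.compile(r"^MALLOC:\s*[+=]?\s*(\d+)\s+\(\s*[\d.]+ MiB\)\s+(.+)$")

def parse_heap_stats(text):
    """Map each label to its byte count; lines without a MiB column are skipped."""
    stats = {}
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            stats[m.group(2).strip()] = int(m.group(1))
    return stats

BEFORE = """
MALLOC:     3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: +      5611520 (    5.4 MiB) Bytes in page heap freelist
MALLOC: +    257307352 (  245.4 MiB) Bytes in central cache freelist
MALLOC: +       357376 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: +      6727368 (    6.4 MiB) Bytes in thread cache freelists
MALLOC: +     25559040 (   24.4 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
"""
AFTER = """
MALLOC:     1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: =   1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
"""

before = parse_heap_stats(BEFORE)
after = parse_heap_stats(AFTER)

# The "Bytes ..." components should add up to the reported total.
components = sum(v for k, v in before.items() if k.startswith("Bytes"))
print(components == before["Actual memory used (physical + swap)"])  # True

key = "Bytes in use by application"
growth = before[key] - after[key]
print(f"{growth / 2**20:.1f} MiB")  # 2160.0 MiB
```

Most of the difference sits in "Bytes in use by application" rather than in tcmalloc's freelists, so it is memory the OSD itself still holds, not memory the allocator is merely slow to return.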

Am I looking at a memory leak here or are these heap stats expected?

I don't mind the swap usage; it doesn't have an impact. I'm just wondering if I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
