Re: OSD memory leak?

On 7/14/20 8:12 AM, Frank Schilder wrote:
Dear Mark,

thanks for the info. I forgot a few answers:

THPs are disabled (set to "never"). The kernel almost certainly doesn't reclaim because there is not enough pressure yet.

We have 268 OSDs. I would not consider this much. We plan to triple that soonish. In the past, the minimum recommendation was 1GB RAM per HDD bluestore OSD. I'm actually not really happy that this has been quadrupled for not really convincing reasons. Compared with other storage systems, the increase in minimum requirements really starts making Ceph expensive.

1GB of process memory for a single HDD-backed bluestore OSD was never the recommendation.  Prior to the memory autotuning we had the overall bluestore cache size set to 1GB for HDDs, but that does not mean that a bluestore OSD could ever consistently fit in a 1GB memory envelope.  It would be 1GB of cache plus whatever else the OSD needed to run.  In the filestore days we did (usually!) use less memory because we used the global page cache more than dedicated per-daemon caches.  That works well for slower disks but isn't necessarily a great model if you've got a box full of NVMe drives and/or want to have more control over what gets cached and how.  Generally the trend (and not just for Ceph) is to get the kernel out of the way and shard everything as much as possible.  The page cache doesn't really fit well with that model as storage keeps getting faster and faster.  Having said all of that, it wasn't necessarily uncommon for filestore OSDs to use 1-2GB of RAM either, especially during recovery and if the pglog is full.  The big difference is that now we are actually working toward trying to keep the OSD (and other daemons) within a certain memory boundary, and historically we didn't do that at all.


We have set the OSDs to use the bitmap allocator. Is the fact that we get tcmalloc stats a contradiction to this?


No, the bitmap allocator won't prevent you from getting tcmalloc stats.  tcmalloc controls memory allocations, while the bitmap allocator controls disk allocations.
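If you want to double-check which allocator an OSD is actually running
with, something along these lines should show it (osd.256 is just the
example OSD from your mail; this assumes the option is exposed via
ceph config and the admin socket in your release):

# ceph config get osd.256 bluestore_allocator
# ceph daemon osd.256 config show | grep allocator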



I did not consider upgrading from mimic, because a lot of people report stability issues that might be caused by a regression in the message queueing. There was a longer e-mail about clusters from nautilus and higher collapsing under trivial amounts of rebalancing, pool deletion and other admin tasks. Before I consider upgrading, I want to test this on a lab cluster we plan to set up soon.

I will look at the memory profiling. If one can use this on a production system, I will give it a go.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 14 July 2020 14:48:36
To: Frank Schilder; ceph-users@xxxxxxx
Subject: Re:  Re: OSD memory leak?

Hi Frank,


These might help:


https://docs.ceph.com/docs/master/rados/troubleshooting/memory-profiling/

https://gperftools.github.io/gperftools/heapprofile.html

https://gperftools.github.io/gperftools/heap_checker.html
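If you end up running the profiler, the usual sequence from the
memory-profiling doc is roughly the one below (osd.256 and the log
path are placeholders; dump files land in the daemon's log directory,
and pprof may be packaged as google-pprof on some distros):

# ceph tell osd.256 heap start_profiler
  (let the OSD run under load for a while)
# ceph tell osd.256 heap dump
# ceph tell osd.256 heap stop_profiler
# pprof --text /usr/bin/ceph-osd /var/log/ceph/osd.256.profile.0001.heap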


Regarding the mempools, they don't track all of the memory usage in
Ceph, only things that were allocated using mempools.  There are many
other things (rocksdb block cache for example) that don't use them.
It's only giving you a partial picture of memory usage.  In your example
below, that byte value from the thread cache freelist looks very wrong
(it's just short of 2^64, so it looks like an unsigned counter that
wrapped around).
Ignoring that for a moment though, there's a ton of memory that's been
unmapped and released to the OS, but hasn't been reclaimed by the
kernel.  That's either because the kernel doesn't have enough memory
pressure to bother reclaiming it, or because it's all fragmented chunks
of a huge page that the kernel can't fully reclaim.  That tells me you
should definitely be looking at the transparent huge page (THP)
configuration on your nodes.  Looking back at batrick's PR that disables
THP for Ceph, it looks like we only backported it to nautilus but not
mimic.  On that topic, have you considered upgrading to Nautilus?
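Coming back to the THP point: checking the current setting on a node is
just a matter of reading the standard sysfs knob, and switching it to
madvise for a test is a one-liner (not persistent across reboots; use
your distro's usual mechanism to make it permanent):

# cat /sys/kernel/mm/transparent_hugepage/enabled
# echo madvise > /sys/kernel/mm/transparent_hugepage/enabled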


Mark


On 7/14/20 2:56 AM, Frank Schilder wrote:
Dear Mark,

thanks for the quick answer. I would try the memory profiler if I could find any documentation on it. In fact, I just guessed the "heap stats" command and have a hard time finding anything on the OSD daemon commands. Could you possibly point me to something? Also, how do I interpret the mempools? Is it correct to say that out of the memory_target only the mempools total is actually used and the remaining memory is lost due to leaks?

For example, for OSD 256 I get the stats below after just 2 months uptime. Am I looking at a 5.5GB memory leak here?

# ceph config get osd.256 osd_memory_target
8589934592

# ceph daemon osd.256 heap stats
osd.256 tcmalloc heap stats:------------------------------------------------
MALLOC:     7216067616 ( 6881.8 MiB) Bytes in use by application
MALLOC: +       229376 (    0.2 MiB) Bytes in page heap freelist
MALLOC: +   1222913888 ( 1166.3 MiB) Bytes in central cache freelist
MALLOC: +       278016 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: + 18446744073692937856 (17592186044400.2 MiB) Bytes in thread cache freelists
MALLOC: +     52166656 (   49.8 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   8475041792 ( 8082.4 MiB) Actual memory used (physical + swap)
MALLOC: +   2010464256 ( 1917.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  10485506048 ( 9999.8 MiB) Virtual address space used
MALLOC:
MALLOC:         765182              Spans in use
MALLOC:             48              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------

# ceph daemon osd.256 dump_mempools
{
      "mempool": {
          "by_pool": {
              "bloom_filter": {
                  "items": 0,
                  "bytes": 0
              },
              "bluestore_alloc": {
                  "items": 2300682,
                  "bytes": 18405456
              },
              "bluestore_cache_data": {
                  "items": 52390,
                  "bytes": 306843648
              },
              "bluestore_cache_onode": {
                  "items": 256153,
                  "bytes": 145494904
              },
              "bluestore_cache_other": {
                  "items": 92199353,
                  "bytes": 656620069
              },
              "bluestore_fsck": {
                  "items": 0,
                  "bytes": 0
              },
              "bluestore_txc": {
                  "items": 4,
                  "bytes": 2752
              },
              "bluestore_writing_deferred": {
                  "items": 122,
                  "bytes": 1864924
              },
              "bluestore_writing": {
                  "items": 3673,
                  "bytes": 18440192
              },
              "bluefs": {
                  "items": 11867,
                  "bytes": 220504
              },
              "buffer_anon": {
                  "items": 353734,
                  "bytes": 1180837372
              },
              "buffer_meta": {
                  "items": 91646,
                  "bytes": 5865344
              },
              "osd": {
                  "items": 134,
                  "bytes": 1557616
              },
              "osd_mapbl": {
                  "items": 84,
                  "bytes": 8479562
              },
              "osd_pglog": {
                  "items": 487004,
                  "bytes": 166094788
              },
              "osdmap": {
                  "items": 117697,
                  "bytes": 2080280
              },
              "osdmap_mapping": {
                  "items": 0,
                  "bytes": 0
              },
              "pgmap": {
                  "items": 0,
                  "bytes": 0
              },
              "mds_co": {
                  "items": 0,
                  "bytes": 0
              },
              "unittest_1": {
                  "items": 0,
                  "bytes": 0
              },
              "unittest_2": {
                  "items": 0,
                  "bytes": 0
              }
          },
          "total": {
              "items": 95874543,
              "bytes": 2512807411
          }
      }
}

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Mark Nelson <mnelson@xxxxxxxxxx>
Sent: 13 July 2020 15:39:50
To: ceph-users@xxxxxxx
Subject:  Re: OSD memory leak?

Hi Frank,


So the osd_memory_target code will basically shrink the size of the
bluestore and rocksdb caches to attempt to keep the overall mapped (not
rss!) memory of the process below the target.  It's sort of "best
effort" in that it can't guarantee the process will fit within a given
target; it will just (assuming we are over target) shrink the caches
down to some minimum value and that's it.  2GB per OSD is a pretty
ambitious target; it's the lowest osd_memory_target we recommend
setting.  I'm a little surprised the OSD is consuming this much memory
with a 2GB target, though.
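If you decide to give the OSDs more headroom, the target can be checked
and raised via ceph config; the 4GB value and the device-class mask
below are just examples (the mask form assumes your release supports
config masks):

# ceph config get osd.101 osd_memory_target
# ceph config set osd/class:hdd osd_memory_target 4294967296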

Looking at your mempool dump I see very little memory allocated to the
caches.  In fact the majority is taken up by osdmap (looks like you have
a decent number of OSDs) and pglog.  That indicates that the memory
autotuning is probably working but simply can't do anything more to
help.  Something else is taking up the memory. Figure you've got a
little shy of 500MB for the mempools.  RocksDB will take up more (and
potentially quite a bit more if you have memtables backing up waiting to
be flushed to L0) and potentially some other things in the OSD itself
that could take up memory.  If you feel comfortable experimenting, you
could try changing the rocksdb WAL/memtable settings.  By default we
have up to 4 256MB WAL buffers.  Instead you could try something like 2
64MB buffers, but be aware this could cause slow performance or even
temporary write stalls if you have fast storage.  Still, this would only
give you up to ~0.9GB back.  Since you are on mimic, you might also want
to check what your kernel's transparent huge pages configuration is.  I
don't remember if we backported Patrick's fix to always avoid THP for
ceph processes.  If your kernel is set to "always", you might consider
trying it with "madvise".
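On the WAL/memtable experiment above, a possible way to try it (a
sketch only): write_buffer_size and max_write_buffer_number are the
relevant RocksDB settings embedded in bluestore_rocksdb_options, so you
would copy your release's full default string, substitute just those
two fields, and restart the OSDs.  For the 2 x 64MB example that would
mean something like:

[osd]
# hypothetical: 2 x 64MB memtables instead of the default 4 x 256MB;
# keep all other fields of the default bluestore_rocksdb_options string
bluestore_rocksdb_options = <default string with write_buffer_size=67108864,max_write_buffer_number=2>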

Alternatively, have you tried the built-in tcmalloc heap profiler?  You
might be able to get a better sense of where memory is being used with
that as well.


Mark


On 7/13/20 7:07 AM, Frank Schilder wrote:
Hi all,

on a mimic 13.2.8 cluster I observe a gradual increase of memory usage by OSD daemons, in particular under heavy load. For our spinners I use osd_memory_target=2G. The daemons overrun the 2G in virt size rather quickly and grow to something like 4G virtual. The real memory consumption stays more or less around the 2G of the target. There are some overshoots, but these go down again during periods with less load.

What I observe now is that the actual memory consumption slowly grows and OSDs start using more than 2G virtual memory. I see this as slowly growing swap usage despite having more RAM available (swappiness=10). This indicates allocated but unused memory, or memory not accessed for a long time, usually a leak. Here are some heap stats:

Before restart:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC:     3438940768 ( 3279.6 MiB) Bytes in use by application
MALLOC: +      5611520 (    5.4 MiB) Bytes in page heap freelist
MALLOC: +    257307352 (  245.4 MiB) Bytes in central cache freelist
MALLOC: +       357376 (    0.3 MiB) Bytes in transfer cache freelist
MALLOC: +      6727368 (    6.4 MiB) Bytes in thread cache freelists
MALLOC: +     25559040 (   24.4 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   3734503424 ( 3561.5 MiB) Actual memory used (physical + swap)
MALLOC: +    575946752 (  549.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   4310450176 ( 4110.8 MiB) Virtual address space used
MALLOC:
MALLOC:         382884              Spans in use
MALLOC:             35              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
# ceph daemon osd.101 dump_mempools
{
       "mempool": {
           "by_pool": {
               "bloom_filter": {
                   "items": 0,
                   "bytes": 0
               },
               "bluestore_alloc": {
                   "items": 4691828,
                   "bytes": 37534624
               },
               "bluestore_cache_data": {
                   "items": 0,
                   "bytes": 0
               },
               "bluestore_cache_onode": {
                   "items": 51,
                   "bytes": 28968
               },
               "bluestore_cache_other": {
                   "items": 5761276,
                   "bytes": 46292425
               },
               "bluestore_fsck": {
                   "items": 0,
                   "bytes": 0
               },
               "bluestore_txc": {
                   "items": 67,
                   "bytes": 46096
               },
               "bluestore_writing_deferred": {
                   "items": 208,
                   "bytes": 26037057
               },
               "bluestore_writing": {
                   "items": 52,
                   "bytes": 6789398
               },
               "bluefs": {
                   "items": 9478,
                   "bytes": 183720
               },
               "buffer_anon": {
                   "items": 291450,
                   "bytes": 28093473
               },
               "buffer_meta": {
                   "items": 546,
                   "bytes": 34944
               },
               "osd": {
                   "items": 98,
                   "bytes": 1139152
               },
               "osd_mapbl": {
                   "items": 78,
                   "bytes": 8204276
               },
               "osd_pglog": {
                   "items": 341944,
                   "bytes": 120607952
               },
               "osdmap": {
                   "items": 10687217,
                   "bytes": 186830528
               },
               "osdmap_mapping": {
                   "items": 0,
                   "bytes": 0
               },
               "pgmap": {
                   "items": 0,
                   "bytes": 0
               },
               "mds_co": {
                   "items": 0,
                   "bytes": 0
               },
               "unittest_1": {
                   "items": 0,
                   "bytes": 0
               },
               "unittest_2": {
                   "items": 0,
                   "bytes": 0
               }
           },
           "total": {
               "items": 21784293,
               "bytes": 461822613
           }
       }
}

Right after restart + health_ok:
osd.101 tcmalloc heap stats:------------------------------------------------
MALLOC:     1173996280 ( 1119.6 MiB) Bytes in use by application
MALLOC: +      3727360 (    3.6 MiB) Bytes in page heap freelist
MALLOC: +     25493688 (   24.3 MiB) Bytes in central cache freelist
MALLOC: +     17101824 (   16.3 MiB) Bytes in transfer cache freelist
MALLOC: +     20301904 (   19.4 MiB) Bytes in thread cache freelists
MALLOC: +      5242880 (    5.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   1245863936 ( 1188.1 MiB) Actual memory used (physical + swap)
MALLOC: +     20488192 (   19.5 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   1266352128 ( 1207.7 MiB) Virtual address space used
MALLOC:
MALLOC:          54160              Spans in use
MALLOC:             33              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------

Am I looking at a memory leak here or are these heap stats expected?

I don't mind the swap usage; it doesn't have any impact. I'm just wondering if I need to restart OSDs regularly. The "leakage" above occurred within only 2 months.
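For reference, one way to see the per-process swap and RSS figures I'm
referring to is the standard /proc interface; <pid> stands for the PID
of an OSD daemon:

# ps -C ceph-osd -o pid,rss,vsz,args
# grep -e VmRSS -e VmSwap /proc/<pid>/status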

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx




