Re: Possible memory leak in Ceph 14.0.1-1022.gc881d63

On 11/29/18 11:50 AM, Gregory Farnum wrote:
I think this was discussed some on IRC, but I wasn't online at the
time, so to keep some updates on the list...

On Thu, Nov 29, 2018 at 6:59 AM Erwan Velu <evelu@xxxxxxxxxx> wrote:


Hi Erwan,


Out of curiosity, did you look at the mempool stats at all?  It's
pretty likely you'll run out of memory with 512MB given our current
defaults, and the memory autotuner won't be able to keep up (it will
do its best, but can't work miracles).


As per the ceph-nano project, the cluster is very simple, with the
following configuration:

    cluster:
      id:     b2eafdd3-ec87-4107-afaf-521980bb3d9e
      health: HEALTH_OK

    services:
      mon: 1 daemons, quorum ceph-nano-travis-faa32aebf00b (age 2m)
      mgr: ceph-nano-travis-faa32aebf00b(active, since 2m)
      osd: 1 osds: 1 up (since 2m), 1 in (since 2m)
      rgw: 1 daemon active

    data:
      pools:   5 pools, 40 pgs
      objects: 174 objects, 1.6 KiB
      usage:   1.0 GiB used, 9.0 GiB / 10 GiB avail
      pgs:     40 active+clean


I'll save the mempool stats over time to see what is growing in the
idle case.
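
For reference, one minimal way to capture those stats periodically (a
sketch only, assuming the single OSD is osd.0 as in the status output
above, and that the admin socket is reachable from inside the
container):

    # Sample the OSD mempools every 5 minutes via the admin socket.
    while true; do
        date >> mempools.log
        ceph daemon osd.0 dump_mempools >> mempools.log
        sleep 300
    done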


[...]

In any event, when I've tested OSDs with that little memory there have
been fairly dramatic performance impacts in a variety of ways,
depending on what you change.  In practice, the minimum amount of
memory we can reasonably work with right now is probably around
1.5-2GB, and we do a lot better with 3-4GB+.

In the ceph-nano context, we don't really target performance.

The issue here is that ceph-nano doesn't target performance, but Ceph,
while we're trying to add more auto-tuning, really doesn't do a good
job of that itself. So since you haven't made any attempt to configure
the Ceph daemons to use less memory, they're going to run the way they
always do, assuming something like 2GB+ per OSD, 512+MB per monitor,
plus more for the manager (do we have numbers on that?) and rgw
(highly variable).
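
For illustration only, a minimal sketch of explicitly lowering the OSD
target rather than relying on the defaults; the value is a placeholder,
not a recommendation, and as noted above anything near the practical
floor will hurt performance:

    # Assumes the Nautilus-era "ceph config" interface and the
    # osd_memory_target option; the value is purely illustrative.
    ceph config set osd osd_memory_target 1610612736   # ~1.5 GiB

Capping the OSD alone won't bound the whole container, since the mon,
mgr, and rgw daemons would still need their own tuning.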

And all the data you've shown us says "a newly-deployed cluster will
use more memory while it spins itself up to steady-state". If you run
it in a larger container, apply a workload to it, and can show that a
specific daemon is continuing to use more memory on an ongoing basis,
that might be concerning. But from what you've said so far, it's just
spinning things up.

Cutting down what Mark said a bit:

- The OSD is going to keep using more memory for every write until it
  hits its per-PG limits. (RGW is probably prompting some cluster writes
  even while idle, which is polluting your "idle" data a bit, but that's
  part of why you're seeing an active cluster use more.)
- The OSD is going to do more internal bookkeeping with BlueStore until
  it hits its limits.
- The OSD is going to keep reporting PG stats to the monitor/manager
  even if they don't change, and tracking those will use more memory on
  the monitor and manager until they hit their limits. (I imagine that's
  a big part of the "idle" cluster usage increase as well.)
-Greg
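
For completeness: the "per-PG limits" above are largely the PG log
length settings. One quick, hedged way to see what the OSD is actually
running with, and how much memory the PG logs are holding, is to ask
the daemon itself (option and mempool names assumed from upstream
Ceph):

    ceph daemon osd.0 config show | grep -E 'pg_log_entries|osd_memory_target'
    ceph daemon osd.0 dump_mempools | grep -A 3 osd_pglog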


Also, just based on the two heap stat reports on IRC, the increase in
memory appeared to be primarily additional tcmalloc thread cache slowly
accumulating over time. I forgot to mention in the earlier email that
we also tell tcmalloc by default to use up to 128MB of thread cache,
again as a performance consideration. This was absolutely a big deal in
the simple messenger days, but it may still be relevant with BlueStore,
at least given some wallclock profiles we've seen.
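
For anyone who wants to experiment with that, the cap comes from
tcmalloc's environment variable, which the Ceph packages set in the
daemons' environment file; the exact path is distro-dependent, so treat
this as a sketch:

    # /etc/sysconfig/ceph (RPM-based distros; typically /etc/default/ceph
    # on Debian/Ubuntu). 134217728 bytes = 128 MiB, the default mentioned
    # above; lowering it trades some allocator performance for a smaller
    # resident footprint.
    TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728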

Mark


