Re: Possible memory leak in Ceph 14.0.1-1022.gc881d63

I think this was discussed some on IRC, but I wasn't online at the
time, so to keep the list updated...

On Thu, Nov 29, 2018 at 6:59 AM Erwan Velu <evelu@xxxxxxxxxx> wrote:
>
>
> > Hi Erwan,
> >
> >
> > Out of curiosity, did you look at the mempool stats at all?  It's
> > pretty likely you'll run out of memory with 512MB given our current
> > defaults, and the memory autotuner won't be able to keep up (it will
> > do its best, but can't work miracles).
>
>
> As per the ceph-nano project, the cluster is very simple, with the
> following configuration:
>
>    cluster:
>      id:     b2eafdd3-ec87-4107-afaf-521980bb3d9e
>      health: HEALTH_OK
>
>    services:
>      mon: 1 daemons, quorum ceph-nano-travis-faa32aebf00b (age 2m)
>      mgr: ceph-nano-travis-faa32aebf00b(active, since 2m)
>      osd: 1 osds: 1 up (since 2m), 1 in (since 2m)
>      rgw: 1 daemon active
>
>    data:
>      pools:   5 pools, 40 pgs
>      objects: 174 objects, 1.6 KiB
>      usage:   1.0 GiB used, 9.0 GiB / 10 GiB avail
>      pgs:     40 active+clean
>
>
> I'll save the mempool stats over time to see what is growing in the idle
> case.
>
> >
> > [...]
>
> > In any event, when I've tested OSDs with that little memory there have
> > been fairly dramatic performance impacts in a variety of ways
> > depending on what you change.  In practice, the minimum amount of
> > memory we can reasonably work with right now is probably around
> > 1.5-2GB, and we do a lot better with 3-4GB+.
> In the ceph-nano context, we don't really target performance.

The issue here is that ceph-nano doesn't target performance, but Ceph
itself, while we're trying to add more auto-tuning, still doesn't do a
good job of adapting to a small memory footprint on its own. Since you
haven't configured the Ceph daemons to use less memory, they're going
to run the way they always do, assuming something like 2GB+ per OSD,
512+MB per monitor, plus more for the manager (do we have numbers on
that?) and rgw (highly variable).
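
If you do want to squeeze them down, the OSD side at least has a knob
for it. A minimal sketch of the sort of override I mean (the 1 GiB
value here is just an illustration for a ceph-nano-sized container,
not a tested recommendation):

    # ceph.conf
    [osd]
    osd_memory_target = 1073741824

    # or at runtime:
    # ceph config set osd osd_memory_target 1073741824

The autotuner will try to keep the OSD near that target by shrinking
the BlueStore caches, but as Mark said, below roughly 1.5-2GB it can't
work miracles.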

And all the data you've shown us says "a newly-deployed cluster will
use more memory while it spins itself up to steady-state". If you run
it in a larger container, apply a workload to it, and can show that a
specific daemon is continuing to use more memory on an ongoing basis,
that might be concerning. But from what you've said so far, it's just
spinning things up.
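
If you do want to check that, the easiest thing is probably to log
per-daemon memory alongside the mempool stats once the cluster has
settled and see whether anything keeps climbing. A rough sketch, run
inside the container (osd.0 and the log path are assumptions about
your setup):

    while true; do
        date >> /tmp/ceph-mem.log
        # resident set size of each Ceph daemon
        ps -o pid,rss,cmd -C ceph-mon,ceph-mgr,ceph-osd,radosgw >> /tmp/ceph-mem.log
        # per-mempool usage for the OSD, via its admin socket
        ceph daemon osd.0 dump_mempools >> /tmp/ceph-mem.log
        sleep 60
    done
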
Cutting down what Mark said a bit:

- The OSD is going to keep using more memory for every write until it
  hits its per-PG limits. (RGW is probably prompting some cluster
  writes even while idle, which is polluting your "idle" data a bit,
  but that's part of why you're seeing an active cluster use more.)
- The OSD is going to do more internal bookkeeping with BlueStore
  until it hits its limits.
- The OSD is going to keep reporting PG stats to the monitor/manager
  even if they don't change, and tracking those will use more memory
  on the monitor and manager until they hit their limits. (I imagine
  that's a big part of the "idle" cluster usage increase as well.)
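
If you want to watch that last bit, the tcmalloc heap stats are
probably the easiest numbers to compare over time (this assumes the
daemons in your container are built with tcmalloc, which the upstream
packages are as far as I know; the mon name below is the one from your
status output):

    ceph tell mon.ceph-nano-travis-faa32aebf00b heap stats
    ceph tell osd.0 heap stats

If those flatten out once the cluster settles, you're looking at
spin-up, not a leak.
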
-Greg


