Re: Possible memory leak in Ceph 14.0.1-1022.gc881d63

On Thu, Nov 29, 2018 at 10:10 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:
>
> "Working Well" is probably our current default 4GB osd_memory_target
> value. ;)
>
> Realistically we can probably get away with 2GB and do "ok", though
> we'll have a lot more cache misses over time.  Anything below 1.5GB is
> likely going to require additional tweaks to pull off, and it's probably
> going to result in significant slowdowns in various ways until we
> re-architect things like pglog, our threading model, omap cache, etc.
> It's not clear to me what kind of impact during recovery we would see
> either (beyond the tendency to backfill due to shorter pglog lengths).
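
For reference, the knob Mark describes can be checked and lowered through
the monitor config database; a minimal sketch, assuming the single OSD is
osd.0 and using an illustrative 1.5 GiB value:

    # show the current target (4 GiB by default, per Mark above)
    ceph config get osd.0 osd_memory_target

    # lower it for the nano use case; 1610612736 (1.5 GiB) is only illustrative
    ceph config set osd.0 osd_memory_target 1610612736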

Well the good news is that for this use case you should just turn off
your "performance!" instincts because we almost completely Don't Care.
;) Behaving badly during recovery doesn't matter for ceph-nano because
there are no replicas to recover from anyway. If it can handle a
developer typing s3 PUT requests into a command line, it satisfies the
project's needs.

But it does need to go through the effort of identifying and setting
those limits, or else this will happen.
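
A minimal sketch of what that could look like in ceph.conf for a
single-OSD nano cluster; the option names are real, but the values are
rough guesses and would need testing at this scale:

    [osd]
    # cap the autotuner well below the 4 GiB default (value is illustrative)
    osd memory target = 1073741824
    # keep the pglog short; saves memory at the cost of more backfill
    osd min pg log entries = 500
    osd max pg log entries = 500

Whatever corresponding limits the mon, mgr, and rgw need is exactly the
"identifying" part of the work.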
-Greg

>
> Mark
>
> On 11/29/18 12:01 PM, Brett Niver wrote:
> > If we're thinking about a laptop user - we probably do want to try and
> > have a config that works well out of the box.  Just like vstart :-)
> > Wouldn't setting some limits allow us to better determine if indeed
> > we're leaking memory?
> >
> >
> > On Thu, Nov 29, 2018 at 12:52 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
> >> I think this was discussed some on irc but I wasn't online at the
> >> time, and to keep some updates on the list...
> >>
> >> On Thu, Nov 29, 2018 at 6:59 AM Erwan Velu <evelu@xxxxxxxxxx> wrote:
> >>>
> >>>> Hi Erwan,
> >>>>
> >>>>
> >>>> Out of curiosity did you look at the mempool stats at all?  It's
> >>>> pretty likely you'll run out of memory with 512MB given our current
> >>>> defaults and the memory autotuner won't be able to keep up (it will do
> >>>> its best, but can't work miracles).
> >>>
> >>> As per the ceph-nano project, the cluster is very simple with the
> >>> following configuration:
> >>>
> >>>     cluster:
> >>>       id:     b2eafdd3-ec87-4107-afaf-521980bb3d9e
> >>>       health: HEALTH_OK
> >>>
> >>>     services:
> >>>       mon: 1 daemons, quorum ceph-nano-travis-faa32aebf00b (age 2m)
> >>>       mgr: ceph-nano-travis-faa32aebf00b(active, since 2m)
> >>>       osd: 1 osds: 1 up (since 2m), 1 in (since 2m)
> >>>       rgw: 1 daemon active
> >>>
> >>>     data:
> >>>       pools:   5 pools, 40 pgs
> >>>       objects: 174 objects, 1.6 KiB
> >>>       usage:   1.0 GiB used, 9.0 GiB / 10 GiB avail
> >>>       pgs:     40 active+clean
> >>>
> >>>
> >>> I'll save the mempool stats over time to see what is growing in the idle
> >>> case.
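
One rough way to collect that, assuming the osd.0 admin socket is
reachable from inside the container:

    # append a timestamped mempool dump every minute
    while true; do
        date -u +%FT%TZ >> mempools.log
        ceph daemon osd.0 dump_mempools >> mempools.log
        sleep 60
    done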
> >>>
> >>>> [...]
> >>>> In any event, when I've tested OSDs with that little memory there's
> >>>> been fairly dramatic performance impacts in a variety of ways
> >>>> depending on what you change.  In practice the minimum amount of
> >>>> memory we can reasonably work with right now is probably around
> >>>> 1.5-2GB, and we do a lot better with 3-4GB+.
> >>> In the ceph-nano context, we don't really target performance.
> >> The issue here is that ceph-nano doesn't target performance, but Ceph,
> >> while we're trying to add more auto-tuning, really doesn't do a good
> >> job of that itself. So since you haven't made any attempt to configure
> >> the Ceph daemons to use less memory, they're going to run the way they
> >> always do, assuming something like 2GB+ per OSD, 512MB+ per monitor,
> >> plus more for the manager (do we have numbers on that?) and rgw
> >> (highly variable).
> >>
> >> And all the data you've shown us says "a newly-deployed cluster will
> >> use more memory while it spins itself up to steady-state". If you run
> >> it in a larger container, apply a workload to it, and can show that a
> >> specific daemon is continuing to use more memory on an ongoing basis,
> >> that might be concerning. But from what you've said so far, it's just
> >> spinning things up.
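
As a sketch of that kind of test (an already-configured s3cmd pointing at
the nano RGW endpoint, the bucket name, and the payload file are all
assumptions about the setup):

    # push a batch of small objects at the RGW, then spot-check OSD memory;
    # ./payload is any small local file
    s3cmd mb s3://memtest
    for i in $(seq 1 1000); do
        s3cmd put ./payload s3://memtest/obj-$i
    done
    # repeat between batches; sustained growth after steady-state is the red flag
    ps -o rss= -C ceph-osd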
> >> Cutting down what Mark said a bit:
> >> the OSD is going to keep using more memory for every write until it
> >> hits its per-PG limits. (RGW is probably prompting some cluster writes
> >> even while idle, which is polluting your "idle" data a bit, but that's
> >> part of why you're seeing an active cluster use more.)
> >> The OSD is going to do more internal bookkeeping with BlueStore until
> >> it hits its limits.
> >> The OSD is going to keep reporting PG stats to the monitor/manager
> >> even if they don't change, and tracking those will use more memory on
> >> the monitor and manager until they hit their limits. (I imagine that's
> >> a big part of the "idle" cluster usage increase as well.)
> >> -Greg
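
To see whether the OSD-side sources Greg lists actually plateau, the
relevant pools can be pulled out of the mempool dump; a sketch, with the
caveat that the jq paths are a guess at this build's dump_mempools layout
and may need adjusting:

    # osd_pglog and bluestore_cache_data are the pools the first two points
    # map to; watch them level off as the limits are reached
    ceph daemon osd.0 dump_mempools | \
        jq '{pglog: .mempool.by_pool.osd_pglog.bytes,
             bluestore: .mempool.by_pool.bluestore_cache_data.bytes}'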


