Re: Possible memory leak in Ceph 14.0.1-1022.gc881d63

On 11/29/18 1:13 PM, Gregory Farnum wrote:
On Thu, Nov 29, 2018 at 10:10 AM Mark Nelson <mnelson@xxxxxxxxxx> wrote:

"Working Well" is probably our current default 4GB osd_memory_target
value. ;)

Realistically we can probably get away with 2GB and do "ok", though
we'll have a lot more cache misses over time.  Anything below 1.5GB is
likely going to require additional tweaks to pull off, and it's probably
going to result in significant slowdowns in various ways until we
re-architect things like pglog, our threading model, omap cache, etc.
It's not clear to me what kind of impact we would see during recovery
either (beyond the tendency to backfill due to shorter pglog lengths).
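If you want to double-check what a given OSD is actually running with,
the admin socket will tell you; a quick sketch, assuming the daemon is
osd.0 and you're running this somewhere its socket is reachable:

ceph daemon osd.0 config get osd_memory_target

The autotuner sizes the caches against whatever value that reports.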

Well the good news is that for this use case you should just turn off
your "performance!" instincts because we almost completely Don't Care.
;) Behaving badly during recovery doesn't matter for ceph-nano because
there are no replicas to recover from anyway. If it can handle a
developer typing S3 PUT requests into a command line, it satisfies the
project's needs.

But, it does need to go through the effort of identifying and setting
the limits or else this will happen.
-Greg

Well, in that case go nuts. ;)

environment variable for OSD:

TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=33554432

ceph.conf:

# 512 MiB overall target, 256 MiB assumed base footprint, 32 MiB minimum cache
osd_memory_target = 536870912
osd_memory_base = 268435456
osd_memory_cache_min = 33554432
# grow/shrink the autotuned caches in 8 MiB chunks
bluestore_cache_autotune_chunk_size = 8388608
# keep the pglog as short as possible
osd_min_pg_log_entries = 10
osd_max_pg_log_entries = 10
osd_pg_log_dups_tracked = 10
osd_pg_log_trim_min = 10
# small rocksdb write buffers (8 MiB), no compression, 2 MiB compaction readahead
bluestore_rocksdb_options = compression=kNoCompression,max_write_buffer_number=2,min_write_buffer_number_to_merge=1,recycle_log_file_num=2,write_buffer_size=8388608,writable_file_max_buffer_size=0,compaction_readahead_size=2097152

and maybe screw around with various other throttles and limits, though I'm not sure how much it would matter if there are no replicas. If it's 1 OSD you could use 8 PGs or so for the pool too. ;)
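If you want new pools to come up that small from the start, something
along these lines in ceph.conf should do it (untested sketch; the pools
ceph-nano/rgw create afterwards would pick these defaults up at creation
time):

osd_pool_default_pg_num = 8
osd_pool_default_pgp_num = 8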




Mark

On 11/29/18 12:01 PM, Brett Niver wrote:
If we're thinking about a laptop user, we probably do want to try to
have a config that works well out of the box.  Just like vstart :-)
Wouldn't setting some limits allow us to better determine if indeed
we're leaking memory?


On Thu, Nov 29, 2018 at 12:52 PM Gregory Farnum <gfarnum@xxxxxxxxxx> wrote:
I think this was discussed some on IRC, but I wasn't online at the
time, and to keep some updates on the list...

On Thu, Nov 29, 2018 at 6:59 AM Erwan Velu <evelu@xxxxxxxxxx> wrote:

Hi Erwan,


Out of curiosity, did you look at the mempool stats at all?  It's
pretty likely you'll run out of memory with 512MB given our current
defaults, and the memory autotuner won't be able to keep up (it will do
its best, but can't work miracles).

As is typical for the ceph-nano project, the cluster is very simple,
with the following configuration:

     cluster:
       id:     b2eafdd3-ec87-4107-afaf-521980bb3d9e
       health: HEALTH_OK

     services:
       mon: 1 daemons, quorum ceph-nano-travis-faa32aebf00b (age 2m)
       mgr: ceph-nano-travis-faa32aebf00b(active, since 2m)
       osd: 1 osds: 1 up (since 2m), 1 in (since 2m)
       rgw: 1 daemon active

     data:
       pools:   5 pools, 40 pgs
       objects: 174 objects, 1.6 KiB
       usage:   1.0 GiB used, 9.0 GiB / 10 GiB avail
       pgs:     40 active+clean


I'll save the mempool stats over time to see what is growing in the idle
case.
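Something like this should be enough to capture them periodically (rough
sketch, assuming the osd.0 admin socket is reachable from inside the
ceph-nano container):

while sleep 60; do
    ceph daemon osd.0 dump_mempools >> mempool-stats.log
done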

[...]
In any event, when I've tested OSDs with that little memory there have
been fairly dramatic performance impacts in a variety of ways depending
on what you change.  In practice the minimum amount of memory we can
reasonably work with right now is probably around 1.5-2GB, and we do a
lot better with 3-4GB+.
In the ceph-nano context, we don't really target performance.
The issue here is that while ceph-nano doesn't target performance, Ceph,
even as we're trying to add more auto-tuning, really doesn't do a good
job of limiting its own memory use. So since you haven't made any
attempt to configure the Ceph daemons to use less memory, they're going
to run the way they always do, assuming something like 2GB+ per OSD,
512+MB per monitor, plus more for the manager (do we have numbers on
that?) and rgw (highly variable).

And all the data you've shown us says "a newly-deployed cluster will
use more memory while it spins itself up to steady-state". If you run
it in a larger container, apply a workload to it, and can show that a
specific daemon is continuing to use more memory on an ongoing basis,
that might be concerning. But from what you've said so far, it's just
spinning things up.
Cutting down what Mark said a bit:
- The OSD is going to keep using more memory for every write until it
  hits its per-PG limits. (RGW is probably prompting some cluster writes
  even while idle, which is polluting your "idle" data a bit, but that's
  part of why you're seeing an active cluster use more.)
- The OSD is going to do more internal bookkeeping with BlueStore until
  it hits its limits.
- The OSD is going to keep reporting PG stats to the monitor/manager
  even if they don't change, and tracking those will use more memory on
  the monitor and manager until they hit their limits. (I imagine that's
  a big part of the "idle" cluster usage increase as well.)
-Greg


