On 11/28/18 3:06 PM, Erwan Velu wrote:
Hi,
The ceph-nano/container project offers users almost-"master" code
running in containers.
We consume the latest packages to test the upcoming release of Ceph.
Obviously that's not "supported", but it is interesting for testing.
In this example, the container was built with Ceph
14.0.1-1022.gc881d63 RPMs on a CentOS base image.
A user reported an OOM in one of his test containers
(https://github.com/ceph/cn/issues/94).
Yes, the container is limited to 512MB of RAM in ceph-nano; that isn't
much, but I wondered to what extent that limit was the root cause of this OOM.
I booted the same container twice and monitored its memory usage in two
contexts:
- an idle context, as a reference
- a constant rados workload, to see the impact of IOs on this issue
The rados load was a simple loop that added the Ceph source tree as
objects, removed them, and restarted the loop.
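Something along these lines (just a sketch with the python-rados bindings; the pool name and the Ceph tree path are placeholders, not the exact values used):

# Hypothetical sketch of the workload: write every file of a local Ceph
# checkout as an object, delete them all, and repeat.
import os
import rados

POOL = "mytest"            # assumed pool name
TREE = "/path/to/ceph"     # assumed path to the Ceph source tree

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    while True:
        names = []
        for root, _, files in os.walk(TREE):
            for fname in files:
                path = os.path.join(root, fname)
                with open(path, "rb") as fh:
                    ioctx.write_full(path, fh.read())
                names.append(path)
        for name in names:
            ioctx.remove_object(name)
finally:
    ioctx.close()
    cluster.shutdown()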
I plotted the result (you can download it at http://pubz.free.fr/leak.png)
It appears that:
- the idle cluster reached the 504MB memory limit in 782 minutes; that's
a 231MB memory increase over 782 minutes, or 17.7MB/hour
- the working cluster reached the 500MB memory limit (when the OOM
killed ceph-mon) in 692 minutes; that's a 229MB memory increase over 692
minutes, or 19.85MB/hour
That really looks like we leak a lot of memory in this version, and since
the container is very limited in memory, it gets pushed to an OOM state and dies.
Has any of you seen something similar?
My next step is to monitor every Ceph process during that time to see
which process is growing too fast.
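Something like this quick psutil-based sampler should be enough (psutil and the one-minute interval are arbitrary choices):

# Sample the RSS of every ceph-* process once a minute and print it.
import time
import psutil

INTERVAL = 60  # seconds between samples (arbitrary)

while True:
    stamp = time.strftime("%H:%M:%S")
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        name = proc.info["name"] or ""
        mem = proc.info["memory_info"]
        if name.startswith("ceph") and mem is not None:
            print(f"{stamp} {name}[{proc.info['pid']}] "
                  f"rss={mem.rss / 2**20:.1f}MB")
    time.sleep(INTERVAL)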
Erwan,
Hi Erwan,
Out of curiosity, did you look at the mempool stats at all? It's pretty
likely you'll run out of memory with 512MB given our current defaults,
and the memory autotuner won't be able to keep up (it will do its best,
but it can't work miracles).
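For reference, a rough way to pull them from the admin socket (the daemon name osd.0 is just an example, and the JSON layout may differ a bit between builds):

# Query the OSD's admin socket for its mempool breakdown and print the
# largest consumers first.
import json
import subprocess

out = subprocess.run(["ceph", "daemon", "osd.0", "dump_mempools"],
                     capture_output=True, text=True, check=True).stdout
data = json.loads(out)
# Recent builds nest the per-pool stats under "mempool" -> "by_pool";
# fall back to the top level if this one doesn't.
pools = data.get("mempool", {}).get("by_pool") or data

def used_bytes(stats):
    return stats.get("bytes", 0) if isinstance(stats, dict) else 0

for name, stats in sorted(pools.items(), key=lambda kv: used_bytes(kv[1]),
                          reverse=True):
    if isinstance(stats, dict) and "bytes" in stats:
        print(f"{name:32s} {stats['bytes'] / 2**20:8.1f} MB "
              f"({stats['items']} items)")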
To start out, the number of pg log entries is going to grow as time goes
on. I don't know how many PGs you have, but figure that
osd_min_pg_log_entries = 3000, and if you have, say, 100 PGs / OSD:
3000 * 100 = 300,000 entries * <size of pg log entry> bytes of memory usage.
Now figure that each pg log entry is going to store the name of the
object, so if you have objects with short names it could be small, say:
100B * 300,000 =~ 30MB + overhead
But if you have objects with large names:
1KB * 300,000 =~ 300MB + overhead
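The same back-of-the-envelope math, written out (all of these numbers are illustrative, not measured):

# osd_min_pg_log_entries is 3000 by default; assume 100 PGs per OSD.
entries = 3000 * 100                      # 300,000 pg log entries

for name_len in (100, 1024):              # short vs. long object names (bytes)
    approx_mb = entries * name_len / 2**20
    print(f"~{approx_mb:.0f} MB + overhead with {name_len}-byte object names")
# Prints roughly the 30MB and 300MB figures above.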
That's not the only thing that will grow. By default in master the
minimum autotuned sizes of the bluestore onode cache, buffer cache, and
rocksdb block cache are set to 128MB each. That doesn't mean they will
all necessarily be full at any given time, but it's quite likely that
over time you'll have some data in each of those.
Another potential consumer of memory is the rocksdb WAL buffers (i.e.
memtables). By default we currently fill up a 256MB WAL buffer; once
it's full, it starts being compacted into L0 while new writes go to up
to 3 additional 256MB buffers before any further writes are stalled. In
practice it doesn't look like we often have all 4 full, and the autotuner
will adjust the caches down if filling them would exceed the OSD memory
target, but it's another factor to consider.
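Tallying up those floors shows why 512MB is tight; note this is a worst-case upper bound rather than what you'd typically see, since the WAL buffers are rarely all full and the autotuner shrinks the caches:

# Rough worst-case tally of the fixed floors discussed above.
MB = 2**20
cache_floors = 3 * 128 * MB   # onode, buffer and rocksdb block cache minimums
wal_buffers = 4 * 256 * MB    # worst case: one active + three extra 256MB memtables
pg_log = 30 * MB              # the optimistic pg log estimate from above

print(f"worst case ~{(cache_floors + wal_buffers + pg_log) / MB:.0f} MB "
      "before heap overhead")   # ~1438 MB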
In practice, if I set osd_memory_target to 1GB, the autotuner will make
the OSD RSS memory bounce between roughly 800MB and 1.3GB and constantly
wrestle to keep things under control. It would probably do a bit better
with a lower osd_min_pg_log_entries value (500? 1000?), lower minimum
cache sizes (maybe 64MB), and a lower WAL buffer size/count (say two
64MB buffers). That might be enough to support 1GB OSDs, but I'm not
sure it would be enough to keep the OSD consistently under 512MB.
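If you wanted to experiment, a rough sketch of applying that kind of tuning via the centralized config database might look like this (the option names are real, but the values are just the ballpark figures above, not a recommendation for any particular cluster):

# Apply ballpark overrides from this thread; values are illustrative only.
import subprocess

overrides = {
    "osd_memory_target": str(1 * 1024**3),      # 1GB autotuner target
    "osd_min_pg_log_entries": "1000",           # keep fewer pg log entries
    "osd_memory_cache_min": str(64 * 1024**2),  # lower overall cache floor
}

for opt, val in overrides.items():
    subprocess.run(["ceph", "config", "set", "osd", opt, val], check=True)

# The WAL buffer size/count live in bluestore_rocksdb_options
# (write_buffer_size / max_write_buffer_number) and aren't touched here.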
In any event, when I've tested OSDs with that little memory there have
been fairly dramatic performance impacts in a variety of ways depending
on what you change. In practice the minimum amount of memory we can
reasonably work with right now is probably around 1.5-2GB, and we do a
lot better with 3-4GB+.
Mark