Hi,
The ceph-nano/container project offers users a way to run almost-"master" code
inside containers.
We consume the latest packages to test the upcoming Ceph release.
Obviously that's not "supported", but it is interesting for testing.
In this example, the container was built with Ceph
14.0.1-1022.gc881d63 RPMs on a CentOS base image.
A user reported an OOM in one of his testing containers
(https://github.com/ceph/cn/issues/94).
Yes, the container is limited to 512 MB of RAM in ceph-nano; that isn't
much, but I wondered to what extent that limit was the root cause of this OOM.
I booted the same container twice and monitored its memory usage
in two contexts (a sketch of the sampling loop follows the list):
- an idle context as a reference
- a constant rados workload to see the impact of IOs on this issue
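For the record, the sampling was along these lines (a minimal sketch, not
the exact script I ran; the container name "ceph-nano" and the 60-second
interval are assumptions):

    #!/usr/bin/env python3
    # Sample the container's memory usage once a minute via "docker stats".
    import subprocess
    import time

    CONTAINER = "ceph-nano"  # hypothetical container name

    while True:
        # "docker stats --no-stream" prints e.g. "273MiB / 512MiB"
        usage = subprocess.check_output(
            ["docker", "stats", "--no-stream", "--format",
             "{{.MemUsage}}", CONTAINER], text=True).strip()
        print(f"{time.strftime('%H:%M:%S')} {usage}", flush=True)
        time.sleep(60)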
The rados load was a simple loop: add the files of the Ceph source tree
as objects, remove them all, and restart the loop.
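Roughly this (a sketch under assumptions: the pool name "mypool" and the
tree path "/usr/src/ceph" are placeholders, not what I actually used):

    #!/usr/bin/env python3
    # Workload loop: put every file of a source tree as a rados object,
    # then remove them all, forever.
    import os
    import subprocess

    POOL = "mypool"          # hypothetical pool name
    TREE = "/usr/src/ceph"   # hypothetical checkout of the Ceph tree

    def rados(*args):
        subprocess.check_call(["rados", "-p", POOL, *args])

    while True:
        names = []
        for root, _, files in os.walk(TREE):
            for f in files:
                path = os.path.join(root, f)
                name = path.replace("/", "_")  # flatten path into an object name
                rados("put", name, path)       # rados put <obj-name> <infile>
                names.append(name)
        for name in names:
            rados("rm", name)                  # rados rm <obj-name>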
I plotted the result (you can download it at http://pubz.free.fr/leak.png).
It appears that:
- the idle cluster reached the 504 MB memory limit in 782 minutes; that
is a 231 MB memory increase in 782 minutes, or 17.7 MB/hour
- the working cluster reached the 500 MB memory limit (when the OOM
killer killed ceph-mon) in 692 minutes; that is a 229 MB memory
increase in 692 minutes, or 19.85 MB/hour
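For reference, the MB/hour figures are just the measured deltas scaled to
an hour:

    # leak-rate arithmetic behind the two figures above
    for label, delta_mb, minutes in (("idle", 231, 782), ("rados load", 229, 692)):
        print(f"{label}: {delta_mb / minutes * 60:.1f} MB/hour")
    # -> idle: 17.7 MB/hour
    # -> rados load: 19.9 MB/hour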
It really looks like we leak a lot of memory in this version, and as
the container has very little memory, that drives it to an OOM state and
kills it.
Have any of you seen something similar?
My next step is to monitor every Ceph process during such a run to see
which process is growing too fast.
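Probably something along these lines (a rough sketch; the 60-second
interval and the ceph-* name match are assumptions):

    #!/usr/bin/env python3
    # Sample the RSS of every ceph-* process from /proc once a minute.
    import os
    import time

    def ceph_rss_kb():
        """Return {pid: (comm, VmRSS in kB)} for every ceph-* process."""
        out = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    comm = f.read().strip()
                if not comm.startswith("ceph-"):
                    continue
                with open(f"/proc/{pid}/status") as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            out[pid] = (comm, int(line.split()[1]))
                            break
            except OSError:
                continue  # process exited while we were reading it
        return out

    while True:
        stamp = time.strftime("%H:%M:%S")
        for pid, (comm, rss) in sorted(ceph_rss_kb().items(),
                                       key=lambda kv: int(kv[0])):
            print(f"{stamp} {comm}[{pid}] {rss} kB", flush=True)
        time.sleep(60)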
Erwan,