Hi,
The ceph-nano/container project offers users a way to run almost-"master" code
inside containers.
We consume the latest packages to test the upcoming Ceph release.
Obviously that's not "supported", but it is interesting for testing.
In this example, the container was built with Ceph
14.0.1-1022.gc881d63 RPMs on a CentOS base image.
A user reported an OOM in one of his testing containers
(https://github.com/ceph/cn/issues/94).
Yes, the container is limited to 512 MB of RAM in ceph-nano; that isn't
much, but I wondered to what extent that limit was the root cause of this OOM.
I booted the same container twice and monitored its memory usage
in two contexts (a sketch of the sampling loop follows the list):
- an idle context as a reference
- a constant rados workload to see the impact of IOs on this issue
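For the record, the sampling was along these lines (a minimal sketch, not
the exact script I ran; the container name "ceph-nano" and the 60-second
interval are assumptions):

    #!/usr/bin/env python3
    # Sample the container's memory usage once a minute via "docker stats".
    import subprocess
    import time

    CONTAINER = "ceph-nano"  # hypothetical container name

    while True:
        # "docker stats --no-stream" prints e.g. "273MiB / 512MiB"
        usage = subprocess.check_output(
            ["docker", "stats", "--no-stream", "--format",
             "{{.MemUsage}}", CONTAINER], text=True).strip()
        print(f"{time.strftime('%H:%M:%S')} {usage}", flush=True)
        time.sleep(60)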
The rados load was a simple loop: add the files of the Ceph source tree
as objects, remove them all, and restart the loop.
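Roughly this (a sketch under assumptions: the pool name "mypool" and the
tree path "/usr/src/ceph" are placeholders, not what I actually used):

    #!/usr/bin/env python3
    # Workload loop: put every file of a source tree as a rados object,
    # then remove them all, forever.
    import os
    import subprocess

    POOL = "mypool"          # hypothetical pool name
    TREE = "/usr/src/ceph"   # hypothetical checkout of the Ceph tree

    def rados(*args):
        subprocess.check_call(["rados", "-p", POOL, *args])

    while True:
        names = []
        for root, _, files in os.walk(TREE):
            for f in files:
                path = os.path.join(root, f)
                name = path.replace("/", "_")  # flatten path into an object name
                rados("put", name, path)       # rados put <obj-name> <infile>
                names.append(name)
        for name in names:
            rados("rm", name)                  # rados rm <obj-name>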
I plotted the result (you can download it at http://pubz.free.fr/leak.png).
It appears that:
- the idle cluster reached the 504 MB memory limit in 782 minutes; that
is a 231 MB memory increase in 782 minutes, or 17.7 MB/hour
- the working cluster reached the 500 MB memory limit (when the OOM
killer killed ceph-mon) in 692 minutes; that is a 229 MB memory
increase in 692 minutes, or 19.85 MB/hour
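For reference, the MB/hour figures are just the measured deltas scaled to
an hour:

    # leak-rate arithmetic behind the two figures above
    for label, delta_mb, minutes in (("idle", 231, 782), ("rados load", 229, 692)):
        print(f"{label}: {delta_mb / minutes * 60:.1f} MB/hour")
    # -> idle: 17.7 MB/hour
    # -> rados load: 19.9 MB/hour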
It really looks like we leak a lot of memory in this version, and as
the container has very little memory, that drives it to an OOM state and
kills it.
Have any of you seen something similar?
My next step is to monitor every Ceph process during such a run to see
which process is growing too fast.
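Probably something along these lines (a rough sketch; the 60-second
interval and the ceph-* name match are assumptions):

    #!/usr/bin/env python3
    # Sample the RSS of every ceph-* process from /proc once a minute.
    import os
    import time

    def ceph_rss_kb():
        """Return {pid: (comm, VmRSS in kB)} for every ceph-* process."""
        out = {}
        for pid in filter(str.isdigit, os.listdir("/proc")):
            try:
                with open(f"/proc/{pid}/comm") as f:
                    comm = f.read().strip()
                if not comm.startswith("ceph-"):
                    continue
                with open(f"/proc/{pid}/status") as f:
                    for line in f:
                        if line.startswith("VmRSS:"):
                            out[pid] = (comm, int(line.split()[1]))
                            break
            except OSError:
                continue  # process exited while we were reading it
        return out

    while True:
        stamp = time.strftime("%H:%M:%S")
        for pid, (comm, rss) in sorted(ceph_rss_kb().items(),
                                       key=lambda kv: int(kv[0])):
            print(f"{stamp} {comm}[{pid}] {rss} kB", flush=True)
        time.sleep(60)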
Erwan,