On 11/28/18 3:06 PM, Erwan Velu wrote:
Hi,
The ceph-nano/container project offers users almost-"master" code
running in containers.
We consume the latest packages to test the upcoming release of Ceph.
Obviously that's not "supported", but it is interesting for testing.
In this example, the container was built with Ceph
14.0.1-1022.gc881d63 RPMs on a CentOS base image.
A user reported an OOM in one of his test containers
(https://github.com/ceph/cn/issues/94).
Yes, the container is limited to 512MB of RAM in ceph-nano; that isn't
much, but I wondered to what extent that limit was the root cause of this OOM.
I booted the same container twice and monitored its memory usage in two
contexts:
- an idle context, as a reference
- a constant rados workload, to see the impact of IOs on this issue
The rados load was a simple loop that added the Ceph source tree as
objects, removed them, and restarted the loop.
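Something along these lines (just a sketch with the python-rados bindings; the pool name and the Ceph tree path are placeholders, not the exact values used):

# Hypothetical sketch of the workload: write every file of a local Ceph
# checkout as an object, delete them all, and repeat.
import os
import rados

POOL = "mytest"            # assumed pool name
TREE = "/path/to/ceph"     # assumed path to the Ceph source tree

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
ioctx = cluster.open_ioctx(POOL)
try:
    while True:
        names = []
        for root, _, files in os.walk(TREE):
            for fname in files:
                path = os.path.join(root, fname)
                with open(path, "rb") as fh:
                    ioctx.write_full(path, fh.read())
                names.append(path)
        for name in names:
            ioctx.remove_object(name)
finally:
    ioctx.close()
    cluster.shutdown()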
I plotted the result (you can download it at http://pubz.free.fr/leak.png)
It appears that:
- the idle cluster reached the 504MB memory limit in 782 minutes; that's
a 231MB memory increase over 782 minutes, or 17.7MB/hour
- the working cluster reached the 500MB memory limit (when the OOM
killed ceph-mon) in 692 minutes; that's a 229MB memory increase over 692
minutes, or 19.85MB/hour
That really looks like we leak a lot of memory in this version, and since
the container is very limited in memory, it gets pushed to an OOM state and dies.
Has any of you seen something similar?
My next step is to monitor every Ceph process during that time to see
which process is growing too fast.
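Something like this quick psutil-based sampler should be enough (psutil and the one-minute interval are arbitrary choices):

# Sample the RSS of every ceph-* process once a minute and print it.
import time
import psutil

INTERVAL = 60  # seconds between samples (arbitrary)

while True:
    stamp = time.strftime("%H:%M:%S")
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        name = proc.info["name"] or ""
        mem = proc.info["memory_info"]
        if name.startswith("ceph") and mem is not None:
            print(f"{stamp} {name}[{proc.info['pid']}] "
                  f"rss={mem.rss / 2**20:.1f}MB")
    time.sleep(INTERVAL)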
Erwan,
Hi Erwan,
Out of curiosity, did you look at the mempool stats at all? It's pretty
likely you'll run out of memory with 512MB given our current defaults,
and the memory autotuner won't be able to keep up (it will do its best,
but it can't work miracles).
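For reference, a rough way to pull them from the admin socket (the daemon name osd.0 is just an example, and the JSON layout may differ a bit between builds):

# Query the OSD's admin socket for its mempool breakdown and print the
# largest consumers first.
import json
import subprocess

out = subprocess.run(["ceph", "daemon", "osd.0", "dump_mempools"],
                     capture_output=True, text=True, check=True).stdout
data = json.loads(out)
# Recent builds nest the per-pool stats under "mempool" -> "by_pool";
# fall back to the top level if this one doesn't.
pools = data.get("mempool", {}).get("by_pool") or data

def used_bytes(stats):
    return stats.get("bytes", 0) if isinstance(stats, dict) else 0

for name, stats in sorted(pools.items(), key=lambda kv: used_bytes(kv[1]),
                          reverse=True):
    if isinstance(stats, dict) and "bytes" in stats:
        print(f"{name:32s} {stats['bytes'] / 2**20:8.1f} MB "
              f"({stats['items']} items)")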
To start out, the number of pg log entries is going to grow as time goes
on. I don't know how many PGs you have, but figure that
osd_min_pg_log_entries = 3000, and if you have, say, 100 PGs / OSD:
3000 * 100 = 300,000 entries * <size of pg log entry> bytes of memory usage.
Now figure that each pg log entry is going to store the name of the
object, so if you have objects with short names it could be small, say:
100B * 300,000 =~ 30MB + overhead
But if you have objects with large names:
1KB * 300,000 =~ 300MB + overhead
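The same back-of-the-envelope math, written out (all of these numbers are illustrative, not measured):

# osd_min_pg_log_entries is 3000 by default; assume 100 PGs per OSD.
entries = 3000 * 100                      # 300,000 pg log entries

for name_len in (100, 1024):              # short vs. long object names (bytes)
    approx_mb = entries * name_len / 2**20
    print(f"~{approx_mb:.0f} MB + overhead with {name_len}-byte object names")
# Prints roughly the 30MB and 300MB figures above.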
That's not the only thing that will grow. By default in master the
minimum autotuned sizes of the bluestore onode cache, buffer cache, and
rocksdb block cache are set to 128MB each. That doesn't mean they will
all necessarily be full at any given time, but it's quite likely that
over time you'll have some data in each of those.
Another potential consumer of memory is the rocksdb WAL buffers (i.e.
memtables). By default we currently fill up a 256MB WAL buffer; once
it's full, it starts being compacted into L0 while new writes go to up
to 3 additional 256MB buffers before any further writes are stalled. In
practice it doesn't look like we often have all 4 full, and the autotuner
will adjust the caches down if filling them would exceed the OSD memory
target, but it's another factor to consider.
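Tallying up those floors shows why 512MB is tight; note this is a worst-case upper bound rather than what you'd typically see, since the WAL buffers are rarely all full and the autotuner shrinks the caches:

# Rough worst-case tally of the fixed floors discussed above.
MB = 2**20
cache_floors = 3 * 128 * MB   # onode, buffer and rocksdb block cache minimums
wal_buffers = 4 * 256 * MB    # worst case: one active + three extra 256MB memtables
pg_log = 30 * MB              # the optimistic pg log estimate from above

print(f"worst case ~{(cache_floors + wal_buffers + pg_log) / MB:.0f} MB "
      "before heap overhead")   # ~1438 MB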
In practice, if I set osd_memory_target to 1GB, the autotuner will make
the OSD RSS memory bounce between roughly 800MB and 1.3GB and constantly
wrestle to keep things under control. It would probably do a bit better
with a lower osd_min_pg_log_entries value (500? 1000?), lower minimum
cache sizes (maybe 64MB), and a lower WAL buffer size/count (say two
64MB buffers). That might be enough to support 1GB OSDs, but I'm not
sure it would be enough to keep the OSD consistently under 512MB.
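If you wanted to experiment, a rough sketch of applying that kind of tuning via the centralized config database might look like this (the option names are real, but the values are just the ballpark figures above, not a recommendation for any particular cluster):

# Apply ballpark overrides from this thread; values are illustrative only.
import subprocess

overrides = {
    "osd_memory_target": str(1 * 1024**3),      # 1GB autotuner target
    "osd_min_pg_log_entries": "1000",           # keep fewer pg log entries
    "osd_memory_cache_min": str(64 * 1024**2),  # lower overall cache floor
}

for opt, val in overrides.items():
    subprocess.run(["ceph", "config", "set", "osd", opt, val], check=True)

# The WAL buffer size/count live in bluestore_rocksdb_options
# (write_buffer_size / max_write_buffer_number) and aren't touched here.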
In any event, when I've tested OSDs with that little memory there have
been fairly dramatic performance impacts in a variety of ways depending
on what you change. In practice the minimum amount of memory we can
reasonably work with right now is probably around 1.5-2GB, and we do a
lot better with 3-4GB+.
Mark