Re: Possible memory leak in Ceph 14.0.1-1022.gc881d63


 



On 11/28/18 3:06 PM, Erwan Velu wrote:
Hi,


The ceph-nano/container project offers users a way to run almost-"master" code in containers.

We consume the latest packages to test the future release of Ceph. Obviously that's not "supported", but it is interesting for testing.

In this example, the container has been built with Ceph 14.0.1-1022.gc881d63 RPMs on a CentOS container.

A user reported an OOM in one of his testing containers (https://github.com/ceph/cn/issues/94).

Yes, the container is limited to 512MB of RAM in ceph-nano; that isn't much, but I wondered to what extent that limit was the root cause of this OOM.


I booted the same container twice and decided to monitor its memory usage in two contexts:

- an idle context as a reference

- a constant rados workload to see the impact of IOs on this issue


The rados load was a simple loop: add the Ceph source tree as objects, remove them, and restart the loop.
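
Roughly something like this sketch (not the exact script; the "rbd" pool name and the /ceph tree path here are placeholders):

#!/usr/bin/env python3
# Sketch of the workload loop described above: push every file from the
# Ceph source tree into a pool as objects, delete them, and loop forever.
# Assumptions: a pool named "rbd" exists and the `rados` CLI is in $PATH.
import os
import subprocess

POOL = "rbd"      # placeholder pool name
TREE = "/ceph"    # placeholder checkout of the Ceph source tree

def all_files(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

while True:
    objects = []
    for path in all_files(TREE):
        obj = path.lstrip("/").replace("/", "_")
        # `rados put <obj> <file>` uploads one file as one object.
        subprocess.run(["rados", "-p", POOL, "put", obj, path], check=False)
        objects.append(obj)
    for obj in objects:
        subprocess.run(["rados", "-p", POOL, "rm", obj], check=False)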

I plotted the results (you can download the graph at http://pubz.free.fr/leak.png).


It appears that:

- the idle cluster reached 504 MB of memory in 782 minutes. That's a 231 MB memory increase in 782 minutes, i.e. 17.7 MB/hour

- the working cluster reached 500 MB of memory (at which point the OOM killer killed ceph-mon) in 692 minutes. That's a 229 MB memory increase in 692 minutes, i.e. 19.85 MB/hour
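
(Those hourly rates are just the memory delta divided by the elapsed time, modulo rounding:)

# How the growth rates above were derived: MB gained / elapsed hours.
for label, mb, minutes in (("idle", 231, 782), ("rados load", 229, 692)):
    print(f"{label}: {mb / (minutes / 60):.2f} MB/hour")
# idle: 17.72 MB/hour
# rados load: 19.86 MB/hour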


It really looks like we are leaking a lot of memory in this version, and since the container is very limited in memory, that pushes it into an OOM state and kills it.

Has any of you seen something similar?


My next step is to monitor every Ceph process during that time to see which process is growing too fast.
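
Something along these lines should do it (just a sketch; it assumes psutil is installed in the container, and the sampling interval is arbitrary):

#!/usr/bin/env python3
# Sketch: sample the RSS of every ceph-* process once a minute so we can
# see which daemon is growing. Assumes psutil is available in the container.
import time
import psutil

while True:
    for proc in psutil.process_iter(["name", "memory_info"]):
        name = proc.info["name"] or ""
        if name.startswith("ceph"):
            rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
            print(f"{time.time():.0f} {name} pid={proc.pid} rss={rss_mb:.1f}MB")
    time.sleep(60)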


Erwan,

Hi Erwan,


Out of curiosity, did you look at the mempool stats at all? It's pretty likely you'll run out of memory with 512MB given our current defaults, and the memory autotuner won't be able to keep up (it will do its best, but it can't work miracles).
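
If you want to check, the OSD admin socket exposes those counters via the "dump_mempools" command. A quick sketch for summarizing them (the exact JSON key layout may vary between builds, so treat the key names below as an assumption):

#!/usr/bin/env python3
# Sketch: dump per-pool mempool byte counts for osd.0 via the admin socket.
# Assumes `ceph daemon osd.0 dump_mempools` works inside the container and
# that the output has a mempool -> by_pool -> {pool: {items, bytes}} layout.
import json
import subprocess

out = subprocess.check_output(["ceph", "daemon", "osd.0", "dump_mempools"])
pools = json.loads(out)["mempool"]["by_pool"]

for name, stats in sorted(pools.items(), key=lambda kv: kv[1]["bytes"], reverse=True):
    mb = stats["bytes"] / (1024 * 1024)
    print(f"{name:30s} {mb:8.1f} MB  ({stats['items']} items)")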

To start out, the number of pglog entries is going to grow as time goes on. I don't know how many PGs you have, but figure that osd_min_pg_log_entries = 3000, and say you have 100 PGs per OSD:

3000 * 100 = 300,000 entries * <size of pg log entry> bytes of memory usage.

Now figure that each pg log entry is going to store the name of the object, so if you have objects with short names that footprint could be small, say:

100B * 300,000 =~ 30MB + overhead

But if you had objects with large names:

1KB * 300,000 =~ 300MB + overhead
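
As a quick back-of-the-envelope script using those same numbers:

# Back-of-the-envelope pg log memory estimate from the figures above.
pg_log_entries = 3000          # osd_min_pg_log_entries
pgs_per_osd = 100              # assumed PG count per OSD
for name_bytes in (100, 1024): # short vs. long object names
    total = pg_log_entries * pgs_per_osd * name_bytes
    print(f"{name_bytes}B names -> ~{total / (1024 * 1024):.0f} MB + overhead")
# 100B names -> ~29 MB + overhead
# 1024B names -> ~293 MB + overhead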

That's not the only thing that will grow. By default in master, the minimum autotuned sizes of the bluestore onode cache, buffer cache, and rocksdb block cache are set to 128MB each. That doesn't mean they will all necessarily be full at any given time, but it's quite likely that over time you'll have some data in each of them.

Another potential user of memory is the rocksdb WAL buffers (i.e. memtables). By default we currently fill up a 256MB WAL buffer; once it's full, it starts compacting into L0 while new writes go to up to 3 additional 256MB buffers before any further writes are stalled. In practice it doesn't look like we often have all 4 full, and the autotuner will adjust the caches down if filling them would exceed the OSD memory target, but it's another factor to consider.

In practice, if I set osd_memory_target to 1GB, the autotuner will make the OSD RSS bounce between roughly 800MB and 1.3GB and constantly wrestle to keep things under control. It would probably do a bit better with a lower osd_min_pg_log_entries value (500? 1000?), lower minimum cache sizes (maybe 64MB), and a lower WAL buffer size/count (say 2 x 64MB buffers). That might be enough to support 1GB OSDs, but I'm not sure it would be enough to keep the OSD consistently under 512MB.
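
Putting the pieces above together into a deliberately pessimistic budget (the autotuner will shrink the caches and the WAL buffers are rarely all full, so the real number should be lower, but it shows why 512MB is so tight):

# Rough worst-case budget from the numbers in this thread (defaults in master).
MB = 1024 * 1024
budget = {
    "pg log (3000 entries * 100 PGs, ~100B names)": 30 * MB,
    "bluestore onode cache (autotune minimum)":     128 * MB,
    "bluestore buffer cache (autotune minimum)":    128 * MB,
    "rocksdb block cache (autotune minimum)":       128 * MB,
    "rocksdb WAL buffers (up to 4 * 256MB)":        4 * 256 * MB,
}
total = sum(budget.values())
print(f"worst case: ~{total / MB:.0f} MB")                      # ~1438 MB
for target in (512, 1024):
    print(f"vs {target} MB target: over by {total / MB - target:.0f} MB")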

In any event, when I've tested OSDs with that little memory there have been fairly dramatic performance impacts in a variety of ways, depending on what you change. In practice the minimum amount of memory we can reasonably work with right now is probably around 1.5-2GB, and we do a lot better with 3-4GB+.

Mark







