Hi all,
We have run into serious trouble with our cluster: a running cluster started failing, setting off a chain reaction until the Ceph cluster was down, with about half of the OSDs down (in an EC pool).
Each host has 8 OSDs of 8 TB each (i.e. a RAID 0 of two 4 TB disks) for an EC pool (10+3, across 14 hosts), plus 2 cache OSDs and 32 GB of RAM.
The reason we use RAID 0 across pairs of disks is that we tried 16 separate disks per host before, but 32 GB of RAM didn't seem enough to keep the cluster stable.
We don't know for sure what triggered the chain reaction, but what we do see is that our OSDs use a lot of memory while recovering. We've seen some OSDs using almost 8 GB of RAM (resident; 11 GB virtual).
So right now we don't have enough memory to recover the cluster, because the OSDs get killed by the OOM killer before they can finish recovering. And I don't know whether doubling our memory would be enough.
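To put numbers on that worry, here is a back-of-the-envelope sketch (assuming every OSD on a host can hit the ~8 GB resident peak at the same time):

# Per-host memory budget during recovery, using the numbers above.
# Assumption: all OSDs on one host peak at ~8 GB resident simultaneously.
osds_per_host = 8 + 2        # 8 EC OSDs + 2 cache OSDs
normal_gb_per_osd = 2        # what we used to see in normal operation
peak_gb_per_osd = 8          # resident peak we now see during recovery
ram_gb = 32

print("normal:", osds_per_host * normal_gb_per_osd, "GB of", ram_gb, "GB")   # 20 GB -> fits
print("recovery:", osds_per_host * peak_gb_per_osd, "GB of", ram_gb, "GB")   # 80 GB -> over, even at 64 GB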
A few questions:
* Has anyone seen this before?
* 2 GB per OSD was still normal, but 8 GB seems like a lot; is this expected behaviour?
* We didn't see this with a nearly empty cluster. Now it is filled to about a quarter (270 TB). I guess this will get worse when it is half full or more?
* How high can this memory usage get? Can we calculate the maximum memory usage of an OSD? Can we limit it? (See the sketch after these questions for what we are considering.)
* We could upgrade/reinstall to Infernalis; would that solve anything?
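Regarding limiting the memory usage: this is the kind of recovery throttling we are thinking of trying to keep memory down while the cluster heals. A sketch only, assuming the stock ceph CLI with admin caps and that osd_max_backfills / osd_recovery_max_active behave the same on our (Hammer-era) release:

# Sketch: throttle recovery/backfill on all running OSDs to ease memory pressure.
# Assumes the 'ceph' CLI is installed and a client keyring with admin caps.
import subprocess

subprocess.check_call([
    "ceph", "tell", "osd.*", "injectargs",
    "--osd_max_backfills 1 --osd_recovery_max_active 1",
])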
This is related to a previous post of mine:
http://permalink.gmane.org/gmane.comp.file-systems.ceph.user/22259
Thank you very much!!
Kenneth