RAM usage only very slowly decreases after cluster recovery

Chad William Seys <cwseys@xxxxxxxxxxxxxxxx> · Thu, 27 Aug 2015 15:59:31 -0500

Hi all,

It appears that OSD daemons only very slowly free RAM after an extended period 
of an unhealthy cluster (shuffling PGs around).

Prior to a power outage (and recovery) around July 25th, the amount of RAM 
used was fairly constant, at most 10GB (out of 24GB).  You can see in the 
attached PNG "osd6_stack2.png" (Week 30) that the amount of used RAM on 
osd06.physics.wisc.edu was holding steady around 7GB.

Around July 25th our Ceph cluster rebooted after a power outage.  Not all 
nodes booted successfully, so Ceph proceeded to shuffle PGs to attempt to return 
to health with the renaming nodes.  You can see in "osd6_stack2.png" two 
purplish spikes showing that the node used around 10GB swap space during the 
recovery period.

Finally the cluster recovered around July 31st.  During that period some I had 
to take some osd daemons out of the pool b/c their nodes ran out of swap space 
and the daemons were killed by the out of memory (OOM) kernel feature.  (The 
recovery period was probably extended by me trying to add the daemons/drives 
back. If I recall correctly that is what was occurring during the second swap 
peak.)

This RAM usage pattern is in generalthe same for all the nodes in the cluster.

Almost three weeks later, the amount of RAM used on the node is still 
decreasing, but it has not returned to pre-power outage levels. 15GB instead 
of 7GB.

Why is Ceph using 2x more RAM than it used to in steady state?

Thanks,
Chad.

(P.S.  It is really unfortunate that Ceph uses more RAM when recovering - can 
lead to cascading failure!)
Attachment:
osd6_stack2.png

Description: PNG image
Attachment:
osd6_line.png

Description: PNG image
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com