You may remember our recent issue. I believe it is being made worse, if not caused, by another problem we have encountered.
Every few days our nodes are being fenced (not simultaneously) because corosync consumes vast amounts of memory (i.e. 100% of the box). Please see the sample log message [1] that appears when this happens; we have several just like it. Note that it is not always corosync that gets killed, but it is clearly corosync eating all the memory (see the top output from three servers at various times since their last reboot [2] [3] [4]).
The corosync version is 1.2.3:
[g@cluster1 ~]$ corosync -v
Corosync Cluster Engine, version '1.2.3'
Copyright (c) 2006-2009 Red Hat, Inc.
We had a bit of a dig around and found a significant number of bugfix updates that address various segfaults, crashes, memory leaks etc. in this minor version as well as in subsequent minor versions [5] [6].
We're trialling the Fedora 14 (fc14) RPMs for corosync and corosynclib (v1.4.2) to see whether they fix the particular issue we are seeing (i.e. whether or not the memory keeps spiralling out of control).
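To judge whether the newer packages actually stop the growth, it may help to log corosync's resident memory at intervals rather than rely on occasional top snapshots. Below is a minimal Python sketch of that idea, not something we have run in production. It reads the standard Name and VmRSS fields from /proc/<pid>/status; the names parse_status, sample and log_forever, and the 60-second interval, are just our own illustrative choices.

```python
import os
import time


def parse_status(status_text):
    """Extract the process name and resident set size (kB) from the
    text of /proc/<pid>/status. Either value is None if not present."""
    name, rss_kb = None, None
    for line in status_text.splitlines():
        if line.startswith("Name:"):
            parts = line.split(None, 1)
            name = parts[1].strip() if len(parts) > 1 else ""
        elif line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB"
            rss_kb = int(line.split()[1])
    return name, rss_kb


def sample(target="corosync", proc_root="/proc"):
    """Return a list of (pid, rss_kb) for every process named `target`."""
    results = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as f:
                name, rss_kb = parse_status(f.read())
        except (IOError, OSError):
            continue  # process exited while we were scanning
        if name == target and rss_kb is not None:
            results.append((int(entry), rss_kb))
    return results


def log_forever(interval=60):
    """Print one timestamped line per corosync process every `interval` seconds.
    The interval is an arbitrary choice for illustration."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for pid, rss_kb in sample():
            print("%s pid=%d rss_kb=%d" % (stamp, pid, rss_kb))
        time.sleep(interval)
```

Calling log_forever() on each node and redirecting the output to a file should give a per-node growth curve that can be compared before and after the upgrade.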
Has anyone else seen an issue like this, and is there any known way to debug or fix it? If we can assist debugging by providing further information, please tell us what you need (and, if it is not obvious, how to get it).
Thanks again for your help,
Chris
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster