cluster instability

"Shawn Hood" <shawnlhood@xxxxxxxxx> · Mon, 16 Jun 2008 11:54:36 -0400

All,

This message was originally sent to my office, so the voice may seem a
bit odd.  We have a four-node cluster running RHEL4U6 on Dell PowerEdge
1950s.  Fencing is done via DRAC.

Using packages (from RHN):

cman-kernel-smp-2.6.9-53.13
cman-1.0.17-0.el4_6.5
ccs-1.0.11-1.el4_6.1
fence-1.32.50-2.el4_6.1
lvm2-cluster-2.02.27-2.el4_6.2
dlm-kernel-smp-2.6.9-52.9
dlm-kernheaders-2.6.9-52.9

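Our cluster became unstable on Saturday morning.  Apparently hugin
stopped sending heartbeats, which caused it to be fenced.  hugin was
under heavy load (load average ~10) at the time.  From sar -q (columns
after the timestamp: runq-sz, plist-sz, ldavg-1, ldavg-5, ldavg-15):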

03:30:02 AM         6       453      9.35     10.29     10.51
03:40:01 AM        12       465     11.02     11.00     10.75
03:50:02 AM         3       446      9.75     10.80     10.86
04:00:01 AM         5       430      9.23      9.47     10.07
Average:            7       455     10.19     10.32     10.28

04:09:35 AM       LINUX RESTART
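
For anyone who wants to check the load figures, here is a minimal Python
sketch that parses the rows above and picks out the peak 1-minute load
average.  It assumes the figures are sar -q output; the field names and
the helper parse_sar_q are made up for illustration and are not part of
any tool.

```python
from dataclasses import dataclass

# The four samples quoted in the post, plus the non-data lines around them.
SAMPLE = """\
03:30:02 AM         6       453      9.35     10.29     10.51
03:40:01 AM        12       465     11.02     11.00     10.75
03:50:02 AM         3       446      9.75     10.80     10.86
04:00:01 AM         5       430      9.23      9.47     10.07
Average:            7       455     10.19     10.32     10.28

04:09:35 AM       LINUX RESTART
"""


@dataclass
class LoadSample:
    # Field names are an assumption: the post does not show the header line.
    time: str
    runq_sz: int
    plist_sz: int
    ldavg_1: float
    ldavg_5: float
    ldavg_15: float


def parse_sar_q(text):
    """Return one LoadSample per timestamped data row, skipping other lines."""
    samples = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 7 fields: HH:MM:SS, AM/PM, then five numbers.
        if len(parts) != 7 or parts[1] not in ("AM", "PM"):
            continue
        try:
            samples.append(LoadSample(
                f"{parts[0]} {parts[1]}",
                int(parts[2]), int(parts[3]),
                float(parts[4]), float(parts[5]), float(parts[6]),
            ))
        except ValueError:
            continue  # not a numeric row
    return samples


samples = parse_sar_q(SAMPLE)
peak = max(samples, key=lambda s: s.ldavg_1)
mean_ldavg_1 = sum(s.ldavg_1 for s in samples) / len(samples)
print(f"{len(samples)} samples, peak 1-min load {peak.ldavg_1} at {peak.time}")
```

The "Average:" row covers the whole sar reporting period, so it will not
match a mean taken over just these four samples.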

As you can see, hugin restarted at 04:09 after being fenced.  Around
the same time, the other nodes logged the following:

Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
Jun 14 04:09:06 munin kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

After repeated 'Initiating transition' messages, 15 seconds apart, CMAN
gave up and the cluster died.  Our network utilization was very low at
the time.
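
To make the transition pattern in the log excerpt easier to see, here is
a rough Python sketch that extracts the 'Initiating transition' lines
and computes the gap between them.  The regex and function names are
hypothetical, syslog lines carry no year so 2008 is assumed from the
date of this mail, and %b parsing assumes an English locale.

```python
import re
from datetime import datetime

# Log lines copied from the excerpt above.
LOG = """\
Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
"""

# Captures the syslog timestamp, the host name and the generation number.
TRANSITION = re.compile(
    r"^(\w{3} +\d+ \d\d:\d\d:\d\d) (\S+) kernel: "
    r"CMAN: Initiating transition, generation (\d+)"
)


def transitions(log, year=2008):
    """Return (timestamp, host, generation) for each transition line."""
    events = []
    for line in log.splitlines():
        m = TRANSITION.match(line)
        if m:
            ts = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
            events.append((ts, m.group(2), int(m.group(3))))
    return events


def gaps_seconds(events):
    """Seconds elapsed between each pair of consecutive transitions."""
    return [
        int((later[0] - earlier[0]).total_seconds())
        for earlier, later in zip(events, events[1:])
    ]


events = transitions(LOG)
generations = [gen for _, _, gen in events]
gaps = gaps_seconds(events)
print(f"generations {generations}, gaps between restarts (s): {gaps}")
```

Running the same thing over the full /var/log/messages from each node
would show whether the restarts began before generation 58 and whether
all nodes saw the same sequence.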

Any ideas?

Shawn

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster