GS R wrote:
On 6/16/08, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:
All,
This message was sent out to my office, so the voice may seem a bit
odd. We have a 4-node cluster running RHEL4U6 on Dell PowerEdge
1950s. Fencing is done via DRAC.
Using packages (from RHN):
cman-kernel-smp-2.6.9-53.13
cman-1.0.17-0.el4_6.5
ccs-1.0.11-1.el4_6.1
fence-1.32.50-2.el4_6.1
lvm2-cluster-2.02.27-2.el4_6.2
dlm-kernel-smp-2.6.9-52.9
dlm-kernheaders-2.6.9-52.9
Our cluster became unstable on Saturday morning. Apparently
hugin stopped sending out heartbeats, causing it to be fenced.
hugin was under heavy load (load average ~10) at the time:
03:30:02 AM 6 453 9.35 10.29 10.51
03:40:01 AM 12 465 11.02 11.00 10.75
03:50:02 AM 3 446 9.75 10.80 10.86
04:00:01 AM 5 430 9.23 9.47 10.07
Average: 7 455 10.19 10.32 10.28
04:09:35 AM LINUX RESTART
As you can see, hugin was fenced at 4:09. The other nodes then began
logging the following:
Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
Jun 14 04:09:06 munin kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
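For anyone who wants to check their own syslog for the same pattern, here is a minimal Python sketch that pulls the "Initiating transition" lines out of a log and reports the gap between consecutive generations per host. The function name `transition_restarts` and the `year` parameter are made up for illustration (syslog timestamps carry no year, so one has to be assumed), and the sample text is just the munin excerpt above with the wrapped line rejoined. It assumes an English locale for the month abbreviation.

```python
import re
from datetime import datetime

# Sample lines copied from the munin excerpt above (wrapped line rejoined).
LOG = """\
Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
"""

# Matches only the "Initiating transition" messages; other CMAN lines are skipped.
TRANSITION = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: CMAN: "
    r"Initiating transition, generation (\d+)"
)


def transition_restarts(text, year=2008):
    """Return (host, generation, seconds_since_previous) for each transition.

    seconds_since_previous is None for the first transition seen on a host.
    The year is an assumption, since syslog does not record it.
    """
    events = []
    previous = {}  # host -> timestamp of its last transition
    for line in text.splitlines():
        match = TRANSITION.match(line)
        if not match:
            continue
        stamp, host, generation = match.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        gap = None
        if host in previous:
            gap = int((when - previous[host]).total_seconds())
        previous[host] = when
        events.append((host, int(generation), gap))
    return events


if __name__ == "__main__":
    for host, generation, gap in transition_restarts(LOG):
        print(host, generation, gap)
```

On the excerpt above this shows the generation counter going up by one every 15 seconds, which is the repeated-restart pattern that precedes the "too many transition restarts" message.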
I guess this is related to a network issue, even though network
utilization was low when this was logged.
The node was not able to receive messages.
I suspect you've hit this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=444751
There's a patch in the bugzilla, and a workaround program you can run
that should help if you can't upgrade the kernel module (see comment #10).
--
Chrissie