Re: cluster instability

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Tue, 17 Jun 2008 08:29:00 +0100

GS R wrote:

On 6/16/08, Shawn Hood <shawnlhood@xxxxxxxxx> wrote:

    All,

    This message was sent out to my office, so the voice may seem a bit
    odd.  We have a 4 node cluster running RHEL4U6 on Dell Poweredge
    1950s.  Fencing is done via DRAC.

    Using packages (from RHN):

    cman-kernel-smp-2.6.9-53.13
    cman-1.0.17-0.el4_6.5
    ccs-1.0.11-1.el4_6.1
    fence-1.32.50-2.el4_6.1
    lvm2-cluster-2.02.27-2.el4_6.2
    dlm-kernel-smp-2.6.9-52.9
    dlm-kernheaders-2.6.9-52.9

    Our cluster became unstable on Saturday morning.  Apparently hugin
    stopped sending out heartbeats, causing it to become fenced.  hugin
    was under heavy load (~10) at the time:

    03:30:02 AM         6       453      9.35     10.29     10.51
    03:40:01 AM        12       465     11.02     11.00     10.75
    03:50:02 AM         3       446      9.75     10.80     10.86
    04:00:01 AM         5       430      9.23      9.47     10.07
    Average:            7       455     10.19     10.32     10.28

    04:09:35 AM       LINUX RESTART

    As you can see, hugin was fenced at 4:09.  The other nodes then began
    logging the following:

    Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
    Jun 14 04:08:21 munin kernel: CMAN: Initiating transition, generation 59
    Jun 14 04:08:36 munin kernel: CMAN: Initiating transition, generation 60
    Jun 14 04:08:51 munin kernel: CMAN: Initiating transition, generation 61
    Jun 14 04:09:06 munin kernel: CMAN: too many transition restarts - will die
    Jun 14 04:09:06 munin kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

I guess this is a network issue, although network utilization was low 
when this was logged.
The node is not able to receive messages.

I suspect you've hit this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=444751

There's a patch in the bugzilla, and a workaround program you can run 
which should help if you can't upgrade the kernel module (see comment #10).
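
For anyone who wants to check whether their own logs show the same pattern of repeated transition restarts as the excerpt quoted above, here is a minimal sketch in Python. It is not taken from the bugzilla and is not the workaround program mentioned there; the function name transition_gaps and the year parameter are hypothetical (syslog timestamps carry no year, so one has to be assumed). It only matches lines in the "CMAN: Initiating transition, generation N" format shown in the quoted log.

```python
import re
from datetime import datetime

# Matches syslog lines in the format quoted in this thread, e.g.:
#   Jun 14 04:08:06 munin kernel: CMAN: Initiating transition, generation 58
TRANSITION = re.compile(
    r"^\s*(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+kernel:\s+"
    r"CMAN: Initiating transition, generation (\d+)"
)


def transition_gaps(lines, year=2008):
    """Return (host, generation, seconds_since_previous) per transition line.

    The year is an assumption: syslog timestamps do not include one.
    seconds_since_previous is None for the first transition seen on a host.
    """
    results = []
    previous = {}  # host -> timestamp of its last transition line
    for line in lines:
        m = TRANSITION.match(line)
        if not m:
            continue  # ignore everything that is not a transition message
        stamp, host, generation = m.groups()
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        gap = None
        if host in previous:
            gap = (when - previous[host]).total_seconds()
        previous[host] = when
        results.append((host, int(generation), gap))
    return results
```

Fed the four "Initiating transition" lines from munin quoted above, this reports generations 58 to 61 with a 15-second gap between each, which is the steady restart cadence that precedes the "too many transition restarts" message.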

--

Chrissie
