Hello all,
We've been encountering an issue with RHCS4 U6 (using the U5 version of
system-config-cluster, as the U6 version is broken) in which the cluster
fails after one of the standby nodes is rebooted, with CMAN dying after
too many transition restarts.
We have a 7-node cluster with 5 active nodes and 2 standby nodes. We
are running the cluster with broadcast mode for cluster communication
(the default for CS4); changing to multicast isn't an option at the
moment because of our Cisco switching infrastructure. The hardware
we're running the cluster on is IBM HS21 blades in two IBM BladeCenter
H chassis (3 blades in one chassis, 4 in the other). Each chassis
network switch module has dual gigabit uplinks to a Cisco switch.
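For reference, the relevant part of our cluster.conf looks roughly like
the trimmed sketch below (the cluster name and everything other than
server01 are placeholders rather than our real values); as I understand
it, CS4 falls back to broadcast when there is no <multicast> element
under <cman>:

  <?xml version="1.0"?>
  <cluster name="testcluster" config_version="10">
    <cman/>   <!-- no <multicast> element, so broadcast is used -->
    <clusternodes>
      <clusternode name="server01" votes="1">
        <fence> ... </fence>
      </clusternode>
      <!-- six more clusternode entries, 7 nodes in total -->
    </clusternodes>
    <fencedevices> ... </fencedevices>
    <rm> ... </rm>
  </cluster>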
We have done a lot of analysis of our network to confirm that the
problem is not being caused by the underlying network preventing the
cluster nodes from talking to one another, so we have ruled that out as
a cause.
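(By way of example, the sort of sanity check we mean is watching the
CMAN traffic on each node while a standby node reboots, along the lines
of the following, assuming CMAN is on its default UDP port, 6809 if I
have that right, and eth0 is the cluster interface:

  tcpdump -n -i eth0 udp port 6809

and confirming that traffic from every other node keeps arriving.)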
The cluster is currently a pre-production system that we are testing
before putting it into production, so the nodes are basically sitting
idle at the moment whilst we test the cluster itself.
What we have seen happening is that the cluster is operational for
several days, and when a reboot is initiated on one of the standby nodes
(which isn't running any clustered services at the time), the other
cluster nodes start filling their logs with:
Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65
The generation number keeps increasing until CMAN dies with:
Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts -
will die
Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster.
Inconsistent cluster view
Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down
uncleanly
Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown.
Attemping to reconnect...
Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate. Refusing
connection.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing connect:
Connection refused
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something
evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid
request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something
evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid
request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-21).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something
evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect:
Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate. Refusing
connection.
The interesting thing is that immediately after rebooting all of the
nodes in the cluster and restarting the cluster services, the problem
cannot be replicated. Typically the cluster has to have been running
untouched for 3-4 days before we can replicate the problem again (i.e. I
reboot one of the standby nodes and CMAN on the other nodes dies as
above).
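For anyone who wants to compare notes, the cluster state when it happens
can be inspected with the usual CS4 tools, e.g. (the /proc/cluster paths
being the ones the kernel-based CMAN exposes):

  cman_tool status              # quorum and membership summary
  cman_tool nodes               # per-node state as CMAN sees it
  cat /proc/cluster/services    # fence/DLM/rgmanager service groups
  clustat                       # rgmanager's view of members and services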
I made a change to cluster.conf yesterday to set the logging facility
and increase the logging level (to debug, level 7), and after using
ccs_tool to apply the change to the running cluster, once again I can't
replicate the problem (even though immediately before this I could).
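For reference, the change was along these lines, with the attributes set
on the <rm> element (the facility value below is just a placeholder;
level 7 is what we set), followed by the usual CS4 procedure for pushing
an updated config out to a running cluster, assuming I have the steps
right:

  <rm log_facility="local4" log_level="7">
    ...
  </rm>

  # bump config_version in cluster.conf first, then:
  ccs_tool update /etc/cluster/cluster.conf
  cman_tool version -r <new config_version>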
Has anyone experienced anything even remotely similar to this (I
couldn't see anything similar reported in the list archives) and/or have
any suggestions as to what might be causing the issue?
Cheers,
Ben