On Wed, 2008-04-23 at 11:33 +0800, Ben J wrote:
> What we have seen happening, is that we have the cluster operational for
> several days and when initiating a reboot of one of the standby nodes
> (that isn't running any clustered services at the time), the other
> cluster nodes start filling the logs with:
>
> Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
> Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65
>
> With the generation number increasing until CMAN dies with:
>
> Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts -
> will die
> Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster.
> Inconsistent cluster view

^^^^ This is the problem.

vvvv These are all caused by that problem, and will go away when the
above is resolved.

> Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
> Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
> Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down
> uncleanly
> Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown.
> Attemping to reconnect...
> <snip>...
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect:
> Invalid request descriptor
> Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate. Refusing
> connection.

> The interesting thing is that immediately after rebooting all of the
> nodes within the cluster and restarting the cluster services, the
> problem cannot be replicated. Typically the cluster system has to have
> been running for 3-4 days untouched before we can then replicate the
> problem again (i.e. I reboot one of the standby nodes and it fails again).
>
> I made a change yesterday to cluster.conf to increase the logging
> facility and logging level (set it to debug level - 7) and after using
> ccs_tool to apply the changes to the cluster online, once again I can't
> replicate the problem (even though immediately before this I could
> replicate the problem).

On RHEL4, there's some ugly arcane thing you need to do after this:

  cman_tool version -r <new_config_version>

I'm not sure this is the cause of the 'too many transitions' problem you
hit. (Unfortunately, I'm not one of the people who fully understands
what causes 'too many transitions'...)

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
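For anyone following along, the two-step update mentioned in the reply (push the edited cluster.conf with ccs_tool, then tell CMAN the new version) could be wrapped roughly as below. This is only a sketch under assumptions: the helper names (get_config_version, run, apply_config) and the DRY_RUN variable are made up for illustration, the `ccs_tool update <file>` form is my understanding of the usual RHEL4 procedure rather than something stated in the thread, and the sed pattern assumes config_version sits on the same line as the opening <cluster tag.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are hypothetical, not part of the
# cluster suite. Assumes config_version="N" is on the same line as <cluster.

# Print the config_version attribute of the <cluster ...> element in file $1.
get_config_version() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Propagate the edited config file $1, then inform CMAN of its version.
apply_config() {
    ver=$(get_config_version "$1")
    if [ -z "$ver" ]; then
        echo "no config_version found in $1" >&2
        return 1
    fi
    run ccs_tool update "$1" && run cman_tool version -r "$ver"
}
```

After bumping config_version in the file by hand, something like `DRY_RUN=1 apply_config /etc/cluster/cluster.conf` shows the commands that would be issued; dropping DRY_RUN runs them for real.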