On Thu, 04 Oct 2007 10:35:13 -0400, Lon Hohberger wrote:
> > What is the correct behaviour? Shouldn't my cluster come up because I have
> > two votes active? In this case each node counts one vote in the cluster, and
> > the quorum disk counts another one.
>
> cman_tool status / cman_tool nodes output would be helpful
>
> Also, which version of cman do you have?
>
> -- Lon

Hi Lon,

Here is some relevant information from the cluster:

** What is happening:

If I boot node1 with node2 powered off, the boot stalls for 5 minutes while
ccsd starts; after that the node regains quorum and qdiskd starts
successfully, but fenced keeps trying to start for 2 minutes and then gives
up with a "failed" message.

** Relevant log messages collected after boot:

Oct 4 11:51:13 hercules01 kernel: CMAN: Waiting to join or form a Linux-cluster
Oct 4 11:51:13 hercules01 ccsd[9144]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
Oct 4 11:51:13 hercules01 ccsd[9144]: Initial status:: Inquorate
Oct 4 11:51:45 hercules01 kernel: CMAN: forming a new cluster
Oct 4 11:56:45 hercules01 cman: Timed-out waiting for cluster failed
                                ^^^^^^^^ 5 minutes later
Oct 4 11:56:45 hercules01 lock_gulmd: no <gulm> section detected in /etc/cluster/cluster.conf succeeded
Oct 4 11:56:45 hercules01 qdiskd: Starting the Quorum Disk Daemon: succeeded
Oct 4 11:57:02 hercules01 kernel: CMAN: quorum regained, resuming activity
Oct 4 11:57:02 hercules01 ccsd[9144]: Cluster is quorate. Allowing connections.
Oct 4 11:58:45 hercules01 fenced: startup failed
                                  ^^^^^^^^ exactly 2 minutes after the qdiskd message above;
                                  I noticed that fenced is started from the init script with
                                  "fence_tool -t 120 join -w"
Oct 4 11:59:38 hercules01 rgmanager: clurgmgrd startup failed
                                     ^^^^^^^^ after the other services boot up OK, rgmanager
                                     fails to start, probably because fenced failed to start
Oct 4 11:56:45 hercules01 qdiskd[9292]: <info> Quorum Daemon Initializing
Oct 4 11:56:55 hercules01 qdiskd[9292]: <info> Initial score 1/1
Oct 4 11:56:55 hercules01 qdiskd[9292]: <info> Initialization complete
Oct 4 11:56:55 hercules01 qdiskd[9292]: <notice> Score sufficient for master operation (1/1; required=1); upgrading
Oct 4 11:57:01 hercules01 qdiskd[9292]: <info> Assuming master role
Oct 4 11:59:08 hercules01 clurgmgrd[10548]: <notice> Resource Group Manager Starting
Oct 4 11:59:08 hercules01 clurgmgrd[10548]: <info> Loading Service Data
Oct 4 11:59:08 hercules01 clurgmgrd[10548]: <info> Initializing Services
... <messages of stopping the services and making sure filesystems are unmounted>
Oct 4 11:59:28 hercules01 clurgmgrd[10548]: <info> Services Initialized

--- no more cluster messages after this point ---

** Daemons status:

# service fenced status
fenced (pid 9304) is running...

# service rgmanager status
clurgmgrd (pid 10548 10547) is running...

** Clustat:

< delay of about 10 seconds >
Timed out waiting for a response from Resource Group Manager
Member Status: Quorate
Resource Group Manager not running; no service information available.

  Member Name                  Status
  ------ ----                  ------
  node1                        Online, Local
  node2                        Offline

** cman_tool nodes

Node  Votes Exp Sts  Name
   0    1    0   M   /dev/emcpowere1
   1    1    3   M   node1

** cman_tool status

Protocol version: 5.0.1
Config version: 12
Cluster name: clu_prosperdb
Cluster ID: 570
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 3
Total_votes: 2
Quorum: 2
Active subsystems: 2
Node name: node1
Node ID: 1
Node addresses: 192.168.50.1
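If I understand the quorum arithmetic correctly (my own reading, not checked
against the cman sources), these numbers are consistent and explain why the
node regains quorum once qdiskd registers its vote:

    # quorum = expected_votes / 2 + 1, integer division (my understanding)
    expected_votes=3
    quorum=$(( expected_votes / 2 + 1 ))        # 3 / 2 + 1 = 2
    total_votes=$(( 1 + 1 ))                    # node1 (1 vote) + /dev/emcpowere1 (1 vote)
    echo "quorum=$quorum total=$total_votes"    # 2 >= 2, so CMAN reports quorate

So the vote counting itself seems to do what I expected; what I do not
understand is fenced giving up after its 120-second join timeout even though
the cluster is quorate by then.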
** Kernel version (uname -r):

RHEL4 U4 with the latest kernel approved by EMC. The EMC eLab qualification
was done for RHEL4 U4, not RHEL 4.5, so we cannot upgrade the kernel unless
we move everything to RHEL 4.5:

2.6.9-42.0.10.ELsmp

** Installed cluster package versions (same on both nodes):

ccs-1.0.10-0.x86_64.rpm
cman-1.0.17-0.x86_64.rpm
cman-kernel-smp-2.6.9-45.15.x86_64.rpm
dlm-1.0.3-1.x86_64.rpm
dlm-kernel-smp-2.6.9-44.9.x86_64.rpm
fence-1.32.45-1.0.2.x86_64.rpm
gulm-1.0.10-0.x86_64.rpm
iddev-2.0.0-4.x86_64.rpm
magma-1.0.7-1.x86_64.rpm
magma-plugins-1.0.12-0.x86_64.rpm
perl-Net-Telnet-3.03-3.noarch.rpm
rgmanager-1.9.68-1.x86_64.rpm
system-config-cluster-1.0.45-1.0.noarch.rpm

** What happens if I boot up the other node (node2):

- ccsd comes up after just a few seconds on node2
- all other cluster daemons start successfully
- fenced and rgmanager on node1 both start
- the logs show node1 starting services when node2 came up:

Oct 4 12:51:44 hercules01 clurgmgrd[10548]: <info> Logged in SG "usrm::manager"
Oct 4 12:51:44 hercules01 clurgmgrd[10548]: <info> Magma Event: Membership Change
Oct 4 12:51:44 hercules01 clurgmgrd[10548]: <info> State change: Local UP
... <messages about services starting and filesystems being mounted>
Oct 4 12:52:24 hercules01 clurgmgrd[10548]: <info> Magma Event: Membership Change
Oct 4 12:52:24 hercules01 clurgmgrd[10548]: <info> State change: node2 UP

The only packages not up to date are the kernel-related ones, which I believe
are the correct ones for my kernel version.

Please tell me if you see any mistake in this setup. The problem is that the
customer cannot bring the systems up if one node happens to be down. If both
nodes are up and one of them goes down, everything works as expected; but as
it is now, if the remaining node reboots, the services cannot come back up.

Thank you very much.

Regards,
--
Celso
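P.S. If more output would help, I can also capture the fence domain / service
group state on node1 while it is stuck in this situation. On this RHEL4
cluster stack I assume something like the following shows it (the exact /proc
paths are my assumption for this cman-kernel version, so correct me if they
are wrong):

    # fence domain and other service group membership/state
    cat /proc/cluster/services
    # overall cluster state as seen by the cman kernel module
    cat /proc/cluster/status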