Hello everyone,

I'm experiencing a strange phenomenon on one of our RHCS clusters. During a scheduled downtime I had to run a few cluster tests, in the course of which I also fenced one node (by issuing "fence_node barosic" from the other node of this two-node cluster). That node is now causing me some pain: it refuses to start any service, even when told to explicitly via the "-m" option of e.g. the clusvcadm command. It looks to me as if communication with clurgmgrd on this node is disrupted, although the daemon is running. This also shows in the incomplete clustat output when compared to that of the fully integrated cluster node (i.e. arubaic in this case). At the moment I'm not allowed to trigger a service relocation to show the resulting output, because that would require another scheduled downtime; all I can run right now are commands that don't affect the running services.

Here is clustat's output on the "working" node (in agreement with the customer I froze all services to prevent any unwanted meddling by clurgmgrd, since we aren't HA in the current situation anyway):

[root@aruba:~] # clustat
Cluster Status for rhcs-voebb @ Wed Aug 31 09:43:10 2011
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 arubaic                         1 Online, Local, RG-Master
 barosic                         2 Online

 Service Name                 Owner (Last)                 State
 ------- ----                 ----- ------                 -----
 service:alma                 arubaic                      started   [Z]
 service:lola                 arubaic                      started   [Z]
 service:vb_bz_zlb            arubaic                      started   [Z]

When I issue the same command on the reluctant node, however, I get this:

[root@baros:~] # clustat
Cluster Status for rhcs-voebb @ Wed Aug 31 09:44:46 2011
Member Status: Quorate

 Member Name                  ID   Status
 ------ ----                  ---- ------
 arubaic                         1 Online
 barosic                         2 Online, Local

I monitor our RHCS clusters through Nagios and defined a check_multi command for this purpose that checks what I consider the vital functions of the RHCS cluster stack. Its OK output also tells me that all the required daemons are running on barosic. Here is that check run on barosic:

[nagios@baros:~] $ /usr/lib64/nagios/plugins/contrib/check_multi/libexec/check_multi -l /usr/lib64/nagios/plugins -f /etc/nagios/check_multi/rhcs_status.cmd
OK - 20 plugins checked, 20 ok
[ 1] proc_ccsd          PROCS OK: 1 process with command name 'ccsd'
[ 2] proc_clurgmgrd     PROCS OK: 2 processes with command name 'clurgmgrd'
[ 3] proc_fenced        PROCS OK: 1 process with command name 'fenced'
[ 4] proc_groupd        PROCS OK: 1 process with command name 'groupd'
[ 5] proc_clvmd         PROCS OK: 1 process with command name 'clvmd'
[ 6] proc_gfs_controld  PROCS OK: 1 process with command name 'clvmd'
[ 7] proc_dlm_controld  PROCS OK: 1 process with command name 'clvmd'
[ 8] ic_node_ip         192.168.5.58
[ 9] ic_bond_dev        bond1
[10] ic_mii_status      up
[11] ic_slave1          eth1
[12] ic_slave2          eth4
[13] slave1_props       8000Mb/s Full yes
[14] slave2_props       8000Mb/s Full yes
[15] slave1_link        yes
[16] slave2_link        yes
[17] slave1_speed       8000
[18] slave2_speed       8000
[19] slave1_mode        full
[20] slave2_mode        full|check_multi::check_multi::plugins=20 time=0.257608

cman_tool also reports everything as OK for barosic (if I'm interpreting its output correctly; see the output below). Yet I am not able to relocate any of the three services onto barosic. What could be going wrong or missing, and where else should I look?
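To spell out the commands I'm referring to (a rough recap using the service and node names from the clustat output above, not a verbatim session log):

  clusvcadm -Z alma               # freeze a service, done for all three services
  clusvcadm -U alma               # unfreeze it again once we are past the downtime
  clusvcadm -r alma -m barosic    # relocate a service explicitly to barosic -- this has no effect
  clusvcadm -e alma -m barosic    # explicitly starting it on barosic fails in the same way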
Regards
Ralph

[root@baros:~] # cman_tool status
Version: 6.2.0
Config Version: 64
Cluster Name: rhcs-voebb
Cluster Id: 44402
Cluster Member: Yes
Cluster Generation: 516
Membership state: Cluster-Member
Nodes: 2
Expected votes: 1
Total votes: 2
Quorum: 1
Active subsystems: 9
Flags: 2node Dirty
Ports Bound: 0 11
Node name: barosic
Node ID: 2
Multicast addresses: 239.192.173.32
Node addresses: 192.168.5.58

[root@baros:~] # cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M    516   2011-08-28 19:27:38  arubaic
   2   M    512   2011-08-28 19:27:38  barosic

[root@baros:~] # cman_tool services
type             level name        id       state
fence            0     default     00010001 none
[1 2]
dlm              1     clvmd       00020001 none
[1 2]
dlm              1     rgmanager   00010002 none
[1 2]
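P.S. In case the Nagios check definition matters: the proc_* entries in rhcs_status.cmd are plain check_procs calls, roughly along these lines (sketched here for illustration, not copied verbatim from the file):

  # /etc/nagios/check_multi/rhcs_status.cmd (excerpt, illustrative)
  command [ proc_ccsd      ] = check_procs -c 1: -C ccsd
  command [ proc_clurgmgrd ] = check_procs -c 1: -C clurgmgrd
  command [ proc_fenced    ] = check_procs -c 1: -C fenced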