Hi,

I have a problem with RHCS 4 in a two-node configuration (wayne and eastwood). The service ucpgw runs on wayne and httpd runs on eastwood. Each node is a Sun V40z server, and fencing is done via IPMI. During cluster tests I unplug both power cables from one server (wayne) to simulate an unexpected power-off (the IPMI interface is also unavailable while the server has no power). Unfortunately, the cluster (with only one node active) is not able to start the missing service; wayne is able to start both services only after a reboot.

clustat output before the test:

Member Status: Quorate

  Member Name                        Status
  ------ ----                        ------
  wayne                              Online, rgmanager
  eastwood                           Online, Local, rgmanager

  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  ucpgw                wayne                          started
  httpd                eastwood                       started

clustat output after the test (clustat waits about 10 seconds on a timeout before displaying this message):

Timed out waiting for a response from Resource Group Manager

Member Status: Quorate

  Member Name                        Status
  ------ ----                        ------
  wayne                              Offline
  eastwood                           Online, Local, rgmanager

Here are the logs from the running node:

May 29 08:08:24 eastwood clurgmgrd: [3252]: <info> Executing /etc/init.d/httpd status
May 29 08:08:37 eastwood kernel: CMAN: removing node wayne from the cluster : Missed too many heartbeats
May 29 08:09:01 eastwood crond(pam_unix)[19600]: session opened for user root by (uid=0)
May 29 08:09:01 eastwood crond(pam_unix)[19600]: session closed for user root
May 29 08:10:01 eastwood crond(pam_unix)[19610]: session opened for user root by (uid=0)
May 29 08:10:01 eastwood crond(pam_unix)[19610]: session closed for user root
May 29 08:10:07 eastwood fenced[2854]: wayne not a cluster member after 90 sec post_fail_delay
May 29 08:10:07 eastwood fenced[2854]: fencing node "wayne"
May 29 08:11:01 eastwood crond(pam_unix)[19623]: session opened for user root by (uid=0)
May 29 08:11:01 eastwood crond(pam_unix)[19623]: session closed for user root
May 29 08:11:51 eastwood ntpd[2895]: can't open /var/ntp/ntp.drift.TEMP: No such file or directory
May 29 08:12:01 eastwood crond(pam_unix)[19640]: session opened for user root by (uid=0)
May 29 08:12:01 eastwood crond(pam_unix)[19640]: session closed for user root
May 29 08:12:27 eastwood fenced[2854]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:10.12.83.177...ipmilan: Failed to connect after 30 seconds Failed
May 29 08:12:27 eastwood fenced[2854]: fence "wayne" failed
May 29 08:12:32 eastwood fenced[2854]: fencing node "wayne"
May 29 08:12:32 eastwood ccsd[2751]: process_get: Invalid connection descriptor received.
May 29 08:12:32 eastwood ccsd[2751]: Error while processing get: Invalid request descriptor
May 29 08:12:32 eastwood fenced[2854]: fence "wayne" failed
May 29 08:12:37 eastwood fenced[2854]: fencing node "wayne"
May 29 08:12:37 eastwood ccsd[2751]: process_get: Invalid connection descriptor received.
May 29 08:12:37 eastwood ccsd[2751]: Error while processing get: Invalid request descriptor
May 29 08:12:37 eastwood fenced[2854]: fence "wayne" failed
May 29 08:12:42 eastwood fenced[2854]: fencing node "wayne"
May 29 08:12:42 eastwood ccsd[2751]: process_get: Invalid connection descriptor received.
May 29 08:12:42 eastwood ccsd[2751]: Error while processing get: Invalid request descriptor
May 29 08:12:42 eastwood fenced[2854]: fence "wayne" failed
May 29 08:12:47 eastwood fenced[2854]: fencing node "wayne"
May 29 08:12:47 eastwood ccsd[2751]: process_get: Invalid connection descriptor received.
May 29 08:12:47 eastwood ccsd[2751]: Error while processing get: Invalid request descriptor
May 29 08:12:47 eastwood fenced[2854]: fence "wayne" failed
May 29 08:12:52 eastwood fenced[2854]: fencing node "wayne"
May 29 08:12:52 eastwood ccsd[2751]: process_get: Invalid connection descriptor received.
May 29 08:12:52 eastwood ccsd[2751]: Error while processing get: Invalid request descriptor
May 29 08:12:52 eastwood fenced[2854]: fence "wayne" failed
May 29 08:12:57 eastwood fenced[2854]: fencing node "wayne"
May 29 08:12:57 eastwood ccsd[2751]: process_get: Invalid connection descriptor received.
May 29 08:12:57 eastwood ccsd[2751]: Error while processing get: Invalid request descriptor

Here is my cluster config:

<?xml version="1.0"?>
<cluster config_version="75" name="ucp_cluster">
    <fence_daemon post_fail_delay="90" post_join_delay="30"/>
    <clusternodes>
        <clusternode name="wayne" votes="1">
            <multicast addr="224.0.0.1" interface="bond0"/>
            <fence>
                <method name="1">
                    <device name="ipmi-wayne"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="eastwood" votes="1">
            <multicast addr="224.0.0.1" interface="bond0"/>
            <fence>
                <method name="1">
                    <device name="ipmi-eastwood"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1">
        <multicast addr="224.0.0.1"/>
    </cman>
    <fencedevices>
        <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.12.83.176" login="" name="ipmi-eastwood" passwd="**********"/>
        <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.12.83.177" login="" name="ipmi-wayne" passwd="**********"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="ucp" ordered="1" restricted="1">
                <failoverdomainnode name="wayne" priority="1"/>
                <failoverdomainnode name="eastwood" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="10.12.17.135" interface="bond0" monitor_link="0"/>
            <ip address="10.12.17.136" monitor_link="1"/>
            <fs device="/dev/ucpsmslogvg/ucpsmslog" force_fsck="0" force_unmount="1" fsid="38769" fstype="ext3" mountpoint="/ucpsmslog" name="ucpsmslog" options="" self_fence="1"/>
            <fs device="/dev/ucpgwlogvg/ucpgwlog" force_fsck="0" force_unmount="1" fsid="39307" fstype="ext3" mountpoint="/ucpgwlog" name="ucpgwlog" options="" self_fence="1"/>
            <script file="/home/ucpgw/bin/ucpgw" name="ucpgw"/>
            <script file="/etc/init.d/httpd" name="httpd"/>
        </resources>
        <service autostart="1" domain="ucp" name="ucpgw" recovery="relocate">
            <script ref="ucpgw"/>
            <fs ref="ucpgwlog"/>
            <ip ref="10.12.17.136"/>
        </service>
        <service autostart="1" domain="ucp" name="httpd" recovery="relocate">
            <ip ref="10.12.17.135"/>
            <script ref="httpd"/>
            <fs ref="ucpsmslog"/>
        </service>
    </rm>
</cluster>

Is this cluster misconfigured, or is this a bug in the fenced/ccsd subsystem? How can I solve this problem?

Regards,
Tomasz Koczorowski
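P.S. To make the question more concrete: since wayne's IPMI interface is dead while the server has no power, I was wondering whether chaining a second fence method as a fallback is the intended way to handle this case. The snippet below is only an untested sketch of what I mean (the "manual-fence" device and its name are just placeholders, not something I am running):

        <clusternode name="wayne" votes="1">
            <multicast addr="224.0.0.1" interface="bond0"/>
            <fence>
                <!-- method 1: try IPMI fencing first -->
                <method name="1">
                    <device name="ipmi-wayne"/>
                </method>
                <!-- method 2: fall back to manual fencing if IPMI is unreachable -->
                <method name="2">
                    <device name="manual-fence" nodename="wayne"/>
                </method>
            </fence>
        </clusternode>

        <fencedevices>
            <fencedevice agent="fence_ipmilan" auth="none" ipaddr="10.12.83.177" login="" name="ipmi-wayne" passwd="**********"/>
            <fencedevice agent="fence_manual" name="manual-fence"/>
        </fencedevices>

Would something like that be the recommended approach, or is there a better way to let the surviving node recover services when the fence device itself has lost power?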