Hello,

I think I've figured out why this is happening. I've opened bz219633 to
track this issue, in case you would like to subscribe to it. I should have
something useful for you to test today.

Josef

On Thu, Dec 14, 2006 at 03:27:16PM +0000, Marcos David wrote:
> Sure.
> If you can get me the package, I can install it on our testing
> environment and provide you with the results.
>
> Greets,
> Marcos David
>
>
> Josef Whiter wrote:
> >Hello,
> >
> >I know somebody else seeing this problem as well. It's a problem with
> >ccsd: fenced goes to do a ccs_get() to get the next fence method and it
> >fails because ccsd doesn't have an open connection struct to process
> >that request. I'm getting ready to build a debug ccs package for the
> >other individual experiencing this problem. Would you be willing to run
> >it as well and provide feedback?
> >Thank you,
> >
> >Josef
> >
> >On Thu, Dec 14, 2006 at 03:19:21PM +0000, Marcos David wrote:
> >
> >>Hello,
> >>I still need help with this one ;)
> >>
> >>help! please!
> >>
> >>Thanks.
> >>
> >>Marcos David wrote:
> >>
> >>>hello,
> >>>I'm experiencing some problems with cluster fencing.
> >>>First let's start with the specs:
> >>>
> >>>it's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4
> >>>
> >>>the machines both have an ILOM device that acts as a first level of
> >>>fencing. then there is a second level of fencing that is performed by
> >>>a UPS.
> >>>
> >>>my problem is the following:
> >>>if I shut down one of the nodes (simulating a power failure) the other
> >>>tries to fence the failed node. So far so good.
> >>>The problem is that since the ILOM in the node is offline the second
> >>>node keeps trying to fence the ILOM device and never gives up!
> >>>
> >>>According to what I've read on the FAQ about fencing levels, if the
> >>>first level fails it should go to the second level, and so on...
> >>>
> >>>But it never does this!
> >>>
> >>>Here is a copy of the /var/log/messages:
> >>>
> >>>Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the cluster : Missed too many heartbeats
> >>>Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after 0 sec post_fail_delay
> >>>Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
> >>>Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed
> >>>Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection descriptor received.
> >>>Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid request descriptor
> >>>Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
> >>>Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
> >>>
> >>>the last 4 lines repeat forever....
> >>>
> >>>here is a copy of the cluster.conf
> >>>
> >>>
> >>><?xml version="1.0"?>
> >>><cluster config_version="19" name="SERVER-A">
> >>>  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >>>  <clusternodes>
> >>>    <clusternode name="node-a" votes="1">
> >>>      <fence>
> >>>        <method name="1">
> >>>          <device name="fence_node-a"/>
> >>>        </method>
> >>>        <method name="2">
> >>>          <device name="UPS_node-a"/>
> >>>        </method>
> >>>      </fence>
> >>>    </clusternode>
> >>>    <clusternode name="node-b" votes="1">
> >>>      <fence>
> >>>        <method name="1">
> >>>          <device name="fence_node-b"/>
> >>>        </method>
> >>>        <method name="2">
> >>>          <device name="UPS_node-b"/>
> >>>        </method>
> >>>      </fence>
> >>>    </clusternode>
> >>>  </clusternodes>
> >>>  <cman expected_votes="1" two_node="1"/>
> >>>  <fencedevices>
> >>>    <fencedevice agent="fence_ipmilan" auth="password" ipaddr="172.18.57.17" login="root" name="fence_node-a" passwd="changeme"/>
> >>>    <fencedevice agent="fence_ipmilan" auth="password" ipaddr="172.18.57.18" login="root" name="fence_node-b" passwd="changeme"/>
> >>>    <fencedevice agent="fence_apc" ipaddr="172.18.57.20" login="power" name="UPS_node-a" passwd="power"/>
> >>>    <fencedevice agent="fence_apc" ipaddr="172.18.57.21" login="power" name="UPS_node-b" passwd="power"/>
> >>>  </fencedevices>
> >>>  <rm>
> >>>    <failoverdomains>
> >>>      <failoverdomain name="Cluster_0" ordered="1" restricted="0">
> >>>        <failoverdomainnode name="node-a" priority="1"/>
> >>>        <failoverdomainnode name="node-b" priority="1"/>
> >>>      </failoverdomain>
> >>>    </failoverdomains>
> >>>    <resources>
> >>>      <fs device="/dev/sdb1" force_fsck="1" force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared" name="Storedge_Shared" options="" self_fence="1"/>
> >>>      <ip address="172.18.57.16" monitor_link="1"/>
> >>>      <ip address="172.18.57.11" monitor_link="1"/>
> >>>      <ip address="172.18.57.14" monitor_link="1"/>
> >>>    </resources>
> >>>    <service autostart="1" domain="Cluster_0" name="postgresql">
> >>>      <ip ref="172.18.57.16">
> >>>        <fs ref="Storedge_Shared">
> >>>          <script file="/etc/init.d/postgresql" name="PostgreSQL"/>
> >>>        </fs>
> >>>      </ip>
> >>>    </service>
> >>>    <service autostart="1" domain="Cluster_0" name="afs">
> >>>      <ip ref="172.18.57.14">
> >>>        <script file="/etc/init.d/afs" name="AFS"/>
> >>>      </ip>
> >>>    </service>
> >>>  </rm>
> >>></cluster>
> >>>
> >>>I would like to know a way to solve this problem....
> >>>:-)
> >>>
> >>>Thanks in advance,
> >>>
> >>>Marcos David
> >>>
> >>>--
> >>>Linux-cluster mailing list
> >>>Linux-cluster@xxxxxxxxxx
> >>>https://www.redhat.com/mailman/listinfo/linux-cluster
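
For anyone following along, the fence levels in the quoted cluster.conf can
be listed with a short script. This is only a sketch, assuming Python 3 is
available. The names fence_levels and CLUSTER_CONF are made up for this
example, and the embedded XML is a trimmed copy of the fencing sections of
the quoted file (address, login and password attributes dropped). It does
not talk to ccsd or fenced. It only prints the methods in the order they are
declared for each node, which, going by the FAQ as described above, is the
order in which they are expected to be tried.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the fencing-related parts of the cluster.conf quoted above.
# Attributes not needed for this illustration have been left out.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
    <clusternode name="node-b" votes="1">
      <fence>
        <method name="1"><device name="fence_node-b"/></method>
        <method name="2"><device name="UPS_node-b"/></method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ipmilan" name="fence_node-a"/>
    <fencedevice agent="fence_ipmilan" name="fence_node-b"/>
    <fencedevice agent="fence_apc" name="UPS_node-a"/>
    <fencedevice agent="fence_apc" name="UPS_node-b"/>
  </fencedevices>
</cluster>
"""


def fence_levels(conf_text):
    """Map each node name to its fence methods, in the order they are declared.

    Each method is a tuple (method_name, [(device_name, agent), ...]).
    The agent is None when a device is referenced but has no matching
    <fencedevice> entry.
    """
    root = ET.fromstring(conf_text)
    # Look-up table: fence device name -> agent that drives it.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}
    levels = {}
    for node in root.iter("clusternode"):
        methods = []
        for method in node.findall("./fence/method"):
            devices = [(dev.get("name"), agents.get(dev.get("name")))
                       for dev in method.findall("device")]
            methods.append((method.get("name"), devices))
        levels[node.get("name")] = methods
    return levels


if __name__ == "__main__":
    for node, methods in fence_levels(CLUSTER_CONF).items():
        for name, devices in methods:
            for dev, agent in devices:
                print(f"{node}: method {name} -> {dev} ({agent or 'NOT DEFINED'})")
```

For the configuration above, both nodes should show method 1 backed by
fence_ipmilan and method 2 backed by fence_apc, so the second level is
declared correctly and the retry loop in the log is not explained by the
config itself.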