Hi people.

I had this problem last spring while configuring a RH cluster for a local telco. RH technical support was not very helpful; they told me this is not a bug, and so on... So I would like to ask here on the RH cluster list, in the hope of better advice.

I have a 2-node cluster with RSA II management cards (fence_rsa agent), configured to run one Oracle database as a failover service together with a VIP address and 5 LUNs shared from EMC storage. How can I pass a simple test of pulling the main data ethernet cables out of the active node?

Let's say I have interface bond0 (data subnet/VLAN) and bond1 (fence subnet/VLAN) on each node. Our customers (and we too, logically) expect that if we pull both data cables out of bond0, the inactive node will kill/fence the active node and take over its services. Unfortunately, what we see almost every time during the acceptance test is that the two nodes kill each other, no matter whether they have a link or not. (A rough sketch of the relevant cluster.conf layout is appended at the end of this mail.)

Here is a fragment from /var/adm/messages on the active node when I disable bond0 (by pulling out the cables):

---------------------------------------------------------------------
Jan 9 14:05:43 north clurgmgrd: [4593]: <warning> Link for bond0: Not detected
Jan 9 14:05:43 north clurgmgrd: [4593]: <warning> No link on bond0...
Jan 9 14:05:43 north clurgmgrd[4593]: <notice> status on ip "10.156.10.32/26" returned 1 (generic error)
Jan 9 14:05:43 north clurgmgrd[4593]: <notice> Stopping service ora_PROD
Jan 9 14:05:53 north kernel: CMAN: removing node south from the cluster : Missed too many heartbeats
Jan 9 14:05:53 north fenced[4063]: north not a cluster member after 0 sec post_fail_delay
Jan 9 14:05:53 north fenced[4063]: fencing node "south"
Jan 9 14:05:55 north shutdown: shutting down for system halt
Jan 9 14:05:55 north init: Switching to runlevel: 0
Jan 9 14:05:55 north login(pam_unix)[4599]: session closed for user root
Jan 9 14:05:56 north rgmanager: [4270]: <notice> Shutting down Cluster Service Manager...
Jan 9 14:05:56 north clurgmgrd[4593]: <notice> Shutting down
Jan 9 14:05:56 north fenced[4063]: fence "south" success
[...]
Jan 9 14:11:19 north syslogd 1.4.1: restart.
----------------------------------------------------------

As you can see, clurgmgrd(8) on node "north" DETECTED that there is no link, began stopping the service "ora_PROD", and the system went into shutdown. So far, so good. But then the fenced(8) daemon decided to fence node "south" (the healthy node, which has a data link and everything it needs to take over the ora_PROD service: Oracle + IP + 5 ext3 filesystems from EMC storage)! Why? Of course, south is also fencing north, and I end up with the tragicomic situation where both nodes get rebooted by each other.

How can I prevent this? It looks like a bug to me. I don't want fenced to fence the other node (south) when it already "knows" that it is the one without a link. What to do? We cannot pass acceptance tests with the cluster in such a state. :-(

Thanks for any advice ...

--
Miroslav Zubcic, Nimium d.o.o., email: <mvz@xxxxxxxxx>
Tel: +385 01 4852 639, Fax: +385 01 4852 640, Mobile: +385 098 942 8672
Mrazoviceva 12, 10000 Zagreb, Hrvatska

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
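
P.S. To make the layout concrete, here is a rough sketch of the kind of cluster.conf such a setup uses. The node names (north/south), the service name (ora_PROD), the VIP and the zero-second post_fail_delay are the real ones from the logs above; everything else (cluster name, RSA addresses, credentials, device paths, mount points) is just an illustrative placeholder, not our actual configuration:

---------------------------------------------------------------------
<?xml version="1.0"?>
<cluster name="telcocluster" config_version="1">
  <!-- two_node="1" lets the cluster keep quorum with a single vote -->
  <cman two_node="1" expected_votes="1"/>
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="north" votes="1">
      <fence>
        <method name="1">
          <!-- RSA II card of node north -->
          <device name="rsa-north"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="south" votes="1">
      <fence>
        <method name="1">
          <device name="rsa-south"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- RSA addresses and credentials below are placeholders -->
    <fencedevice agent="fence_rsa" name="rsa-north" ipaddr="10.156.20.1" login="USERID" passwd="XXXXXX"/>
    <fencedevice agent="fence_rsa" name="rsa-south" ipaddr="10.156.20.2" login="USERID" passwd="XXXXXX"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="ora_domain" restricted="1" ordered="0">
        <failoverdomainnode name="north" priority="1"/>
        <failoverdomainnode name="south" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service name="ora_PROD" domain="ora_domain" autostart="1">
      <!-- with monitor_link="1" rgmanager checks the NIC carrier,
           which is where the "Link for bond0: Not detected" warning comes from -->
      <ip address="10.156.10.32/26" monitor_link="1"/>
      <!-- one fs resource per EMC LUN, five in total; only one shown here -->
      <fs name="oradata" device="/dev/emcpowera1" mountpoint="/u01/oradata" fstype="ext3" force_unmount="1"/>
      <script name="oracle" file="/etc/init.d/oracle"/>
    </service>
  </rm>
</cluster>
----------------------------------------------------------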