Fencing deadlock under Cluster Suite v4, how to solve?

Hello all,

I'm having a strange problem. Here is the scenario:
* 2-node GFS cluster on 2 Dell PE-2900 servers;
* 1 Dell|EMC CX300 storage, with servers direct attached using two HBAs each;
* RHEL AS 4 Update 4, no updates applied;
* Red Hat Cluster Suite v4 Update 4, no updates applied;
* Red Hat GFS Update 4, no updates applied;
* Using IPMI over LAN fencing.

The cluster configuration was straightforward, and the GFS filesystems worked fine.

Since the Dell PowerEdge x9xx series now supports IPMI on both LOMs (onboard NICs) as a configurable failover option, we decided to "channel bond" eth0 and eth1 (the onboard NICs) together so that both the normal network traffic and the heartbeat traffic run over a redundant channel (bond0). Since IPMI works over both NICs, fencing is expected to work even if one of the NICs or cables goes down.

Now the problem: whenever I pull both cables from one server, the servers detect each other as offline almost simultaneously (the logs show "serverX lost too many heartbeats, removing it from the Cluster"). A few seconds later, each server fences the other, at the same time!

As far as I can tell, there is some delay between sending the IPMI "power off" command and the actual power-off performed by the embedded IPMI controller. By the way, there is no "normal shutdown" triggered by ACPI or APM; both are turned off on the servers.

So it seems that when the first server kills the other, the second server still has enough time to send the IPMI command that kills the first one. A few seconds later both are powered off, so my redundant environment goes down altogether.
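
To make the timing concrete, here is a minimal Python sketch of the race as I understand it. It is only a model of the reasoning above, not how the cluster software or the IPMI controller actually works internally; the function name, the send times, and the 4-second power-off latency are all made-up values for illustration.

```python
def simulate_fence_race(send_a, send_b, latency):
    """Return the set of nodes that end up powered off.

    send_a, send_b: times (seconds) at which nodes A and B would send the
                    IPMI power-off command (hypothetical values).
    latency:        assumed delay between sending the command and the
                    target actually losing power.
    """
    # The earlier sender always gets its command out, because nothing
    # has been sent against it yet.
    if send_a <= send_b:
        first, second = "A", "B"
    else:
        first, second = "B", "A"
    t_first = min(send_a, send_b)
    t_second = max(send_a, send_b)

    # The first command always lands, so the second node goes down.
    powered_off = {second}

    # The second node is still running until t_first + latency. If it
    # sends its own command before that moment, the first node dies too.
    if t_second < t_first + latency:
        powered_off.add(first)
    return powered_off


# Both nodes notice the failure within half a second of each other.
print(sorted(simulate_fence_race(10.0, 10.5, 4.0)))  # ['A', 'B']

# One node sends its command later than the assumed latency.
print(sorted(simulate_fence_race(10.0, 15.0, 4.0)))  # ['B']
```

In this model both servers go down whenever the two commands are sent within one power-off latency of each other, which matches what I am seeing. Only when the send times are further apart than that latency does one server survive.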

Question: is anyone aware of a solution for this? Is there a way for a server to notify the other that it is removing it from the cluster? Maybe using a shared disk? By the way, I haven't experimented with the new shared disk feature under CS v4, only with CS v3.

Thank you all in advance.

Regards,

Celso.
--
*Celso Kopp Webber*
