On Thu, 17 Apr 2008, Andrew Lacey wrote:
> I am doing some testing on a 2-node, active/standby RHEL 4 cluster with non-GFS shared storage. I am using HP iLO for fencing. I don't have a quorum disk set up. Both cluster nodes are connected to the same switch, and that network path is used for cluster communication as well as general network communication (including access to iLO). I've found that when the switch goes down and comes back up, the result is not desirable. As soon as the switch loses power, each node starts trying to fence the other. Since the iLO is not reachable, this is unsuccessful, but the nodes keep retrying the fence. When the switch comes back online, the "OK Corral" scenario takes place -- both nodes fence each other simultaneously and bring down the cluster.
I had a similar issue. The solution I went for was to doctor the fencing agent so that it waits for a delay based on the node's priority before it fences. That way the nodes don't try to fence each other simultaneously, but in a staggered fashion.
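For illustration only, the staggering idea could look something like the sketch below. This is not my actual patch; the names fence_delay, delayed_fence and step_seconds are made up for the example, and fence_action stands in for whatever the real agent does to power off the peer. The assumption is that each node has a distinct integer priority, with 0 being the node that should win the race.

```python
import time


def fence_delay(priority, step_seconds=10):
    """Return how long this node should wait before fencing.

    priority 0 fences immediately; each lower-priority node waits one
    more step. Both names are hypothetical, not part of any real agent.
    """
    if priority < 0:
        raise ValueError("priority must be >= 0")
    return priority * step_seconds


def delayed_fence(priority, fence_action, step_seconds=10, sleep=time.sleep):
    """Wait for this node's delay, then run the real fence action.

    fence_action is a placeholder callable for the actual power-off
    request (e.g. the call to the iLO); sleep is injectable for testing.
    """
    sleep(fence_delay(priority, step_seconds))
    return fence_action()
```

The step presumably needs to be longer than the time a fence takes to complete, so that the lower-priority node is already powered off before its own delay runs out.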
If you have a spare NIC, and the nodes are next to each other, you could make them use a cross-over cable for their cluster communication, so they would notice that they are both still up even when the switch dies. That's what I do.
Gordan