Re: Repeated fencing

"Dirk H. Schulz" <dirk.schulz@xxxxxxxxxxxxx> · Wed, 24 Feb 2010 15:53:33 +0100

As far as I recall, the docs say that a switch capable of IGMP etc. is 
needed. If that is correct, there must be issues with crossover cables.

I did not investigate further and took an old Cisco 2950 (you can buy 
masses of used ones on ebay very cheap) and configured it according to 
the docs.

To avoid the switch being the SPOF, you can use two 3750s with the 
extended image; they can use LACP to bond links connected to both of 
them. That gets a bit more expensive, admittedly, but even those should 
be available on ebay.

Dirk

On 22.02.10 20:53, Doug Tucker wrote:
We did.  It's problematic when you need to reboot a switch or it goes
down.  They can't talk and try to fence each other.  Crossover cable is
a direct connection, actually far more efficient for what you are trying
to accomplish.

On Mon, 2010-02-22 at 11:57 -0600, Paul M. Dyer wrote:

Crossover cable??????

With all the $$ spent, try putting a switch between the nodes.

Paul

----- Original Message -----
From: "Doug Tucker"<tuckerd@xxxxxxxxxxxx>
To: linux-cluster@xxxxxxxxxx
Sent: Monday, February 22, 2010 10:15:49 AM (GMT-0600) America/Chicago
Subject:  Repeated fencing

We have a 2-node 4.x cluster that has developed an issue we are unable to
resolve.  Starting back in December, the nodes began fencing each other
randomly, as frequently as once a day.  There is nothing at the
console prior to it happening, and nothing in the logs.  We have not
been able to find any pattern so far: the 2 nodes appear to be
functioning fine, then suddenly a message appears in the logs about
"node x missed too many heartbeats" and the next thing you see is it
fencing the node.  Thinking we possibly had a hardware issue, we
replaced both nodes from scratch with new machines, but the problem
persists.  The cluster communication is done via a crossover cable on
eth1 on both devices with private IPs.  We have a 2nd cluster that is
not having this issue, and both nodes have been up for over 160 days.
The configuration is basically identical to the problematic cluster.
The only difference between the 2 now is the newer hardware on the
problematic cluster (previously, that was identical), and the kernel.  The
non-problematic cluster is still running kernel 89.0.9 and the
problematic cluster is on 89.0.11.  At this point we are afraid to let
our non-problematic cluster upgrade to the latest packages.  Any insight
or advice would be greatly appreciated; we have exhausted our ideas
here.

Sincerely,

Doug
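
One way to look for a pattern in the fencing events described above is
to count the heartbeat messages by hour of day.  What follows is only a
minimal Python sketch and not part of the original post.  It assumes
syslog-style timestamps (e.g. "Feb 22 10:15:49") and that the log line
contains the phrase quoted above; the real wording in the logs may
differ.  The name heartbeat_misses_by_hour is hypothetical.

```python
import re
from collections import Counter

# Assumption: syslog-style prefix such as "Feb 22 10:15:49 node1 ...".
TIMESTAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):(\d{2})\s")

# Phrase taken from the message quoted in the post; adjust it if the
# actual log wording differs.
PHRASE = "missed too many heartbeats"


def heartbeat_misses_by_hour(lines):
    """Count lines containing PHRASE, grouped by hour of day (0-23)."""
    counts = Counter()
    for line in lines:
        if PHRASE not in line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            # Group 3 is the two-digit hour field of the timestamp.
            counts[int(match.group(3))] += 1
    return counts
```

Feeding it the lines of whichever log file holds the cluster messages
gives a per-hour tally; a spike at one hour would point at something
scheduled (a cron job, a backup) rather than at random failures.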

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
