I considered that, but if multicast were blocked at the switches I would expect more than one node to be lost.
On 12/06/14 12:12 AM, Netravali, Ganesh wrote:
Make sure multicast is enabled across the switches.
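A quick way to verify that end to end, assuming the nodes can reach each other directly (the hostnames below are placeholders), is omping, which tests both unicast and multicast delivery between cluster nodes:

# run the same command on all four nodes at the same time; multicast
# losses, or multicast responses that never start arriving, usually
# point at IGMP snooping / querier problems on the switches
omping -c 60 node1 node2 node3 node4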
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Schaefer, Micah
Sent: Thursday, June 12, 2014 1:20 AM
To: linux clustering
Subject: Re: Node is randomly fenced
Okay, I set up active/backup bonding and will watch for any change.
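For reference, a minimal sketch of what an RHEL 6-style active-backup bond looks like; the device names (bond0, em1, em2) and the masked addressing are assumptions, not details from this thread:

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BONDING_OPTS="mode=active-backup miimon=100"
BOOTPROTO=none
IPADDR=x.x.x.x
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-em1 (and the same for em2)
DEVICE=em1
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

# after restarting the network, the active slave and per-slave link
# state are visible in:
cat /proc/net/bonding/bond0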
This is the network side:
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
0 output errors, 0 collisions, 0 interface resets
This is the server side:
em1 Link encap:Ethernet HWaddr C8:1F:66:EB:46:FD
inet addr:x.x.x.x Bcast:x.x.x.255 Mask:255.255.255.0
inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:18866207931 (17.5 GiB) TX bytes:1135415651 (1.0 GiB)
Interrupt:34 Memory:d5000000-d57fffff
I need to run some fiber, but for now two nodes are plugged into one switch and the other two into a separate switch; both switches are on the same subnet. I'll work on cross-connecting the bonded interfaces to different switches.
On 6/11/14, 3:28 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
The first thing I would do is get a second NIC and configure
active-passive bonding. Network issues are too common to ignore in HA
setups. Ideally, I would span the links across separate stacked switches.
As for debugging the issue, I can only recommend looking closely at the
system and switch logs for clues.
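A few places worth checking on the server side, assuming RHEL 6 and the em1 interface from the output further up the thread (adjust names to match):

grep -iE 'totem|fenc|link' /var/log/messages   # corosync membership changes, fencing, link events
dmesg | grep -i em1                            # NIC driver link up/down messages
ethtool em1                                    # current negotiated speed and link state
ethtool -S em1 | grep -iE 'err|drop|crc'       # driver-level error counters (field names vary by driver)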
On 11/06/14 02:55 PM, Schaefer, Micah wrote:
I have the issue on two of my nodes. Each node has a single 10 Gb
connection: no bonding, single link. What else can I look at? I manage
the network too; I don't see any link-down notifications and don't see
any errors on the ports.
On 6/11/14, 2:29 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
On 11/06/14 02:21 PM, Schaefer, Micah wrote:
It failed again, even after deleting all the other failover domains.
Cluster conf
http://pastebin.com/jUXkwKS4
I turned corosync logging up to debug. How can I go about
determining whether it really is a network issue or something else?
Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
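One way to check whether totem traffic is actually being lost on the wire, assuming the stock cman/corosync defaults (UDP port 5405) and the em1 interface; the names and port are assumptions, so confirm them against cluster.conf first:

corosync-cfgtool -s                  # ring status and the address corosync is bound to
corosync-objctl | grep -i totem      # running totem settings (mcast address, port, token timeout)
tcpdump -i em1 -n udp port 5405      # watch whether totem packets keep flowing during the
                                     # window in which a member is declared failed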
That log output is, to me, *strongly* indicative of a network issue. It's
not likely switch-wide, as only one member was lost, but I would
certainly put my money on a network problem somewhere, somehow.
Do you use bonding?
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster