Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and was fenced, then node3 was fenced when node4 came back online.

The network topology is as follows:

switch1: node1, node3 (two connections)
switch2: node2, node4 (two connections)
switch1 <-> switch2
All on the same subnet.

The bonds are in active-backup mode with NIC link monitoring set to 100 milliseconds, and I saw no messages about link problems before the fence. I can see multicast between the servers using tcpdump. Any more ideas? (Rough sketches of the bond config and the multicast checks are appended below the quoted thread.)

On 6/12/14, 12:19 AM, "Digimer" <lists@xxxxxxxxxx> wrote:

>I considered that, but I would expect more nodes to be lost.
>
>On 12/06/14 12:12 AM, Netravali, Ganesh wrote:
>> Make sure multicast is enabled across the switches.
>>
>> -----Original Message-----
>> From: linux-cluster-bounces@xxxxxxxxxx
>> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Schaefer, Micah
>> Sent: Thursday, June 12, 2014 1:20 AM
>> To: linux clustering
>> Subject: Re: Node is randomly fenced
>>
>> Okay, I set up active/backup bonding and will watch for any change.
>>
>> This is the network side:
>> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
>> 0 output errors, 0 collisions, 0 interface resets
>>
>> This is the server side:
>>
>> em1   Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
>>       inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
>>       inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>       RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:1000
>>       RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
>>       Interrupt:34 Memory:d5000000-d57fffff
>>
>> I need to run some fiber, but for now two nodes are plugged into one
>> switch and the other two nodes into a separate switch; both switches
>> are on the same subnet. I'll work on cross-connecting the bonded
>> interfaces to different switches.
>>
>> On 6/11/14, 3:28 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>
>>> The first thing I would do is get a second NIC and configure
>>> active-passive bonding. Network issues are too common to ignore in HA
>>> setups. Ideally, I would span the links across separate stacked
>>> switches.
>>>
>>> As for debugging the issue, I can only recommend looking closely at
>>> the system and switch logs for clues.
>>>
>>> On 11/06/14 02:55 PM, Schaefer, Micah wrote:
>>>> I have the issue on two of my nodes. Each node has one 10 Gb
>>>> connection. No bonding, single link. What else can I look at? I
>>>> manage the network too. I don't see any link-down notifications and
>>>> don't see any errors on the ports.
>>>>
>>>> On 6/11/14, 2:29 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>>>
>>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote:
>>>>>> It failed again, even after deleting all the other failover
>>>>>> domains.
>>>>>>
>>>>>> Cluster conf:
>>>>>> http://pastebin.com/jUXkwKS4
>>>>>>
>>>>>> I turned corosync logging up to debug. How can I go about
>>>>>> determining whether it really is a network issue or something else?
>>>>>>
>>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
>>>>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
>>>>>> configuration.
>>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
>>>>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
>>>>>> membership and a new membership was formed.
>>>>>> Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
>>>>>> ip(10.70.100.101) ; members(old:4 left:1)
>>>>>
>>>>> This is, to me, *strongly* indicative of a network issue. It's not
>>>>> likely switch-wide as only one member was lost, but I would
>>>>> certainly put my money on a network problem somewhere, somehow.
>>>>>
>>>>> Do you use bonding?
>>>>>
>>>>> --
>>>>> Digimer
>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>> without access to education?
>>>>>
>>>>> --
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster@xxxxxxxxxx
>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>>
>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person
>>> without access to education?
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster@xxxxxxxxxx
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
>
>--
>Linux-cluster mailing list
>Linux-cluster@xxxxxxxxxx
>https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
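For reference, this is roughly how the bonds on node3 and node4 are set up (a RHEL-style ifcfg sketch, not a copy of the actual files; the second slave name em2 is a guess, and the address fields are placeholders). The relevant part is active-backup mode with miimon=100, i.e. the 100 ms link monitoring mentioned above:

# /etc/sysconfig/network-scripts/ifcfg-bond0  (addresses are placeholders)
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=x.x.x.x
NETMASK=255.255.255.0
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-em1  (likewise for the second slave, e.g. em2)
DEVICE=em1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes

# Verify which slave is active and that MII status is "up" for both:
cat /proc/net/bonding/bond0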
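The multicast check was along these lines; the interface name and the 224.0.0.0/4 match are just a broad filter, since the actual group address comes from cluster.conf / the cman default. omping is listed as a possible follow-up end-to-end test, not something already run:

# Watch corosync/cman multicast traffic on the cluster interface
tcpdump -i bond0 -n udp and net 224.0.0.0/4

# End-to-end multicast test: run the same command on all four nodes at once
# and check that every node reports both unicast and multicast responses
omping node1 node2 node3 node4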