Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and was fenced, then node3 was fenced when node4 came back online.

The network topology is as follows:

switch1: node1, node3 (two connections)
switch2: node2, node4 (two connections)
switch1 <-> switch2
All on the same subnet.

The bonds are in active-backup mode with NIC link monitoring set to 100 milliseconds, and I saw no messages about link problems before the fence. I can see multicast between the servers using tcpdump. Any more ideas? (Rough sketches of the bond config and the multicast checks are appended below the quoted thread.)

On 6/12/14, 12:19 AM, "Digimer" <lists@xxxxxxxxxx> wrote:

>I considered that, but I would expect more nodes to be lost.
>
>On 12/06/14 12:12 AM, Netravali, Ganesh wrote:
>> Make sure multicast is enabled across the switches.
>>
>> -----Original Message-----
>> From: linux-cluster-bounces@xxxxxxxxxx
>> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Schaefer, Micah
>> Sent: Thursday, June 12, 2014 1:20 AM
>> To: linux clustering
>> Subject: Re: Node is randomly fenced
>>
>> Okay, I set up active/backup bonding and will watch for any change.
>>
>> This is the network side:
>> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
>> 0 output errors, 0 collisions, 0 interface resets
>>
>> This is the server side:
>>
>> em1   Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
>>       inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
>>       inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
>>       UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>       RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
>>       TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
>>       collisions:0 txqueuelen:1000
>>       RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
>>       Interrupt:34 Memory:d5000000-d57fffff
>>
>> I need to run some fiber, but for now two nodes are plugged into one
>> switch and the other two nodes into a separate switch; both switches
>> are on the same subnet. I'll work on cross-connecting the bonded
>> interfaces to different switches.
>>
>> On 6/11/14, 3:28 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>
>>> The first thing I would do is get a second NIC and configure
>>> active-passive bonding. Network issues are too common to ignore in HA
>>> setups. Ideally, I would span the links across separate stacked
>>> switches.
>>>
>>> As for debugging the issue, I can only recommend looking closely at
>>> the system and switch logs for clues.
>>>
>>> On 11/06/14 02:55 PM, Schaefer, Micah wrote:
>>>> I have the issue on two of my nodes. Each node has one 10 Gb
>>>> connection. No bonding, single link. What else can I look at? I
>>>> manage the network too. I don't see any link-down notifications and
>>>> don't see any errors on the ports.
>>>>
>>>> On 6/11/14, 2:29 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>>>
>>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote:
>>>>>> It failed again, even after deleting all the other failover
>>>>>> domains.
>>>>>>
>>>>>> Cluster conf:
>>>>>> http://pastebin.com/jUXkwKS4
>>>>>>
>>>>>> I turned corosync logging up to debug. How can I go about
>>>>>> determining whether it really is a network issue or something else?
>>>>>>
>>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
>>>>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
>>>>>> configuration.
>>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
>>>>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
>>>>>> membership and a new membership was formed.
>>>>>> Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
>>>>>> ip(10.70.100.101) ; members(old:4 left:1)
>>>>>
>>>>> This is, to me, *strongly* indicative of a network issue. It's not
>>>>> likely switch-wide as only one member was lost, but I would
>>>>> certainly put my money on a network problem somewhere, somehow.
>>>>>
>>>>> Do you use bonding?
>>>>>
>>>>> --
>>>>> Digimer
>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>> without access to education?
>>>>>
>>>>> --
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster@xxxxxxxxxx
>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>
>>>>
>>>
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person
>>> without access to education?
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster@xxxxxxxxxx
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>
>
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
>
>--
>Linux-cluster mailing list
>Linux-cluster@xxxxxxxxxx
>https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
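For reference, this is roughly how the bonds on node3 and node4 are set up (a RHEL-style ifcfg sketch, not a copy of the actual files; the second slave name em2 is a guess, and the address fields are placeholders). The relevant part is active-backup mode with miimon=100, i.e. the 100 ms link monitoring mentioned above:

# /etc/sysconfig/network-scripts/ifcfg-bond0  (addresses are placeholders)
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=x.x.x.x
NETMASK=255.255.255.0
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-em1  (likewise for the second slave, e.g. em2)
DEVICE=em1
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes

# Verify which slave is active and that MII status is "up" for both:
cat /proc/net/bonding/bond0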
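The multicast check was along these lines; the interface name and the 224.0.0.0/4 match are just a broad filter, since the actual group address comes from cluster.conf / the cman default. omping is listed as a possible follow-up end-to-end test, not something already run:

# Watch corosync/cman multicast traffic on the cluster interface
tcpdump -i bond0 -n udp and net 224.0.0.0/4

# End-to-end multicast test: run the same command on all four nodes at once
# and check that every node reports both unicast and multicast responses
omping node1 node2 node3 node4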