To confirm: have you tried with the bonds set up so that each node has one
link into each switch? I just want to be sure you've ruled out all of the
network hardware. Please also confirm that you used mode=1 (active-passive)
bonding.

If this doesn't help, then I would say I was wrong in assuming it was
network related. The next thing I would look at is corosync; do you see any
messages about totem retransmits?
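For example (this assumes a RHEL 6-style cman/corosync cluster, a bond named
bond0, and the default corosync log path and mcastport; adjust the names to
match your setup), something like the following on each node would confirm
the bond mode and link monitoring, show any totem retransmits, and let you
watch the totem traffic itself:

  # Confirm "Bonding Mode: fault-tolerance (active-backup)" and
  # "MII Polling Interval (ms): 100", and check the slave link states
  cat /proc/net/bonding/bond0

  # Look for totem retransmit messages (the log path is an assumption;
  # check the logging settings in your cluster.conf / corosync config)
  grep -i retransmit /var/log/cluster/corosync.log

  # Watch the corosync multicast traffic on the cluster interface
  # (5405 is the default mcastport; replace bond0/5405 as needed)
  tcpdump -n -i bond0 udp port 5405

If retransmits show up around the time of the fence, that still points at
the network; if the logs are clean, then corosync itself is the next thing
to dig into.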
On 12/06/14 11:32 AM, Schaefer, Micah wrote:
> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and
> fenced, then node3 was fenced when node4 came back online. The network
> topology is as follows:
> switch1: node1, node3 (two connections)
> switch2: node2, node4 (two connections)
> switch1 <-> switch2
> All on the same subnet.
>
> I set up monitoring at 100 milliseconds on the NICs in active-backup
> mode, and saw no messages about link problems before the fence.
>
> I see multicast between the servers using tcpdump.
>
> Any more ideas?
>
> On 6/12/14, 12:19 AM, "Digimer" <lists@xxxxxxxxxx> wrote:
>
>> I considered that, but I would expect more nodes to be lost.
>>
>> On 12/06/14 12:12 AM, Netravali, Ganesh wrote:
>>> Make sure multicast is enabled across the switches.
>>>
>>> -----Original Message-----
>>> From: linux-cluster-bounces@xxxxxxxxxx
>>> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Schaefer, Micah
>>> Sent: Thursday, June 12, 2014 1:20 AM
>>> To: linux clustering
>>> Subject: Re: Node is randomly fenced
>>>
>>> Okay, I set up active/backup bonding and will watch for any change.
>>>
>>> This is the network side:
>>> 0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
>>> 0 output errors, 0 collisions, 0 interface resets
>>>
>>> This is the server side:
>>>
>>> em1     Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
>>>         inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
>>>         inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
>>>         UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>         RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
>>>         TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
>>>         collisions:0 txqueuelen:1000
>>>         RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
>>>         Interrupt:34 Memory:d5000000-d57fffff
>>>
>>> I need to run some fiber, but for now two nodes are plugged into one
>>> switch and the other two nodes into a separate switch on the same
>>> subnet. I'll work on cross-connecting the bonded interfaces to
>>> different switches.
>>>
>>> On 6/11/14, 3:28 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>>
>>>> The first thing I would do is get a second NIC and configure
>>>> active-passive bonding. Network issues are too common to ignore in
>>>> HA setups. Ideally, I would span the links across separate stacked
>>>> switches.
>>>>
>>>> As for debugging the issue, I can only recommend looking closely at
>>>> the system and switch logs for clues.
>>>>
>>>> On 11/06/14 02:55 PM, Schaefer, Micah wrote:
>>>>> I have the issue on two of my nodes. Each node has one 10 Gb
>>>>> connection: no bonding, single link. What else can I look at? I
>>>>> manage the network too. I don't see any link-down notifications and
>>>>> don't see any errors on the ports.
>>>>>
>>>>> On 6/11/14, 2:29 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>>>>
>>>>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote:
>>>>>>> It failed again, even after deleting all the other failover
>>>>>>> domains.
>>>>>>>
>>>>>>> Cluster conf:
>>>>>>> http://pastebin.com/jUXkwKS4
>>>>>>>
>>>>>>> I turned corosync output up to debug. How can I go about
>>>>>>> troubleshooting whether it really is a network issue or something
>>>>>>> else?
>>>>>>>
>>>>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
>>>>>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
>>>>>>> configuration.
>>>>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
>>>>>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
>>>>>>> membership and a new membership was formed.
>>>>>>> Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
>>>>>>> ip(10.70.100.101) ; members(old:4 left:1)
>>>>>>
>>>>>> This is, to me, *strongly* indicative of a network issue. It's not
>>>>>> likely switch-wide, as only one member was lost, but I would
>>>>>> certainly put my money on a network problem somewhere, somehow.
>>>>>>
>>>>>> Do you use bonding?
>>>>>>
>>>>>> --
>>>>>> Digimer
>>>>>> Papers and Projects: https://alteeve.ca/w/
>>>>>> What if the cure for cancer is trapped in the mind of a person
>>>>>> without access to education?
>>>>>
>>>>
>>>> --
>>>> Digimer
>>>> Papers and Projects: https://alteeve.ca/w/
>>>> What if the cure for cancer is trapped in the mind of a person
>>>> without access to education?
>>>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.ca/w/
>> What if the cure for cancer is trapped in the mind of a person without
>> access to education?
>

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster