Okay, I set up active/backup bonding and will watch for any changes (a
sketch of the configuration and the checks involved is appended at the
end of this message).

This is the network side:

     0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored
     0 output errors, 0 collisions, 0 interface resets

This is the server side:

em1       Link encap:Ethernet  HWaddr C8:1F:66:EB:46:FD
          inet addr:x.x.x.x  Bcast:x.x.x.255  Mask:255.255.255.0
          inet6 addr: fe80::ca1f:66ff:feeb:46fd/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:41274798 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4459245 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18866207931 (17.5 GiB)  TX bytes:1135415651 (1.0 GiB)
          Interrupt:34 Memory:d5000000-d57fffff

I need to run some fiber, but for now two nodes are plugged into one
switch and the other two nodes into a separate switch; both switches are
on the same subnet. I'll work on cross-connecting the bonded interfaces
to different switches.

On 6/11/14, 3:28 PM, "Digimer" <lists@xxxxxxxxxx> wrote:

>The first thing I would do is get a second NIC and configure
>active-passive bonding. Network issues are too common to ignore in HA
>setups. Ideally, I would span the links across separate stacked
>switches.
>
>As for debugging the issue, I can only recommend looking closely at the
>system and switch logs for clues.
>
>On 11/06/14 02:55 PM, Schaefer, Micah wrote:
>> I have the issue on two of my nodes. Each node has one 10 Gb
>> connection: no bonding, single link. What else can I look at? I
>> manage the network too. I don't see any link-down notifications and
>> don't see any errors on the ports.
>>
>> On 6/11/14, 2:29 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>
>>> On 11/06/14 02:21 PM, Schaefer, Micah wrote:
>>>> It failed again, even after deleting all the other failover domains.
>>>>
>>>> Cluster conf:
>>>> http://pastebin.com/jUXkwKS4
>>>>
>>>> I turned corosync output up to debug. How can I go about
>>>> troubleshooting whether it really is a network issue or something
>>>> else?
>>>>
>>>> Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
>>>> Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new
>>>> configuration.
>>>> Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
>>>> Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the
>>>> membership and a new membership was formed.
>>>> Jun 11 14:10:29 corosync [CPG   ] chosen downlist: sender r(0)
>>>> ip(10.70.100.101) ; members(old:4 left:1)
>>>
>>> This is, to me, *strongly* indicative of a network issue. It's not
>>> likely switch-wide, as only one member was lost, but I would
>>> certainly put my money on a network problem somewhere, somehow.
>>>
>>> Do you use bonding?
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
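
For the bonding setup mentioned at the top: a minimal sketch of an
active/backup (mode 1) bond in RHEL 6-style ifcfg files. The interface
names (em1/em2), bond name, and addressing are illustrative
placeholders, not taken from this thread:

    # /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bond itself
    DEVICE=bond0
    ONBOOT=yes
    BOOTPROTO=none
    IPADDR=10.70.100.101           # placeholder address
    NETMASK=255.255.255.0
    # mode=1 is active-backup; miimon=100 checks link state every 100 ms
    BONDING_OPTS="mode=1 miimon=100"

    # /etc/sysconfig/network-scripts/ifcfg-em1 -- first slave
    # (ifcfg-em2 is identical apart from DEVICE)
    DEVICE=em1
    ONBOOT=yes
    BOOTPROTO=none
    MASTER=bond0
    SLAVE=yes

Once the two slaves are cross-connected to different switches, as
planned above, losing either switch should leave the standby slave
carrying the traffic.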
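
And for watching the bond and the cluster membership for changes, a few
checks, assuming the stock RHEL 6 cman/corosync log location and the
placeholder names above:

    # which slave is currently active, and the per-slave failure counts
    cat /proc/net/bonding/bond0

    # physical link state as the driver sees it
    ethtool em1 | grep 'Link detected'

    # membership churn like the 'processor failed' events quoted above
    grep -E 'TOTEM|QUORUM' /var/log/cluster/corosync.log | tail -20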