This is all I see for TOTEM from node1:

Jun 12 11:07:10 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 12 11:07:22 corosync [QUORUM] Members[3]: 1 2 3
Jun 12 11:07:22 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:07:22 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
Jun 12 11:07:22 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:10:49 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:10:49 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:10:49 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:11:02 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:11:02 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:11:06 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:11:06 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:11:06 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:11:06 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:11:35 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 12 11:11:47 corosync [QUORUM] Members[3]: 1 2 4
Jun 12 11:11:47 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:11:47 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
Jun 12 11:11:47 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:15:18 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:15:18 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:15:18 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:15:31 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:15:31 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:15:31 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 11:15:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:15:33 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 12 11:15:33 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 12 11:15:33 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 12 12:36:20 corosync [QUORUM] Members[4]: 1 2 3 4

As far as the switches go, both are Cisco Catalyst 6509-E. No spanning-tree changes are happening, and all the ports for these servers have PortFast enabled. My switch logging level is set very high, and I have no messages matching those time frames or ports. TOTEM only reports "A processor joined or left the membership and a new membership was formed", which isn't enough detail to tell which node dropped or why. Also note that I did not have these issues until adding the new servers, node3 and node4, to the cluster.
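[Editorial note: the 12-second gap between "A processor failed" (11:07:10) and the new three-node membership (11:07:22) is consistent with cman's default TOTEM token timeout of 10000 ms plus the consensus window, i.e. these look like token losses rather than clean leaves. To make TOTEM say more, a minimal sketch for a cman-based cluster, assuming the standard /etc/cluster/cluster.conf layout; the 30000 ms token value is purely an illustration, not a recommendation:

    <!-- inside the existing <cluster> element in /etc/cluster/cluster.conf;
         remember to bump config_version when editing -->
    <logging debug="on"/>     <!-- verbose corosync logging, incl. TOTEM membership detail -->
    <totem token="30000"/>    <!-- example only: raise token timeout from the 10000 ms cman default -->

Then validate and push the change with the standard cman tooling:

    ccs_config_validate && cman_tool version -r

A longer token timeout only masks whatever is eating the heartbeat traffic, but the debug output should at least show which node stops answering and when.]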
Node1 and node2 do not fence each other (unless there is a real issue), and they are on different switches.

On 6/12/14, 12:36 PM, "Digimer" <lists@xxxxxxxxxx> wrote:

>On 12/06/14 12:33 PM, yvette hirth wrote:
>> On 06/12/2014 08:32 AM, Schaefer, Micah wrote:
>>
>>> Yesterday I added bonds on nodes 3 and 4. Today, node4 was active and
>>> fenced, then node3 was fenced when node4 came back online. The network
>>> topology is as follows:
>>>
>>> switch1: node1, node3 (two connections)
>>> switch2: node2, node4 (two connections)
>>> switch1 <-> switch2
>>> All on the same subnet
>>>
>>> I set up link monitoring at 100 milliseconds on the NICs in
>>> active-backup mode, and saw no messages about link problems before
>>> the fence.
>>>
>>> I see multicast between the servers using tcpdump.
>>>
>>> Any more ideas?
>>
>> spanning-tree scans/rebuilds happen on 10Gb circuits just like they do
>> on 1Gb circuits, and when they happen, traffic on the switches *can*
>> come to a grinding halt, depending upon the switch firmware and the
>> type of spanning-tree scan/rebuild being done.
>>
>> you may want to check your switch logs to see if any spanning-tree
>> rebuilds were being done at the time of the fence.
>>
>> just an idea, and hth
>> yvette hirth
>
>When I've seen this (I now disable STP entirely), it blocks all traffic,
>so I would expect multiple/all nodes to partition off on their own.
>Still, worth looking into. :)
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
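[Editorial note on the "I see multicast between the servers using tcpdump" check upthread: tcpdump only proves that some multicast frames arrive, not that none are being dropped. A stronger test is omping, which measures per-peer unicast and multicast loss; a minimal sketch, assuming omping is installed on all four nodes and the hostnames node1..node4 resolve (they are placeholders for this cluster's real names). Run the same command on every node at the same time:

    # each node reports unicast vs multicast loss to every listed peer
    omping -c 60 -i 1 node1 node2 node3 node4

If multicast loss appears only between nodes on opposite switches, IGMP snooping without an active querier on the inter-switch path is a common culprit on Catalyst gear, and would fit the trouble starting only after node3 and node4 were added.]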