It failed again, even after deleting all the other failover domains. Cluster conf: http://pastebin.com/jUXkwKS4

I turned corosync output up to debug. How can I go about troubleshooting whether it really is a network issue or something else?

Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
Jun 11 14:10:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:13:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:13:54 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:13:54 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:14:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:14:08 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:14:08 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:14:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:14:21 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:14:21 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:14:43 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:14:43 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:14:43 corosync [MAIN ] Completed service synchronization, ready to provide service.

On 6/4/14, 11:32 AM, "Schaefer, Micah" <Micah.Schaefer@xxxxxxxxxx> wrote:

>Logs: http://pastebin.com/QCh5FzZu
>
>I have one 10 Gb NIC connected.
>
>Here is the corosync log from node1. I see that it says "A processor
>failed, forming new configuration." I need to dig deeper, though.
>
>May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4
>May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4
>Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new
>configuration.
>Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4
>Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:4 left:1)
>Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4
>Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4
>Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>
>On 6/4/14, 11:13 AM, "Digimer" <lists@xxxxxxxxxx> wrote:
>
>>On 04/06/14 10:59 AM, Schaefer, Micah wrote:
>>> I have a 4 node cluster, running a single service group. I have been
>>> seeing node1 fence node3 while node3 is actively running the service
>>> group at random intervals.
>>>
>>> Rgmanager logs show no failures in service checks, and no other logs
>>> provide any useful information. How can I go about finding out why
>>> node1 is fencing node3?
>>>
>>> I currently set up the failover domain to be restricted and not
>>> include node3.
>>>
>>> cluster.conf: http://pastebin.com/xYy6xp6N
>>
>>Random fencing is almost always caused by network failures. Can you look
>>at the system logs, starting a little before the fence and continuing
>>until after the fence completes, and paste them here? I suspect you will
>>see corosync complaining.
>>
>>If this is true, do your switches support persistent multicast? Do you
>>use active/passive bonding? Have you tried a different switch/cable/NIC?
>>
>>--
>>Digimer
>>Papers and Projects: https://alteeve.ca/w/
>>What if the cure for cancer is trapped in the mind of a person without
>>access to education?
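
If the network is the suspect, it can be checked directly before swapping any hardware. A minimal sketch of the checks, using tools that ship with corosync/cman; eth0 and the node hostnames are placeholders for the actual 10 Gb interface and cluster nodes, and the real multicast address/port come from cluster.conf:

# Ring status as corosync sees it; "no faults" is healthy:
corosync-cfgtool -s

# NIC-level drops/errors whose counters climb around the fence timestamps:
ethtool -S eth0 | grep -iE 'drop|err|fifo'

# Watch the totem traffic itself (default multicast port is 5405);
# gaps here during a "processor failed" event point at the network:
tcpdump -i eth0 -n udp port 5405

# Sustained multicast delivery test between all four nodes; run the same
# command on every node at the same time and compare loss figures:
omping node1 node2 node3 node4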
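"A processor failed, forming new configuration" is logged when a node misses the totem token for longer than the token timeout, so the timeout the cluster is actually running with is also worth checking. If the link only stalls briefly, a larger token value will ride it out, though that masks a flaky network rather than fixing it. A sketch for a cman-based cluster; the 30000 ms value is illustrative, not a recommendation:

# Dump the totem values corosync is actually running with:
corosync-objctl | grep -i totem

# To raise the token timeout (in milliseconds), add to cluster.conf, e.g.
#   <totem token="30000"/>
# bump config_version, then validate and push the config to all nodes:
ccs_config_validate
cman_tool version -r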