It failed again, even after deleting all the other failover domains. Cluster conf: http://pastebin.com/jUXkwKS4

I turned corosync output up to debug. How can I go about troubleshooting whether it really is a network issue or something else?

Jun 09 13:06:59 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 11 14:10:17 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 11 14:10:29 corosync [QUORUM] Members[3]: 1 2 3
Jun 11 14:10:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:10:29 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
Jun 11 14:10:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:13:54 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:13:54 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:13:54 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:14:07 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:14:08 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:14:08 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:14:21 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:14:21 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:14:21 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 11 14:14:43 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 11 14:14:43 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 11 14:14:43 corosync [MAIN ] Completed service synchronization, ready to provide service.

On 6/4/14, 11:32 AM, "Schaefer, Micah" <Micah.Schaefer@xxxxxxxxxx> wrote:

>Logs: http://pastebin.com/QCh5FzZu
>
>I have one 10 Gb NIC connected.
>
>Here is the corosync log from node1. I see that it says "A processor
>failed, forming new configuration." I need to dig deeper, though.
>
>May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4
>May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4
>Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new
>configuration.
>Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4
>Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:4 left:1)
>Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the
>membership and a new membership was formed.
>Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4
>Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4
>Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0)
>ip(10.70.100.101) ; members(old:3 left:0)
>Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready
>to provide service.
>
>On 6/4/14, 11:13 AM, "Digimer" <lists@xxxxxxxxxx> wrote:
>
>>On 04/06/14 10:59 AM, Schaefer, Micah wrote:
>>> I have a 4 node cluster, running a single service group. I have been
>>> seeing node1 fence node3 while node3 is actively running the service
>>> group at random intervals.
>>>
>>> Rgmanager logs show no failures in service checks, and no other logs
>>> provide any useful information. How can I go about finding out why
>>> node1 is fencing node3?
>>>
>>> I currently set up the failover domain to be restricted and not
>>> include node3.
>>>
>>> cluster.conf: http://pastebin.com/xYy6xp6N
>>
>>Random fencing is almost always caused by network failures. Can you look
>>at the system logs, starting a little before the fence and continuing
>>until after the fence completes, and paste them here? I suspect you will
>>see corosync complaining.
>>
>>If this is true, do your switches support persistent multicast? Do you
>>use active/passive bonding? Have you tried a different switch/cable/NIC?
>>
>>--
>>Digimer
>>Papers and Projects: https://alteeve.ca/w/
>>What if the cure for cancer is trapped in the mind of a person without
>>access to education?
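
If the network is the suspect, it can be checked directly before swapping any hardware. A minimal sketch of the checks, using tools that ship with corosync/cman; eth0 and the node hostnames are placeholders for the actual 10 Gb interface and cluster nodes, and the real multicast address/port come from cluster.conf:

# Ring status as corosync sees it; "no faults" is healthy:
corosync-cfgtool -s

# NIC-level drops/errors whose counters climb around the fence timestamps:
ethtool -S eth0 | grep -iE 'drop|err|fifo'

# Watch the totem traffic itself (default multicast port is 5405);
# gaps here during a "processor failed" event point at the network:
tcpdump -i eth0 -n udp port 5405

# Sustained multicast delivery test between all four nodes; run the same
# command on every node at the same time and compare loss figures:
omping node1 node2 node3 node4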
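"A processor failed, forming new configuration" is logged when a node misses the totem token for longer than the token timeout, so the timeout the cluster is actually running with is also worth checking. If the link only stalls briefly, a larger token value will ride it out, though that masks a flaky network rather than fixing it. A sketch for a cman-based cluster; the 30000 ms value is illustrative, not a recommendation:

# Dump the totem values corosync is actually running with:
corosync-objctl | grep -i totem

# To raise the token timeout (in milliseconds), add to cluster.conf, e.g.
#   <totem token="30000"/>
# bump config_version, then validate and push the config to all nodes:
ccs_config_validate
cman_tool version -r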