Logs: http://pastebin.com/QCh5FzZu

I have one 10 Gb NIC connected. Here is the corosync log from node1; I
see that it says "A processor failed, forming new configuration." I need
to dig deeper, though.

May 27 10:03:49 corosync [QUORUM] Members[4]: 1 2 3 4
May 27 10:05:04 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 03 13:52:34 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 03 13:52:46 corosync [QUORUM] Members[3]: 1 2 4
Jun 03 13:52:46 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 03 13:52:46 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:4 left:1)
Jun 03 13:52:46 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 03 13:56:14 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 03 13:56:14 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 03 13:56:14 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 03 13:56:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 03 13:56:28 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 03 13:56:28 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 03 13:56:41 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 03 13:56:41 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 03 13:56:41 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 03 13:57:04 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 03 13:57:04 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 03 13:57:04 corosync [MAIN ] Completed service synchronization, ready to provide service.
Jun 03 15:12:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 03 15:12:09 corosync [QUORUM] Members[4]: 1 2 3 4
Jun 03 15:12:09 corosync [CPG ] chosen downlist: sender r(0) ip(10.70.100.101) ; members(old:3 left:0)
Jun 03 15:12:09 corosync [MAIN ] Completed service synchronization, ready to provide service.

Regards,

-------
Micah Schaefer
JHU/APL
ITSD/ITC
240-228-1148 (x81148)


On 6/4/14, 11:13 AM, "Digimer" <lists@xxxxxxxxxx> wrote:

>On 04/06/14 10:59 AM, Schaefer, Micah wrote:
>> I have a 4 node cluster, running a single service group. I have been
>> seeing node1 fence node3 while node3 is actively running the service
>> group at random intervals.
>>
>> Rgmanager logs show no failures in service checks, and no other logs
>> provide any useful information. How can I go about finding out why
>> node1 is fencing node3?
>>
>> I currently set up the failover domain to be restricted and not
>> include node3.
>>
>> cluster.conf: http://pastebin.com/xYy6xp6N
>
>Random fencing is almost always caused by network failures. Can you look
>at the system logs, starting a little before the fence and continuing
>until after the fence completes, and paste them here? I suspect you will
>see corosync complaining.
>
>If this is true, do your switches support persistent multicast? Do you
>use active/passive bonding? Have you tried a different switch/cable/NIC?
>
>--
>Digimer
>Papers and Projects: https://alteeve.ca/w/
>What if the cure for cancer is trapped in the mind of a person without
>access to education?
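
On the multicast question: a quick way to check whether multicast is
actually being delivered between the nodes is omping. This is just a
sketch; it assumes the omping package is installed on all four nodes,
and node1..node4 stand in for the real hostnames. Run it on every node
at the same time:

    # Run simultaneously on each of the four nodes; omping reports
    # reply loss for both unicast and multicast between all listed hosts.
    omping -c 60 -i 1 node1 node2 node3 node4

If the unicast replies keep flowing but the multicast replies stop after
a few minutes, the switch is most likely dropping the group (e.g. IGMP
snooping enabled with no querier), which would fit random "processor
failed" events like the ones above.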
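
To dig deeper on the corosync side, it may also help to turn up logging
and, as an experiment rather than a fix, raise the totem token timeout
so that brief network stalls stop triggering fences. A minimal sketch of
the relevant cluster.conf fragment (the token value is illustrative, in
milliseconds):

    <!-- fragment of /etc/cluster/cluster.conf; remember to bump
         config_version at the top of the file -->
    <logging debug="on"/>
    <totem token="20000"/>

After incrementing config_version, "cman_tool version -r" should push
the updated config out to the other nodes.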

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster