Re: Node is randomly fenced

I just found that the clock on node1 was off by about a minute and a half
compared to the rest of the nodes.

I am running ntp, so I am not sure why the time wasn’t synced up. Could
node1, being behind, have concluded that it was not receiving updates from
the other nodes?
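
A quick way to quantify the skew is to compare epoch timestamps across
nodes. A minimal sketch (here both timestamps are taken locally, so the
reported skew is ~0; on a real cluster the first would come from a peer,
e.g. `remote=$(ssh node2 date +%s)`, assuming key-based ssh):

```shell
# Minimal clock-skew check (sketch). On the cluster, fetch the first
# timestamp from a peer instead, e.g.:
#   remote=$(ssh node2 date +%s)
remote=$(date +%s)
local_t=$(date +%s)
skew=$((local_t - remote))
echo "clock skew: ${skew}s"

# Worth noting: corosync's totem timeouts are relative timers, so
# wall-clock skew by itself shouldn't trip membership, but a ~90s
# offset suggests ntpd isn't actually disciplining node1's clock.
# 'ntpq -p' should show a selected peer (marked '*'); if none is
# selected, ntpd has no usable sync source.
```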







On 6/12/14, 1:29 PM, "Digimer" <lists@xxxxxxxxxx> wrote:

>Even if the token changes stop the immediate fencing, don't leave it
>please. There is something fundamentally wrong that you need to
>identify/fix.
>
>Keep us posted!
>
>On 12/06/14 01:24 PM, Schaefer, Micah wrote:
>> The servers do not run any tasks other than the tasks in the cluster
>> service group.
>>
>> Nodes 3 and 4 are physical servers with a lot of horsepower, and nodes 1
>> and 2 are virtual machines with far fewer resources available.
>>
>> I adjusted the token settings and will watch for any change.
>>
>>
>>
>>
>>
>>
>>
>>
>> On 6/12/14, 1:08 PM, "Digimer" <lists@xxxxxxxxxx> wrote:
>>
>>> On 12/06/14 12:48 PM, Schaefer, Micah wrote:
>>>> As far as the switches go, both are Cisco Catalyst 6509-E; no
>>>> spanning-tree changes are happening, and all the ports for these
>>>> servers have port-fast enabled. My switch logging level is very high,
>>>> and I have no messages relating to the time frames or ports.
>>>>
>>>> TOTEM reports that “A processor joined or left the membership…”, but
>>>> that isn’t enough detail.
>>>>
>>>> Also note that I did not have these issues until adding the new
>>>> servers, node3 and node4, to the cluster. Node1 and node2 do not fence
>>>> each other (unless there is a real issue), and they are on different
>>>> switches.
>>>
>>> Then I can't imagine it being the network anymore. Seeing as both
>>> nodes 3 and 4 get fenced, it's likely not hardware either. Are the
>>> workloads on 3 and 4 much higher (or are the machines much slower)
>>> than on 1 and 2? I'm wondering if the nodes are simply not keeping up
>>> with corosync traffic. You might try adjusting the corosync token
>>> timeout and retransmit counts to see if that reduces the node losses.
>>>
>>> --
>>> Digimer
>>> Papers and Projects: https://alteeve.ca/w/
>>> What if the cure for cancer is trapped in the mind of a person without
>>> access to education?
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster@xxxxxxxxxx
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>
>>

