On 04/12/15 09:14 AM, Kelvin Edmison wrote:
>
> On 12/03/2015 09:31 PM, Digimer wrote:
>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>> I am hoping that someone can help me understand the problems I'm
>>>>> having with linux clustering for VMs.
>>>>>
>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure
>>>>> that a service is always available. The hosts and guests are both
>>>>> RHEL 6.7. The goal is to have only one of the two VMs running at a
>>>>> time.
>>>>>
>>>>> The configuration works when we test/simulate VM deaths, graceful
>>>>> VM host shutdowns, and administrative switchovers (i.e. clusvcadm -r).
>>>>>
>>>>> However, when we simulate the sudden isolation of host A (e.g.
>>>>> ifdown eth0), two things happen:
>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors
>>>>>    appear in the logs on host B
>>>>> 2) when the 'failed' node is returned to service, the cman service
>>>>>    on host B dies.
>>>>
>>>> If the node's host is dead, then there is no way for the survivor to
>>>> determine the state of the lost VM node. The cluster is not allowed
>>>> to take "no answer" as confirmation of fence success.
>>>>
>>>> If your hosts have IPMI, then you could add fence_ipmilan as a backup
>>>> method where, if fence_xvm fails, it moves on and reboots the host
>>>> itself.
>>>
>>> Thank you for the suggestion. The hosts do have IPMI. I'll explore it,
>>> but I'm a little concerned about what it means for the other
>>> non-clustered VM workloads that exist on these two servers.
>>>
>>> Do you have any thoughts as to why host B's cman process is dying when
>>> 'host A' returns?
>>>
>>> Thanks,
>>>   Kelvin
>>
>> It's not dying, it's blocking. When a node is lost, dlm blocks until
>> fenced tells it that the fence was successful. If fenced can't contact
>> the lost node's fence method(s), then it doesn't succeed and dlm stays
>> blocked. To anything that uses DLM, like rgmanager, it appears as though
>> the host is hung, but that is by design. The logic is that, as bad as it
>> is to hang, it's better than risking a split-brain.
>
> When I said the cman service is dying, I should have qualified it
> further. I mean that the corosync process is no longer running (ps -ef |
> grep corosync does not show it), and after recovering the failed host A,
> manual intervention (service cman start) was required on host B to
> recover full cluster services.
>
> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do
>     printf "%-12s " $SERVICE; service $SERVICE status; done
> ricci        ricci (pid 5469) is running...
> fence_virtd  fence_virtd (pid 4862) is running...
> cman         Found stale pid file
> rgmanager    rgmanager (pid 5366) is running...
>
> Thanks,
>   Kelvin

Oh, now that is interesting...

You'll want input from Fabio, Chrissie or one of the other core devs, I
suspect. If this is RHEL proper, can you open a rhbz ticket? If it's
CentOS, and you can reproduce it reliably, can you start a new thread
with the reproducer?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
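
A minimal cluster.conf sketch of the layered fencing suggested above, with
fence_xvm as the first method and fence_ipmilan as the fallback that fenced
only tries if fence_xvm fails. All node names, device names, addresses and
credentials are placeholders, and exact agent attributes can differ between
cluster versions, so treat it as an outline rather than a drop-in config:

  <cluster name="vmcluster" config_version="2">
    <cman two_node="1" expected_votes="1"/>
    <clusternodes>
      <clusternode name="vm-a" nodeid="1">
        <fence>
          <!-- Method 1: ask fence_virtd on the hosts to kill the guest -->
          <method name="virt">
            <device name="xvm" domain="vm-a"/>
          </method>
          <!-- Method 2: if that fails, power-cycle host A itself -->
          <method name="ipmi">
            <device name="ipmi-host-a" action="reboot"/>
          </method>
        </fence>
      </clusternode>
      <!-- vm-b is defined the same way, pointing at host B's IPMI -->
    </clusternodes>
    <fencedevices>
      <fencedevice name="xvm" agent="fence_xvm"
                   key_file="/etc/cluster/fence_xvm.key"/>
      <fencedevice name="ipmi-host-a" agent="fence_ipmilan"
                   ipaddr="10.0.0.1" login="admin" passwd="secret" lanplus="1"/>
    </fencedevices>
  </cluster>

Note that the fallback reboots the whole host, which is exactly the concern
raised above: any non-clustered guests on that machine go down with it.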
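
When the VM on host B fails to start and fence_xvm keeps erroring, it can
help to confirm by hand whether the surviving side can reach an answering
fence_virtd at all. Something along these lines, run on the surviving node
(exact options may vary by version, see fence_xvm -h and man fence_xvm):

  # List the domains that an answering fence_virtd knows about.
  fence_xvm -o list

  # Query the state of one guest by name (the name is a placeholder).
  fence_xvm -o status -H vm-a

If host A is truly off the network, requests for the guest that lived there
will time out, which is why fenced never reports success and the relocation
never happens.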
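
To watch the blocking behaviour described above from the surviving node's
point of view while the fence is still pending, the cman-era tools give a
reasonable picture. Again a sketch rather than a recipe, since output
formats vary between releases:

  cman_tool nodes     # membership as cman/corosync currently sees it
  cman_tool status    # quorum, expected votes, node state
  fence_tool ls       # the fence domain; a pending fence shows in the wait state
  dlm_tool ls         # DLM lockspaces; one waiting on a fence result stays blocked

None of that explains corosync exiting outright on host B, though. For that,
the cluster logs (typically /var/log/cluster/ and /var/log/messages on RHEL 6)
from the moment host A comes back are what the core devs would most likely
want attached to a rhbz ticket or a reproducer thread.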