On 04/12/15 01:52 PM, Kelvin Edmison wrote:
>
>
> On 12/04/2015 12:49 PM, Digimer wrote:
>> On 04/12/15 09:14 AM, Kelvin Edmison wrote:
>>>
>>>
>>> On 12/03/2015 09:31 PM, Digimer wrote:
>>>> On 03/12/15 08:39 PM, Kelvin Edmison wrote:
>>>>> On 12/03/2015 06:14 PM, Digimer wrote:
>>>>>> On 03/12/15 02:19 PM, Kelvin Edmison wrote:
>>>>>>> I am hoping that someone can help me understand the problems I'm
>>>>>>> having with Linux clustering for VMs.
>>>>>>>
>>>>>>> I am clustering 2 VMs on two separate VM hosts, trying to ensure
>>>>>>> that a service is always available. The hosts and guests are both
>>>>>>> RHEL 6.7. The goal is to have only one of the two VMs running at
>>>>>>> a time.
>>>>>>>
>>>>>>> The configuration works when we test/simulate VM deaths, graceful
>>>>>>> VM host shutdowns, and administrative switchovers (i.e.
>>>>>>> clusvcadm -r).
>>>>>>>
>>>>>>> However, when we simulate the sudden isolation of host A (e.g.
>>>>>>> ifdown eth0), two things happen:
>>>>>>> 1) the VM on host B does not start, and repeated fence_xvm errors
>>>>>>> appear in the logs on host B
>>>>>>> 2) when the 'failed' node is returned to service, the cman
>>>>>>> service on host B dies.
>>>>>> If the node's host is dead, then there is no way for the survivor
>>>>>> to determine the state of the lost VM node. The cluster is not
>>>>>> allowed to take "no answer" as confirmation of fence success.
>>>>>>
>>>>>> If your hosts have IPMI, then you could add fence_ipmilan as a
>>>>>> backup method where, if fence_xvm fails, it moves on and reboots
>>>>>> the host itself.
>>>>> Thank you for the suggestion. The hosts do have IPMI. I'll explore
>>>>> it, but I'm a little concerned about what it means for the other
>>>>> non-clustered VM workloads that exist on these two servers.
>>>>>
>>>>> Do you have any thoughts as to why host B's cman process is dying
>>>>> when 'host A' returns?
>>>>>
>>>>> Thanks,
>>>>>   Kelvin
>>>> It's not dying, it's blocking. When a node is lost, DLM blocks until
>>>> fenced tells it that the fence was successful. If fenced can't
>>>> contact the lost node's fence method(s), then it doesn't succeed and
>>>> DLM stays blocked. To anything that uses DLM, like rgmanager, it
>>>> appears like the host is hung, but it is by design. The logic is
>>>> that, as bad as it is to hang, it's better than risking a
>>>> split-brain.
>>> When I said the cman service is dying, I should have further
>>> qualified it. I mean that the corosync process is no longer running
>>> (ps -ef | grep corosync does not show it), and after recovering the
>>> failed host A, manual intervention (service cman start) was required
>>> on host B to recover full cluster services.
>>>
>>> [root@host2 ~]# for SERVICE in ricci fence_virtd cman rgmanager; do
>>>     printf "%-12s " $SERVICE; service $SERVICE status; done
>>> ricci        ricci (pid 5469) is running...
>>> fence_virtd  fence_virtd (pid 4862) is running...
>>> cman         Found stale pid file
>>> rgmanager    rgmanager (pid 5366) is running...
>>>
>>>
>>> Thanks,
>>>   Kelvin
>> Oh, now that is interesting...
>>
>> You'll want input from Fabio, Chrissie or one of the other core devs,
>> I suspect.
>>
>> If this is RHEL proper, can you open a rhbz ticket? If it's CentOS,
>> and if you can reproduce it reliably, can you create a new thread with
>> the reproducer?
> It's RHEL proper in both host and guest, and we can reproduce it
> reliably.

Excellent! Please reply here with the rhbz#. I'm keen to see what comes
of it.
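
For reference, a minimal sketch of how the layered fencing suggested
earlier in the thread might look in /etc/cluster/cluster.conf on a
RHEL 6 cman cluster. The node names, IPMI addresses and credentials
below are placeholders, not taken from this setup; adjust them for your
environment. Methods inside a <fence> block are tried in order, so
fence_xvm is attempted first and fence_ipmilan only if it fails:

    <clusternodes>
      <clusternode name="vmnode1" nodeid="1">
        <fence>
          <!-- first, try to fence the guest via fence_virtd on its host -->
          <method name="virt">
            <device name="xvm" domain="vmnode1"/>
          </method>
          <!-- if that fails, power-cycle the whole host over IPMI -->
          <method name="ipmi">
            <device name="ipmi_hostA" action="reboot"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="vmnode2" nodeid="2">
        <fence>
          <method name="virt">
            <device name="xvm" domain="vmnode2"/>
          </method>
          <method name="ipmi">
            <device name="ipmi_hostB" action="reboot"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice name="xvm" agent="fence_xvm"/>
      <fencedevice name="ipmi_hostA" agent="fence_ipmilan"
                   ipaddr="192.0.2.10" login="admin" passwd="secret" lanplus="1"/>
      <fencedevice name="ipmi_hostB" agent="fence_ipmilan"
                   ipaddr="192.0.2.11" login="admin" passwd="secret" lanplus="1"/>
    </fencedevices>

After editing, bump config_version and propagate the change with
cman_tool version -r. Keep in mind the caveat raised above: the IPMI
fallback reboots the entire host, taking any non-clustered guests on it
down as well, so it is a last resort rather than a replacement for
fence_xvm.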
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?