On 6/12/07, Marc Grimme <grimme@xxxxxxx> wrote:
> On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> > On 6/11/07, Robert Gil <Robert.Gil@xxxxxxxxxxxxxx> wrote:
> > > If ilo itself is off, fencing doesn't work.
> >
> > Isn't there any timeout setting such that if the ILO doesn't respond
> > for a certain amount of time, it is treated as fenced and the node is
> > considered to be dead and the failover takes place?
>
> As far as I remember there is only a TCP timeout when establishing the
> connection to the ilo-card, and it takes a very long time to occur (that's
> a default setting and takes minutes). I'm not sure how and where to set it.

We did wait for quite some time and followed the messages appearing in /var/log/messages. It kept on trying to contact the ILO of the node which was powered off.

> But we've had this discussion (especially with ILO cards) nearly every
> time we have used them, and for that and other reasons we had to build
> our own fence_ilo agent. I'm quite sure that we solved the timeout
> problem in the end. It is set to 10 seconds by default (Config.timeout).
> You can find it at
> http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm
> or use the yum/up2date channel directly, as described here:
> http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/
> Then install "comoonics-bootimage-fenceclient-ilo" and there you go.

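
To make the timeout point concrete, here is a minimal sketch of a connection attempt that gives up after a fixed number of seconds instead of waiting for the operating system's default TCP timeout. It is not the code of either fence_ilo agent; the function name `ilo_reachable`, the port 443 and the 10-second default are assumptions chosen for illustration (the 10 seconds only mirrors the Config.timeout value mentioned above).

```python
import socket


def ilo_reachable(host, port=443, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: port 443 and the 10 s default are assumptions,
    not values taken from any fence agent.
    """
    try:
        # create_connection applies `timeout` to the connect attempt itself,
        # so a powered-off card fails after `timeout` seconds, not minutes.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and connect timeouts.
        return False
```

A `False` result only says the card did not answer in time. It does not prove that the node is powered off, so it cannot be treated as a successful fence on its own.
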
Thanks, I will try and see if they agree to use this version.

> > > Did you add ilo as a fence device? And create a user? You create a user
> > > in the ilo for that blade, not on the chassis. You have to reboot the
> > > blade to get to the ilo manager.
> >
> > Yes, had added respective ILOs as fence devices for both the servers
> > and created users also.
>
> We are doing so as well. Always a power user for ilo devices. We are also
> automating this with the ilo client. There is an undocumented switch -x in
> the fence_ilo client referenced above where you reference a file that might
> look as follows and you'll have your user.
>
> > I just want to make sure that automatic fencing happens and failover
> > takes place even when there is a complete power failure for one node
>
> If the timeout thing works you'll also need a second fence mechanism. You
> might think about using fence_manual as last resort, to bring that cluster
> back online after power failure and then after manual intervention.
>
> Regards Marc.

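
The idea of a second fence mechanism can be sketched as an ordered list of methods that are tried one after another until one confirms the node is off. This is only an illustration of the fallback logic under stated assumptions, not how the cluster software is implemented; `fence_with_fallback`, `ilo_fence` and `manual_fence` are hypothetical names, and the two stand-in functions merely simulate an unreachable ILO and an operator confirmation.

```python
def fence_with_fallback(node, methods):
    """Try each (name, fence_fn) pair in order.

    Returns the name of the first method that confirms the node is fenced,
    or None if every method fails. fence_fn(node) must return True only
    when the node is known to be off.
    """
    for name, fence_fn in methods:
        try:
            if fence_fn(node):
                return name
        except Exception as exc:
            # A failed or timed-out method is logged, then the next one is tried.
            print(f"fence method {name!r} failed for {node}: {exc}")
    return None


def ilo_fence(node):
    # Stand-in for an ILO agent whose card lost power together with the node.
    raise TimeoutError("ILO did not answer within the timeout")


def manual_fence(node):
    # Stand-in for an operator confirming that the node is powered off.
    return True
```

Calling `fence_with_fallback("node2", [("ilo", ilo_fence), ("manual", manual_fence)])` falls through to the manual method once the ILO attempt times out. Note that the timeout itself is never counted as a successful fence.
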

Just wondering if there is any undocumented option / switch which will
force an automatic failover to one node if the ILO on the other one fails
to respond within a certain time period (maybe a few minutes).

Regards,
--
Manish

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster