Re: Not restarting "max_restart" times before relocating failed service

emmanuel segura <emi2fast@xxxxxxxxx> · Wed, 31 Oct 2012 10:23:20 +0100

Hello

Maybe you are missing recovery="restart" in your services.
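
For reference, here is a minimal sketch of a service stanza that uses restart recovery. The service name, domain, and resource refs are copied from the configuration quoted below; the restart_expire_time value of 300 seconds is only an illustrative assumption and is not in the original post. As far as I understand, rgmanager uses that attribute as the window after which a counted restart is forgotten, so please check it against the documentation for your rgmanager version.

```xml
<!-- Sketch only: restart_expire_time="300" is an assumed, illustrative value -->
<service autostart="0" domain="mydomain" exclusive="0"
         max_restarts="5" restart_expire_time="300"
         name="mgmt" recovery="restart">
        <script ref="myHaAgent"/>
        <ip ref="192.168.51.51"/>
</service>
```

Note that in the log quoted below the restart attempt itself failed to start (#68) and the relocation (#71) followed immediately, so max_restarts may only count restarts that succeed. That is an assumption on my part and worth verifying.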

2012/10/31 Parvez Shaikh <parvez.h.shaikh@xxxxxxxxx>

Hi Digimer,

cman_tool version gives the following -

6.2.0 config 22

Cluster.conf -

<?xml version="1.0"?>
<cluster alias="PARVEZ" config_version="22" name="PARVEZ">
        <clusternodes>
                <clusternode name="myblade2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="2" missing_as_off="1" name="BladeCenterFencing-1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="myblade1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device blade="1" missing_as_off="1" name="BladeCenterFencing-1"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_bladecenter" ipaddr="mm-1.mydomain.com" login="XXXX" name="BladeCenterFencing-1" passwd="XXXXX" shell_timeout="10"/>
        </fencedevices>
        <rm>
                <resources>
                        <script file="/localhome/my/my_ha" name="myHaAgent"/>
                        <ip address="192.168.51.51" monitor_link="1"/>
                </resources>
                <failoverdomains>
                        <failoverdomain name="mydomain" nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="myblade2" priority="2"/>
                                <failoverdomainnode name="myblade1" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <service autostart="0" domain="mydomain" exclusive="0" max_restarts="5" name="mgmt" recovery="restart">
                        <script ref="myHaAgent"/>
                        <ip ref="192.168.51.51"/>
                </service>
        </rm>
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="0"/>
</cluster>

Thanks,
Parvez

On Tue, Oct 30, 2012 at 9:25 PM, Digimer <lists@xxxxxxxxxx> wrote:

On 10/30/2012 01:54 AM, Parvez Shaikh wrote:

> Hi experts,
>
> I have defined a service as follows in cluster.conf -
>
>                 <service autostart="0" domain="mydomain" exclusive="0" max_restarts="5" name="mgmt" recovery="restart">
>                         <script ref="myHaAgent"/>
>                         <ip ref="192.168.51.51"/>
>                 </service>
>
> I mentioned max_restarts=5 hoping that if cluster fails to start service
> 5 times, then it will relocate to another cluster node in failover domain.
>
> To check this, I turned down NIC hosting service's floating IP and got
> following logs -
>
> Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> Link for eth1: Not detected
> Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
> Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
> Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> status on ip "192.168.51.51" returned 1 (generic error)
> Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
> *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering*
> Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Recovering failed service service:mgmt
> Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> start on ip "192.168.51.51" returned 1 (generic error)
> Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #68: Failed to start service:mgmt; return value: 1
> Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Stopping service service:mgmt
> *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is recovering
> Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #71: Relocating failed service service:mgmt*
> Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
> Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt is stopped
>
> But from the log it appears that cluster tried to restart service only
> ONCE before relocating.
>
> I was expecting cluster to retry starting this service five times on the
> same node before relocating
>
> Can anybody correct my understanding?
>
> Thanks,
> Parvez

What version? Please paste your full cluster.conf.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
This is my life and I live it for as long as God wills
