Output of 'rpm -q cman':
cman-2.0.115-34.el5
There is no http mentioned in the fencedevice entry; I think my email client is inserting it.
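The actual attribute in the file reads just:

    ipaddr="mm-1.mydomain.com" login="XXXX"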
Thanks,
Parvez
On Wed, Oct 31, 2012 at 10:14 AM, Digimer <lists@xxxxxxxxxx> wrote:
What does 'rpm -q cman' return?
This looks very odd;

> <fencedevice agent="fence_bladecenter"
> ipaddr="mm-1.mydomain.com <http://mm-1.mydomain.com>" login="XXXX"

Please remove this for now;

> <fence_daemon clean_start="1" post_fail_delay="0"
> post_join_delay="0"/>

In general, you don't want to assume a clean start. It's asking for
trouble. The default delays are also sane. You can always come back to
this later after this issue is resolved, if you wish.
On 10/30/2012 09:20 PM, Parvez Shaikh wrote:
> Hi Digimer,
>
> cman_tool version gives the following -
>
> 6.2.0 config 22
>
> Cluster.conf -
>
> <?xml version="1.0"?>
> <cluster alias="PARVEZ" config_version="22" name="PARVEZ">
> <clusternodes>
> <clusternode name="myblade2" nodeid="2" votes="1">
> <fence>
> <method name="1">
> <device blade="2"
> missing_as_off="1" name="BladeCenterFencing-1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="myblade1" nodeid="1" votes="1">
> <fence>
> <method name="1">
> <device blade="1"
> missing_as_off="1" name="BladeCenterFencing-1"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1"/>
> <fencedevices>
> <fencedevice agent="fence_bladecenter"
> ipaddr="mm-1.mydomain.com <http://mm-1.mydomain.com>" login="XXXX"
> name="BladeCenterFencing-1" passwd="XXXXX" shell_timeout="10"/>
> </fencedevices>
> <rm>
> <resources>
> <script file="/localhome/my/my_ha"
> name="myHaAgent"/>
> <ip address="192.168.51.51" monitor_link="1"/>
> </resources>
> <failoverdomains>
> <failoverdomain name="mydomain" nofailback="1"
> ordered="1" restricted="1">
> <failoverdomainnode name="myblade2"
> priority="2"/>
> <failoverdomainnode name="myblade1"
> priority="1"/>
> </failoverdomain>
> </failoverdomains>
> <service autostart="0" domain="mydomain" exclusive="0"
> max_restarts="5" name="mgmt" recovery="restart">
> <script ref="myHaAgent"/>
> <ip ref="192.168.51.51"/>
> </service>
> </rm>
> <fence_daemon clean_start="1" post_fail_delay="0"
> post_join_delay="0"/>
> </cluster>
>
> Thanks,
> Parvez
>
> On Tue, Oct 30, 2012 at 9:25 PM, Digimer <lists@xxxxxxxxxx> wrote:
>
> On 10/30/2012 01:54 AM, Parvez Shaikh wrote:
> > Hi experts,
> >
> > I have defined a service as follows in cluster.conf -
> >
> > <service autostart="0" domain="mydomain" exclusive="0"
> > max_restarts="5" name="mgmt" recovery="restart">
> > <script ref="myHaAgent"/>
> > <ip ref="192.168.51.51"/>
> > </service>
> >
> > I mentioned max_restarts=5 hoping that if the cluster fails to start
> > the service 5 times, it will relocate it to another cluster node in
> > the failover domain.
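> >
> > (From what I understand of the rgmanager docs, max_restarts is meant
> > to be paired with restart_expire_time, something like -
> >
> > <service autostart="0" domain="mydomain" exclusive="0"
> > max_restarts="5" restart_expire_time="300" name="mgmt"
> > recovery="restart">
> >
> > - i.e. tolerate at most 5 restarts within a 300-second window before
> > giving up; the 300 here is just an example value.)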
> >
> > To check this, I brought down the NIC hosting the service's floating
> > IP and got the following logs -
> >
> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> Link for eth1: Not
> > detected
> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
> > Oct 30 14:11:49 XXXX clurgmgrd: [10753]: <warning> No link on eth1...
> > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> status on ip
> > "192.168.51.51" returned 1 (generic error)
> > Oct 30 14:11:49 XXXX clurgmgrd[10753]: <notice> Stopping service
> > service:mgmt
> > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt
> > is recovering*
> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Recovering failed
> > service service:mgmt
> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> start on ip
> > "192.168.51.51" returned 1 (generic error)
> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #68: Failed to start
> > service:mgmt; return value: 1
> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Stopping service
> > service:mgmt
> > *Oct 30 14:12:00 XXXX clurgmgrd[10753]: <notice> Service service:mgmt
> > is recovering
> > Oct 30 14:12:00 XXXX clurgmgrd[10753]: <warning> #71: Relocating
> > failed service service:mgmt*
> > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt
> > is stopped
> > Oct 30 14:12:01 XXXX clurgmgrd[10753]: <notice> Service service:mgmt
> > is stopped
> >
> > But from the logs it appears that the cluster tried to restart the
> > service only ONCE before relocating.
> >
> > I was expecting the cluster to retry starting this service five
> > times on the same node before relocating.
> >
> > Can anybody correct my understanding?
> >
> > Thanks,
> > Parvez
>
> What version? Please paste your full cluster.conf.
>
> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person without
> access to education?
>
>
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster