Re: why do all services stop when a node reboots?


 



Hello all,

Following up on the problem, can anyone explain this?

All of the commands below were run within roughly one minute.

Disable the service:
[root@NODE2 log]# clusvcadm -d BBDD
Local machine disabling service:BBDD...Yes

Enable the service:
[root@NODE2 log]# clusvcadm -e BBDD
Local machine trying to enable service:BBDD...Success
service:BBDD is now running on node2

It's OK, the service is running on node2. Now try to relocate it to node1:
[root@NODE2 log]# clusvcadm -r BBDD -m node1
Trying to relocate service:BBDD to node1...Success

It works! Fine, now try to relocate it back to node2:


service:BBDD is now running on node1
[root@NODE2 log]# clusvcadm -r BBDD -m node2
Trying to relocate service:BBDD to node2...Success


It works again! I can't believe it. Try to relocate to node1 once more:

service:BBDD is now running on node2
[root@NODE2 log]# clusvcadm -r BBDD -m node1
Trying to relocate service:BBDD to node1...Failure

Oops, it fails! Why? Why did it work 30 seconds earlier and now it fails?

In this situation all I can do is disable and then enable the service again to get it working. It never comes back up automatically...
[root@NODE2 log]# clusvcadm -d BBDD
Local machine disabling service:BBDD...Yes
[root@NODE2 log]# clusvcadm -e BBDD
Local machine trying to enable service:BBDD...Success
service:BBDD is now running on node2
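
By the way, between these steps I check where the service is with clustat, something like this (I believe -s limits the output to a single service, though plain clustat works too):

[root@NODE2 log]# clustat
[root@NODE2 log]# clustat -s BBDD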

Any explanation for this behaviour?

I'm completely astonished :-(

TIA

ESG


2009/2/13 ESGLinux <esggrupos@xxxxxxxxx>
More clues,

Using system-config-cluster:

When I try to start a service that is in the failed state I always get an error. I have to disable the service first so that it goes to the disabled state; only from that state can I start the services again.
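
For example, once it is disabled I try to bring it up directly on node1 with something like this (assuming I have the -m syntax right):

[root@NODE2 log]# clusvcadm -d BBDD
[root@NODE2 log]# clusvcadm -e BBDD -m node1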

I think I have a problem with relocation, because I cannot do it with luci, nor with system-config-cluster, nor with clusvcadm.

I always get an error when I try it.

greetings

ESG


2009/2/13 ESGLinux <esggrupos@xxxxxxxxx>
Hello,

The services run fine on node1. If I halt node2 and try to run the services, they run fine on node1.
If I run the services outside the cluster they also run fine.

I have removed the HTTP service and left only the BBDD service in order to debug the problem. Here is the log from node2 when the service is running on node2 and node1 comes up:

Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering GATHER state from 11.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Creating commit token because I am the rep.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Saving state aru 1a high seq received 1a
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Storing new sequence id for ring 17f4
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering COMMIT state.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering RECOVERY state.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [0] member 192.168.1.185:
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.185
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 1a high delivered 1a received flag 1
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [1] member 192.168.1.188:
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.188
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 9 high delivered 9 received flag 1
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Did not need to originate any messages in recovery.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Sending initial ORF token
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
Feb 13 09:16:00 NODE2 openais[3326]: [SYNC ] This node is within the primary component and will provide service.
Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering OPERATIONAL state.
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message 192.168.1.185
Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message 192.168.1.188
Feb 13 09:16:00 NODE2 openais[3326]: [CPG  ] got joinlist message from node 2
Feb 13 09:16:03 NODE2 kernel: dlm: connecting to 1
Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Relocating service:BBDD to better node node1
Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Stopping service service:BBDD
Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb > Failed - Application Is Still Running
Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb > Failed
Feb 13 09:16:25 NODE2 clurgmgrd[4001]: <notice> stop on mysql "mydb" returned 1 (generic error)
Feb 13 09:16:25 NODE2 avahi-daemon[3872]: Withdrawing address record for 192.168.1.183 on eth0.
Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <crit> #12: RG service:BBDD failed to stop; intervention required
Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <notice> Service service:BBDD is failed
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <warning> #70: Failed to relocate service:BBDD; restarting locally
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <err> #43: Service service:BBDD has failed; can not start.
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #2: Service service:BBDD returned failure code.  Last Owner: node2
Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #4: Administrator intervention required.


As you can see, the log says "Relocating service:BBDD to better node node1",

but the relocation fails.

Another error that appears frequently in my logs is this one:

<err> Checking Existence Of File /var/run/cluster/mysql/mysql:mydb.pid [mysql:mydb] > Failed - File Doesn't Exist

I don't know if this is important, but I think it is what causes the "<err> Stopping Service mysql:mydb > Failed - Application Is Still Running" message, and that in turn makes the service fail (I'm just guessing...).
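
To check that guess by hand I run something like this on node2 while the stop is failing (the pid path is the one from the log above; what the agent actually does with it is only my guess):

[root@NODE2 ~]# ls -l /var/run/cluster/mysql/mysql:mydb.pid
[root@NODE2 ~]# ps -C mysqld -o pid,args

If mysqld shows up in ps but the pid file the agent expects is missing, I suppose the stop script cannot confirm the shutdown and reports "Application Is Still Running".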

Any idea?


ESG


2009/2/12 rajveer singh <torajveersingh@xxxxxxxxx>
Hi,
 
OK, perhaps there is some problem with the services on node1. Are you able to run these services on node1 without the cluster? First stop the cluster, then try to run the services on node1 directly.

They should run.
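
Something along these lines, assuming the stock init scripts; the interface name and netmask for the floating IP are only an example, adjust them to your setup:

[root@node1 ~]# service rgmanager stop
[root@node1 ~]# service cman stop
[root@node1 ~]# ip addr add 192.168.1.183/24 dev eth0
[root@node1 ~]# service mysqld start
[root@node1 ~]# service httpd start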
 
Re,
Rajveer Singh
 
2009/2/13 ESGLinux <esggrupos@xxxxxxxxx>

Hello,

That's what I want: when node1 comes up I want the services to relocate to node1, but what I actually get is all my services stopped and in the failed state.

With my configuration I expect to have the services running on node1.

Any idea about this behaviour?

Thanks

ESG


2009/2/12 rajveer singh <torajveersingh@xxxxxxxxx>



2009/2/12 ESGLinux <esggrupos@xxxxxxxxx>
Hello all,

I'm testing a cluster using luci as the admin tool. I have configured 2 nodes with 2 services, http + mysql. This configuration works almost fine: with the services running on node1, I reboot node1. The services relocate to node2 and everything keeps working, but when node1 comes back up all the services stop.

I think that node1, when it comes back up, tries to start the services and that is what makes them stop. Can that be true? I think node1 should not start anything, because the services are already running on node2.

Perhaps it is a problem with the configuration, perhaps with fencing (I have not configured fencing at all).

here is my cluster.conf. Any idea?

Thanks in advance,

ESG


<?xml version="1.0"?>
<cluster alias="MICLUSTER" config_version="29" name="MICLUSTER">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains>
                        <failoverdomain name="DOMINIOFAIL" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="192.168.1.183" monitor_link="1"/>
                </resources>
                <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="HTTP" recovery="relocate">
                        <apache config_file="conf/httpd.conf" name="http" server_root="/etc/httpd" shutdown_wait="0"/>
                        <ip ref="192.168.1.183"/>
                </service>
                <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="BBDD" recovery="relocate">
                        <mysql config_file="/etc/my.cnf" listen_address="192.168.1.183" name="mydb" shutdown_wait="0"/>
                        <ip ref="192.168.1.183"/>
                </service>
        </rm>
</cluster>
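
For reference, if I add fencing later I understand each <clusternode> would get a <fence> method plus a matching entry under <fencedevices>; roughly like this (the fence_ipmilan agent, the address and the credentials are only placeholders, not my real values):

        <clusternode name="node1" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="ipmi-node1"/>
                        </method>
                </fence>
        </clusternode>
        <!-- node2 would get its own method and device the same way -->
        <fencedevices>
                <fencedevice agent="fence_ipmilan" name="ipmi-node1" ipaddr="192.168.1.201" login="admin" passwd="secret"/>
        </fencedevices>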



Hi ESG,
 
Of course: since you have defined the priority of node1 as 1 and node2 as 2, node1 has the higher priority, so whenever it comes back up the cluster will try to run the service on it, and will relocate the service from node2 to node1.
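
If you do not want that automatic failback, I think you can set nofailback="1" on the failover domain (bump config_version and propagate it, for example with ccs_tool update /etc/cluster/cluster.conf); roughly:

                        <failoverdomain name="DOMINIOFAIL" nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>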
 
 
Re,
Rajveer Singh








--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
