Re: Cluster Suite 4 failover problem

Jonathan Daniels <jon.daniels@xxxxxxxxxxx> · Thu, 19 Oct 2006 16:45:44 +0100

Hi,

What is being logged to /var/log/messages on each node? That should
provide a clue as to what the problem is.  Also, did you install the
'fence' RPM and any Clustered LVM / GFS RPMs?
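
To narrow down what to look at in the logs, something like the following
could be run on each node. It is only a minimal sketch: the list of
daemon names is an assumption about which components are worth
filtering for, and the helper names are made up for illustration.

```python
import re

# Assumed set of cluster-related names to look for; adjust to your setup.
CLUSTER_TAGS = ("fenced", "clurgmgrd", "ccsd", "cman", "dlm")


def cluster_lines(lines, tags=CLUSTER_TAGS):
    """Return the lines that mention any of the tags (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(t) for t in tags), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def cluster_lines_from_file(path="/var/log/messages"):
    """Read a syslog file and keep only the cluster-related lines."""
    with open(path, errors="replace") as fh:
        return cluster_lines(fh)
```

Calling cluster_lines_from_file() as root on node1 and node2 and
comparing the two results around the time eth0 went down should show
whether fencing was attempted and whether it ever completed.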

You might also consider rebooting the "downed" node. Fencing devices
normally take care of this automatically but, as I understand it,
"manual fencing" means you have to reboot it yourself :), the
assumption being that a failed node won't be allowed back into the
cluster until it has been restarted.
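
To see which nodes depend on manual fencing, the cluster.conf can be
checked with a few lines of Python. This is a minimal sketch, not an
official tool: the function name is made up, and it only assumes the
element layout shown in the cluster.conf quoted below.

```python
import xml.etree.ElementTree as ET


def manually_fenced_nodes(cluster_conf_xml):
    """Return names of nodes whose fence methods use a fence_manual device."""
    root = ET.fromstring(cluster_conf_xml)

    # Names of all fence devices backed by the fence_manual agent.
    manual_devices = {
        dev.get("name")
        for dev in root.findall("./fencedevices/fencedevice")
        if dev.get("agent") == "fence_manual"
    }

    nodes = []
    for node in root.findall("./clusternodes/clusternode"):
        used = {d.get("name") for d in node.findall("./fence/method/device")}
        if used & manual_devices:
            nodes.append(node.get("name"))
    return nodes
```

Passing it the contents of the configuration file (normally
/etc/cluster/cluster.conf) lists the nodes that nothing will reboot
automatically. For the configuration quoted below that is both node1
and node2.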

Thanks,
Jon

Dicky wrote:

Hi All,

I have two machines, node1 (192.168.0.27) and node2 (192.168.0.28),
each with one NIC and running Red Hat Cluster Suite 4 with DLM. I have
created a manual fence device, a failover domain, and two services:
"www" (listening address 192.168.0.111) and "ftp" (listening address
192.168.0.112).

After the initial setup, everything seemed to work fine: I can
relocate the services from node1 to node2 and vice versa manually, and
stop and start them.

But when I tried to test the failover capability by shutting down the
network on one node (e.g. bringing down eth0 on node1), the failover
did not work most of the time. This is the scenario I tested:

Scenario: Services running on node1, then I shut down eth0 on node1.

Result: The services did not fail over to node2, and clustat on node1
shows:

Member Status: Quorate

  Member Name          Status
  ------ ----          ------
  node1                Offline
  node2                Online, Local, rgmanager

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  ftp                  unknown              started
  www                  unknown              started

Both services were no longer working. When I brought eth0 back up on
node1 and restarted the cman service on node1, it still didn't work.
Also, when I tried to restart rgmanager on node1, it only showed
"Waiting for services to stop: " and waited forever. Even killing the
rgmanager process didn't help. Finally, I had to reset both machines
to get the cluster services back to normal.

I would appreciate any help, or hearing from anyone who has run into
this before. I have attached the cluster.conf below for reference.

======cluster.conf=========
<?xml version="1.0"?>
<cluster config_version="34" name="alpha_cluster">
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="node1" votes="1">
            <fence>
                <method name="1">
                    <device name="Fence" nodename="node1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="node2" votes="1">
            <fence>
                <method name="1">
                    <device name="Fence" nodename="node2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_manual" name="Fence"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="aaa" ordered="0" restricted="0">
                <failoverdomainnode name="node1" priority="1"/>
                <failoverdomainnode name="node2" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="192.168.0.111" monitor_link="0"/>
            <script file="/etc/rc.d/init.d/httpd" name="www"/>
            <script file="/etc/rc.d/init.d/vsftpd" name="ftp"/>
            <ip address="192.168.0.112" monitor_link="0"/>
        </resources>
        <service autostart="1" domain="aaa" name="ftp" recovery="relocate">
            <ip ref="192.168.0.112"/>
            <script ref="ftp"/>
        </service>
        <service autostart="1" domain="aaa" name="www" recovery="relocate">
            <ip ref="192.168.0.111"/>
            <script ref="www"/>
        </service>
    </rm>
</cluster>
==========END==========

Many Thanks,
Dicky

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
