On Fri, 2007-10-19 at 14:53 +0000, Glenn Aycock wrote:
> We are running RHCS on RHEL 4.5 and have a basic 2-node HA cluster
> configuration for a critical application in place and functional. The
> config looks like this:
>
> <?xml version="1.0"?>
> <cluster config_version="16" name="routing_cluster">
>   <fence_daemon post_fail_delay="0" post_join_delay="10"/>
>   <clusternodes>
>     <clusternode name="host1" votes="1">
>       <fence>
>         <method name="1">
>           <device name="manual" nodename="host1"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="host2" votes="1">
>       <fence>
>         <method name="1">
>           <device name="manual" nodename="host2"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <cman dead_node_timeout="10" expected_votes="1" two_node="1"/>
>   <fencedevices>
>     <fencedevice agent="fence_manual" name="manual"/>
>   </fencedevices>
>   <rm>
>     <failoverdomains>
>       <failoverdomain name="routing_servers" ordered="1" restricted="1">
>         <failoverdomainnode name="host1" priority="1"/>
>         <failoverdomainnode name="host2" priority="2"/>
>       </failoverdomain>
>     </failoverdomains>
>     <resources>
>       <script file="/etc/init.d/rsd" name="rsd"/>
>       <ip address="123.456.78.9" monitor_link="1"/>
>     </resources>
>     <service autostart="1" domain="routing_servers" name="routing_daemon" recovery="relocate">
>       <ip ref="123.456.78.9"/>
>       <script ref="rsd"/>
>     </service>
>   </rm>
> </cluster>
>
> The cluster takes about 15-20 seconds to notice that the daemon is
> down and migrate it to the other node. However, due to slow migration
> and startup time, we now require the daemon on the secondary to be
> active and only transfer the VIP in case it aborts on the primary.

You could start by decreasing the status-check time: tweak the "status"
and "monitor" action intervals in /usr/share/cluster/script.sh.

Change:

    <action name="status" interval="30s" timeout="0"/>
    <action name="monitor" interval="30s" timeout="0"/>

to:

    <action name="status" interval="10s" timeout="0"/>
    <action name="monitor" interval="10s" timeout="0"/>

(as an example...)

You can also make a wrapper script that skips the stop phase of your
rsd script unless rsd is already in a non-working state (to prevent the
stop-before-start that rgmanager normally does):

#!/bin/bash
#
# Wrapper around the rsd init script.  A healthy rsd is never
# stopped; only a broken instance gets cleaned up, so the daemon
# keeps running on a node even after the service relocates away.
SCR=/etc/init.d/rsd

case "$1" in
start)
        # Should be a no-op if rsd is already running
        $SCR start
        exit $?
        ;;
stop)
        # Don't actually stop it if it's running; just
        # clean it up if it's broken.  This app is
        # safe to run on multiple nodes.
        $SCR status
        if [ $? -ne 0 ]; then
                $SCR stop
                exit $?
        fi
        exit 0
        ;;
status)
        $SCR status
        exit $?
        ;;
esac
exit 0

(Note: rsd will have to be enabled on boot for this to work.)

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
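To make the wrapper approach above concrete: in the cluster.conf from the
original message, the <script> resource would point at the wrapper rather
than at /etc/init.d/rsd directly. A minimal sketch, assuming the wrapper is
saved as /usr/local/sbin/rsd-wrapper (a hypothetical path) and made
executable; nothing else in the config needs to change:

    <resources>
        <script file="/usr/local/sbin/rsd-wrapper" name="rsd"/>
        <ip address="123.456.78.9" monitor_link="1"/>
    </resources>

With this in place, rgmanager still relocates the service (and its VIP) on
failure, but the wrapper's stop phase leaves a healthy rsd running, so the
daemon stays active on both nodes and only the IP address moves.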
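As for the note that rsd must be enabled on boot: on RHEL 4 this would
typically be done with chkconfig. A sketch, assuming the rsd init script
carries a standard chkconfig header:

    chkconfig --add rsd   # register the init script with chkconfig
    chkconfig rsd on      # start rsd in the default runlevels at boot
    service rsd status    # confirm the daemon is running

This way the standby node brings rsd up at boot on its own, and the wrapper
keeps rgmanager from killing it while the service is assigned to the other
node.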