On Sun, 2004-10-24 at 01:53 +0200, Eli Elizur wrote:
> I'm having problems failing over to my backup system.

> Yet, when the user script sends an "exit 1" the cluster is stopping
> the service but do no start it on the backup machine.

In RHEL 2.1, the exit status of the 'stop' phase is very important. A
nonzero exit means that the user script could not, for some reason,
clean up the user service in its entirety. Because of this, the status
of the user service is unknown. We do not know that the service was
fully cleaned up, so it is *not* safe to relocate the service to another
node (ever).

In this case, there are generally two things you can do:

(1) Nothing. Let other services in the cluster and on the node continue
to run normally; the service will remain broken until fixed by an
administrator.

(2) Reboot the node to ensure that any allocated resources are cleaned
up.

In RHEL 2.1, we opted for (1).

Given that, your user script should only ever return nonzero in the
'stop' path if it encounters something that it can *not* fix
automatically. A nonzero return indicates to the cluster software that
the service has failed and is in a state which can *not* be recovered
automatically. If the service can be recovered, your user script must
recover it and return '0' from the stop path. (This means that if the
service was NOT running at the time the 'stop' phase was called, you
must still return '0' from the stop path.)

If you wish for behavior (2), simply change your script to run
"/sbin/reboot -fn" instead of returning nonzero from the stop path.

-- Lon
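
For illustration, here is a minimal sketch of a stop path that follows the rules above. It is not taken from any shipped script: the PID file location, the service name "myservice", and the function names service_is_running and stop_service are all hypothetical, and a real service will likely need its own cleanup steps (unmounting, releasing IPs, and so on).

```shell
#!/bin/sh
# Sketch of a user script 'stop' path (all names here are hypothetical).
# Rule: return 0 whenever the service ends up fully cleaned up, including
# when it was not running to begin with; return nonzero only when cleanup
# could not be completed automatically.

PIDFILE="${PIDFILE:-/var/run/myservice.pid}"   # assumed location

service_is_running() {
    # True only if the PID file exists and names a live process.
    [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null
}

stop_service() {
    # Not running (or stale PID file): nothing to clean up, so succeed.
    if ! service_is_running; then
        rm -f "$PIDFILE"
        return 0
    fi

    pid=$(cat "$PIDFILE")
    kill "$pid" 2>/dev/null

    # Give the process up to roughly 10 seconds to exit on its own.
    i=0
    while kill -0 "$pid" 2>/dev/null && [ "$i" -lt 10 ]; do
        sleep 1
        i=$((i + 1))
    done

    # Still alive: try to recover by forcing it down.
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null
        sleep 1
    fi

    if kill -0 "$pid" 2>/dev/null; then
        # Cleanup failed. Behavior (1): report failure and leave the
        # service for an administrator. For behavior (2), this is where
        # the script would run /sbin/reboot -fn instead.
        return 1
    fi

    rm -f "$PIDFILE"
    return 0
}

# Dispatch only the 'stop' action in this sketch; the script's exit
# status is then the status of stop_service.
case "${1:-}" in
    stop) stop_service ;;
esac
```

The point of the structure is that every branch which leaves the service cleanly stopped returns 0, so the only nonzero exit is the one case the script genuinely could not fix.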