Hi Jeff & Lon,
Thanks for the reply.
Regarding the failed-failover issue (the status just displayed "Owner -->
unknown" and "State --> started", but in fact none of the services were
available): I checked the log and agree that it is the fence_manual
problem. The log messages show that fence_manual was waiting for node2
to rejoin the cluster; as soon as I executed the command
fence_ack_manual -n node2, the failed services failed over to node1 and
all of them came back to normal.
I would like to know if there is any solution or workaround for this
situation other than buying a fence device :) Can I remove the fence
package (fence.rpm)? Would that cause any other problems? In a
production environment we never know when a machine will go down, so we
cannot always execute the fence_ack_manual command immediately. (My
current fencing configuration is sketched after the log excerpt below.)
========/var/log/messages======
kernel: CMAN: removing node node2 from the cluster : Missed too many
heartbeats
fenced[2447]: node2 not a cluster member after 0 sec post_fail_delay
fenced[2447]: fencing node "node2"
fence_manual: Node node2 needs to be reset before recovery can procede.
Waiting for node2 to rejoin the cluster or for manual acknowledgement
that it has been reset (i.e. fence_ack_manual -n node2)
=======END================
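For reference, my fencing configuration looks roughly like this
(paraphrased from memory, so treat it as a sketch; the fencedevice name
"human" is just what I called it in my cluster.conf):

========cluster.conf (fencing excerpt)======
<clusternode name="node2" votes="1">
    <fence>
        <method name="1">
            <!-- manual fencing: fenced blocks here until acked -->
            <device name="human" nodename="node2"/>
        </method>
    </fence>
</clusternode>
...
<fencedevices>
    <fencedevice agent="fence_manual" name="human"/>
</fencedevices>
=======END================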
Regarding the monitor_link issue: I tried setting monitor_link=1 for
both IP resources, i.e. 192.168.0.111 and 192.168.0.112 (the resource
definitions are sketched below), then shut down eth0 on node2 and
re-enabled it. When I afterwards tried to restart rgmanager on node2,
i.e. the failed node, it hung showing the message "Shutting down Cluster
Service Manager... Waiting for services to stop:", and I had to kill the
rgmanager processes or, even worse, reset the machine. Any ideas?
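This is roughly what the resource section looks like (again paraphrased
from my cluster.conf; only the two IP resources are relevant here):

========cluster.conf (resources excerpt)======
<rm>
    <resources>
        <!-- monitor_link=1 should make rgmanager watch the NIC link state -->
        <ip address="192.168.0.111" monitor_link="1"/>
        <ip address="192.168.0.112" monitor_link="1"/>
    </resources>
    ...
</rm>
=======END================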
One more thing: even with monitor_link=0 in cluster.conf, the Monitor
Link box under system-config-cluster --> Resources --> IP Address is
still ticked! Why?
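To be concrete, the line in question reads (same address as above):

    <ip address="192.168.0.111" monitor_link="0"/>

yet the GUI still shows the Monitor Link checkbox as ticked.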
Many thanks,
Dicky