Re: Failover root cause

Yu <songyu555@xxxxxxxxx> · Fri, 9 Nov 2012 14:40:51 +1100

Regardless of what root cause you find, the cluster requires the NTP service so that all nodes keep their time synchronized. You need to fix this 5-minute difference now.
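
Until the clocks are in sync, the two logs can still be lined up by shifting one node's timestamps by the known skew and sorting everything into one timeline. The following is only a rough sketch under stated assumptions: the helper names (parse_syslog_time, merge_logs) are made up for illustration, the year 2012 is assumed because syslog stamps omit it, each log entry is assumed to be on a single unwrapped line, and the 5-minute offset and its direction have to be measured on the actual servers.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year=2012):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Syslog stamps carry no year, so one is assumed (2012 here).
    """
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def merge_logs(node1_lines, node2_lines, node1_offset):
    """Return (time, line) pairs from both nodes in one sorted timeline.

    node1_offset is added to every node1 stamp to express it in
    node2's clock; a negative timedelta shifts the other way.
    """
    events = [(parse_syslog_time(l) + node1_offset, l) for l in node1_lines]
    events += [(parse_syslog_time(l), l) for l in node2_lines]
    return sorted(events, key=lambda event: event[0])


if __name__ == "__main__":
    # Assumption: node1's clock is 5 minutes behind node2's.
    # Measure the real skew (and its sign) before trusting the result.
    skew = timedelta(minutes=5)
    node1 = ["Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing /etc/init.d/httpd status"]
    node2 = ["Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed, forming new configuration."]
    for when, line in merge_logs(node1, node2, skew):
        print(when.isoformat(), line)
```

In practice you would feed it the relevant lines from /var/log/messages on each node. It only helps with reading the old logs; it is not a substitute for running NTP on both nodes.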

Regards
Yu

On 09/11/2012, at 11:47, Muhammad Panji <sumodirjo@xxxxxxxxx> wrote:

> Dear All,
> I have an Oracle cluster on RHEL 6.2 with 2 servers. Several days ago
> the service failed over from node1 to node2. In /var/log/messages
> on node2 I see only these messages:
> 
> ...
> Oct 23 12:54:19 db2svr corosync[4142]:   [TOTEM ] A processor failed,
> forming new configuration.
> Oct 23 12:54:21 db2svr corosync[4142]:   [QUORUM] Members[1]: 2
> Oct 23 12:54:21 db2svr corosync[4142]:   [TOTEM ] A processor joined
> or left the membership and a new membership was formed.
> Oct 23 12:54:21 db2svr kernel: dlm: closing connection to node 1
> Oct 23 12:54:21 db2svr rgmanager[5327]: State change: clu1 DOWN
> Oct 23 12:54:21 db2svr fenced[4193]: fencing node clu1
> ...
> 
> Googling the message "[TOTEM ] A processor failed, forming new
> configuration." I learned that it means node2 could no longer see node1
> and therefore fenced it. On node1 I get these messages:
> 
> Oct 23 12:50:45 db1svr rgmanager[75890]: [script] Executing
> /etc/init.d/httpd status
> Oct 23 12:56:01 db1svr kernel: imklog 4.6.2, log source = /proc/kmsg started.
> Oct 23 12:56:01 db1svr rsyslogd: [origin software="rsyslogd"
> swVersion="4.6.2" x-pid="3792" x-info="http://www.rsyslog.com"]
> (re)start
> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpuset
> Oct 23 12:56:01 db1svr kernel: Initializing cgroup subsys cpu
> Oct 23 12:56:01 db1svr kernel: Linux version 2.6.32-220.el6.x86_64
> (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.4.5 20110214
> (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
> 
> At 12:50 rgmanager was still checking the service, and then the node
> rebooted. What makes it worse is that the clocks on the two servers
> differ, so I can't compare the logs directly. The current time
> difference between the servers is around 5 minutes.
> 
> Where should I look for the cause of this failover? I plan to graph
> sar data today to see whether there was a bottleneck on CPU etc. that
> kept node1 from sending its status to node2. If there was no bottleneck
> on CPU, RAM etc., where else should I look for the root cause?
> Thank you.
> Regards,
> 
> -- 
> Muhammad Panji
> http://www.panji.web.id
> http://www.kurungsiku.com
> 
> -- 
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
