Split Brain

Hi

I have a problem with my current cluster.
We have a 2-node cluster (DL385 G2, no external storage) running Red Hat 4 Update 5 and Cluster Suite 4 Update 5.

When the nodes lose communication we end up with two cluster instances, each with the service up :( .. too bad. I don't understand why neither node tries to fence the other before forming its own cluster.
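For reference, this is roughly the shape of our cluster.conf, reduced to a minimal sketch (the cluster name, iLO hostnames, login and password below are placeholders, not the real values):

==================================================
<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <!-- two_node lets a 2-node cluster stay quorate with a single vote -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node1" votes="1">
      <fence>
        <method name="1">
          <device name="ilo_node1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="node2" votes="1">
      <fence>
        <method name="1">
          <device name="ilo_node2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <!-- one iLO fence device per node; addresses/credentials are placeholders -->
    <fencedevice name="ilo_node1" agent="fence_ilo" hostname="ilo-node1" login="admin" passwd="secret"/>
    <fencedevice name="ilo_node2" agent="fence_ilo" hostname="ilo-node2" login="admin" passwd="secret"/>
  </fencedevices>
</cluster>
==================================================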

These are the logs:

Node 1
==================================================
Jan 20 22:17:42 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:17:48 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:17:48 node1 su(pam_unix)[11307]: session opened for user app_usr by (uid=0)
Jan 20 22:17:48 node1 su(pam_unix)[11307]: session closed for user app_usr
Jan 20 22:18:18 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:18:18 node1 su(pam_unix)[11533]: session opened for user app_usr by (uid=0)
Jan 20 22:18:18 node1 su(pam_unix)[11533]: session closed for user app_usr
Jan 20 22:18:33 node1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Down
Jan 20 22:18:33 node1 kernel: bonding: bond0: link status definitely down for interface eth2, disabling it
Jan 20 22:18:33 node1 kernel: bonding: bond0: making interface eth0 the new active one.
Jan 20 22:18:37 node1 kernel: e1000: eth2: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Jan 20 22:18:37 node1 kernel: bonding: bond0: link status definitely up for interface eth2.
Jan 20 22:18:43 node1 kernel: bnx2: eth0 NIC Link is Down
Jan 20 22:18:43 node1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 20 22:18:43 node1 kernel: bonding: bond0: making interface eth2 the new active one.
Jan 20 22:18:46 node1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jan 20 22:18:46 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:19:03 node1 kernel: CMAN: removing node node2 from the cluster : Missed too many heartbeats
Jan 20 22:19:05 node1 clurgmgrd[4081]: <info> Magma Event: Membership Change
Jan 20 22:19:05 node1 clurgmgrd[4081]: <info> State change: node2 DOWN
Jan 20 22:19:06 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:19:06 node1 su(pam_unix)[11780]: session opened for user app_usr by (uid=0)
Jan 20 22:19:06 node1 su(pam_unix)[11780]: session closed for user app_usr
Jan 20 22:19:22 node1 kernel: bnx2: eth0 NIC Link is Down
Jan 20 22:19:22 node1 kernel: bonding: bond0: link status definitely down for interface eth0, disabling it
Jan 20 22:19:25 node1 kernel: bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
Jan 20 22:19:25 node1 kernel: bonding: bond0: link status definitely up for interface eth0.
Jan 20 22:19:40 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:19:40 node1 su(pam_unix)[12037]: session opened for user app_usr by (uid=0)
Jan 20 22:19:40 node1 su(pam_unix)[12037]: session closed for user app_usr
Jan 20 22:20:10 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:20:10 node1 su(pam_unix)[12236]: session opened for user app_usr by (uid=0)
Jan 20 22:20:10 node1 su(pam_unix)[12236]: session closed for user app_usr
Jan 20 22:20:40 node1 clurgmgrd: [4081]: <info> Executing /home/app/myservice.sh status
Jan 20 22:20:40 node1 su(pam_unix)[12461]: session opened for user app_usr by (uid=0)
Jan 20 22:20:40 node1 su(pam_unix)[12461]: session closed for user app_usr
=====================================================================

Node 2
=====================================================================
Jan 20 22:10:22 node2 sshd(pam_unix)[22703]: session opened for user app_usr by (uid=0)
Jan 20 22:10:22 node2 sshd(pam_unix)[22703]: session closed for user app_usr
Jan 20 22:10:24 node2 sshd(pam_unix)[22741]: session opened for user app_usr by (uid=0)
Jan 20 22:10:24 node2 sshd(pam_unix)[22741]: session closed for user app_usr
Jan 20 22:20:07 node2 sshd(pam_unix)[23541]: session opened for user app_usr by (uid=0)
Jan 20 22:20:07 node2 sshd(pam_unix)[23541]: session closed for user app_usr
Jan 20 22:20:09 node2 sshd(pam_unix)[23578]: session opened for user app_usr by (uid=0)
Jan 20 22:20:09 node2 sshd(pam_unix)[23578]: session closed for user app_usr
Jan 20 22:21:38 node2 kernel: CMAN: removing node node1 from the cluster : Missed too many heartbeats
Jan 20 22:21:40 node2 clurgmgrd[4177]: <info> Magma Event: Membership Change
Jan 20 22:21:40 node2 clurgmgrd[4177]: <info> State change: node1 DOWN
Jan 20 22:21:41 node2 clurgmgrd[4177]: <notice> Taking over service myservice from down member (null)
Jan 20 22:21:41 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.65.1 to bond0
Jan 20 22:21:42 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.65.10 to bond0
Jan 20 22:21:43 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh start
Jan 20 22:21:43 node2 su(pam_unix)[23855]: session opened for user app_usr by (uid=0)
Jan 20 22:21:43 node2 su(pam_unix)[23855]: session closed for user app_usr
Jan 20 22:21:43 node2 clurgmgrd: [4177]: <info> Adding IPv4 address 10.10.70.20 to bond1
Jan 20 22:21:44 node2 clurgmgrd[4177]: <notice> Service myservice started
Jan 20 22:21:50 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:21:50 node2 su(pam_unix)[24022]: session opened for user app_usr by (uid=0)
Jan 20 22:21:50 node2 su(pam_unix)[24022]: session closed for user app_usr
Jan 20 22:22:20 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:22:20 node2 su(pam_unix)[24244]: session opened for user app_usr by (uid=0)
Jan 20 22:22:20 node2 su(pam_unix)[24244]: session closed for user app_usr
Jan 20 22:22:50 node2 clurgmgrd: [4177]: <info> Executing /home/app/myservice.sh status
Jan 20 22:22:50 node2 su(pam_unix)[24469]: session opened for user app_usr by (uid=0)
=================================================================

I have configured the fence devices and power-off works fine .... when I power up the machines, the first one to start up fences the other and startup then continues OK.
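This is how I tested the fencing by hand, in case it matters (the iLO address and credentials here are placeholders, and I am assuming I have the agent's flags right):

==================================================
# Power a node off through its iLO by calling the fence agent directly
# (placeholder address/credentials):
fence_ilo -a 10.10.65.102 -l admin -p secret -o off

# Membership and quorum state as CMAN sees it, run on each node:
cman_tool status
cman_tool nodes
==================================================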


Any help will be appreciated ..
Sorry for my bad English.
Luis G.


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
