Bonded heartbeat channels on RH Cluster Suite v3

Hello all,

I'm experiencing some weird behaviour on RHCS v3, and I don't know whether it is my mistake.

The configuration is like this:
* 2-node RHCS Cluster (not GFS);
* two onboard NICs are channel bonded (bond0) for corporate network access;
* one offboard NIC is used for cluster network heartbeating.

Since the nodes are located in two separate buildings, the customer wanted to channel bond the heartbeat channel as well. There have been some Ethernet switch problems on the heartbeat channel before.

So we tried to add another bonded channel (bond1) to the setup, so that we would have a redundant heartbeat channel.

The setup went like this (sorry for the ASCII art):
+---------+      +----------+      +---------+
|         |----->|          |<-----|         |
| server1 |bond0 | ethernet | bond0| server2 |
|         |----->| switch   |<-----|         |
|         |      |          |      |         |
|         |----->+----------+<-----|         |
|         |bond1              bond1|         |
|         |<----crossover cable--->|         |
+---------+                        +---------+

For bond1 the customer wanted the following:
* to use the same Ethernet switch as the corporate network, since it is fully redundant (each cable plugged into a different physical switch);
* to use a crossover cable as the redundant connection of bond1, just in case the whole Ethernet switch solution goes down. The crossover cable here is an optical fiber run between the buildings;
* the heartbeat IP addresses for the servers are 10.1.1.3 (clu_server1) and 10.1.1.4 (clu_server2).
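Just as a side note (a generic check, nothing RHCS-specific): since one slave of bond1 goes through the switches and the other over the fiber crossover, the link on each path can be verified separately with something like:

  # ethtool eth2 | grep 'Link detected'
  # ethtool eth3 | grep 'Link detected'

Both slaves should report "Link detected: yes" independently of each other.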

  - /etc/modules.conf:
alias bond0 bonding
options bond0 -o bond0 mode=1 miimon=100
alias bond1 bonding
options bond1 -o bond1 mode=1 miimon=100
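(As far as we understand the "-o bond0" / "-o bond1" form, the bonding module gets loaded twice under separate names, one instance per bond. A rough way to confirm that after a network restart:

  # lsmod | egrep '^bond[01]'
  # dmesg | grep -i bonding | tail

lsmod should then list bond0 and bond1 as separate module instances, and the kernel log should show the bonding driver initializing twice.)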

  - /etc/sysconfig/network-scripts/ifcfg-bond1:
DEVICE=bond1
ONBOOT=yes
IPADDR=10.1.1.XXX
NETMASK=255.255.255.0
BOOTPROTO=none
TYPE=Bonding

  - /etc/sysconfig/network-scripts/ifcfg-eth2:
DEVICE=eth2
ONBOOT=yes
BOOTPROTO=none
MASTER=bond1
SLAVE=yes
TYPE=Ethernet

  - /etc/sysconfig/network-scripts/ifcfg-eth3:
DEVICE=eth3
ONBOOT=yes
BOOTPROTO=none
MASTER=bond1
SLAVE=yes
TYPE=Ethernet
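After a "service network restart" we would expect the bond itself to look sane; something like:

  # ifconfig bond1
  # cat /proc/net/bonding/bond1

(depending on the bonding driver version the proc file may be /proc/net/bond1/info instead) should show the 10.1.1.x address with the /24 netmask on bond1 itself, and eth2/eth3 as slaves with one of them marked as the currently active slave, since mode=1 is active-backup.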


Now the problem is that the cluster didn't come up, and we got some warnings in the logs:
00:19:25 server1 clumembd[11041]: <notice> Member clu_server1 UP
00:19:28 server1 clumembd[11041]: <warning> Dropping connect from 10.1.1.4: Not in subnet!
00:19:29 server1 cluquorumd[11039]: <warning> Dropping connect from 10.1.1.4: Not in subnet!
00:19:31 server1 cluquorumd[11039]: <notice> IPv4 TB @ 10.0.4.196 Online

00:18:59 server2 clumembd[17634]: <notice> Member clu_server1 UP
00:19:09 server2 clumembd[17634]: <warning> Dropping connect from 10.1.1.3: Not in subnet!
00:19:11 server2 clumembd[17634]: <notice> Member clu_server2 UP
00:19:19 server2 cluquorumd[17632]: <notice> IPv4 TB @ 10.0.4.196 Online

It seems that both servers "see" each other and both "see" the IPv4 Tiebreaker as Online, but they refuse to form quorum.

Removing the "bond1" configuration brought the cluster back to normal operation, but we can't understand what we did wrong here.
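By the way, the two heartbeat addresses really do sit on the same /24 as far as we can tell; here is a trivial sanity check with ipcalc from initscripts (and it is only our guess that clumembd/cluquorumd compare the peer's source address against the subnet of a locally configured interface):

  # ipcalc -n 10.1.1.3 255.255.255.0
  NETWORK=10.1.1.0
  # ipcalc -n 10.1.1.4 255.255.255.0
  NETWORK=10.1.1.0

So both addresses belong to 10.1.1.0/24, which makes the "Not in subnet!" warnings even more puzzling to us.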


Is this mixed switch+crossover setup for channel bonding wrong?

Please let me know if anyone can spot a mistake on our part.

Thank you all.

Regards,

Celso.
--
*Celso Kopp Webber*

celso@xxxxxxxxxxxxxxxx

*Webbertek - Opensource Knowledge*
(41) 8813-1919
(41) 3284-3035

