Re: bonding

I have the same hardware configuration on 11 nodes, but without any of the spurious failover events.  The main thing I had to do differently was increase the bond device count to 2 (the driver defaults to only 1), since my interfaces are teamed across the dual tg3/e1000 ports on the motherboard and a PCI card.  bond0 is on a gigabit switch, while bond1 is on a 100 Mb switch.  In /etc/modprobe.conf:

alias bond0 bonding
alias bond1 bonding
options bonding max_bonds=2 mode=1 miimon=100 updelay=200
alias eth0 e1000
alias eth1 e1000
alias eth2 tg3
alias eth3 tg3
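
The ifcfg side is the standard MASTER/SLAVE setup under /etc/sysconfig/network-scripts -- roughly like this (the addresses below are placeholders, not copied from my nodes):

# ifcfg-bond0 (bond1 is the same idea with its own address)
DEVICE=bond0
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.168.1.10
NETMASK=255.255.255.0

# ifcfg-eth0 (likewise eth2; eth1/eth3 point at bond1 instead)
DEVICE=eth0
BOOTPROTO=none
ONBOOT=yes
MASTER=bond0
SLAVE=yes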

So eth0/eth2 are teamed, and eth1/eth3 are teamed.  In dmesg:

e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
bonding: bond0: making interface eth0 the new active one 0 ms earlier.
bonding: bond0: enslaving eth0 as an active interface with an up link.
bonding: bond0: enslaving eth2 as a backup interface with a down link.
tg3: eth2: Link is up at 1000 Mbps, full duplex.
tg3: eth2: Flow control is on for TX and on for RX.
bonding: bond0: link status up for interface eth2, enabling it in 200 ms.
bonding: bond0: link status definitely up for interface eth2.
e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
bonding: bond1: making interface eth1 the new active one 0 ms earlier.
bonding: bond1: enslaving eth1 as an active interface with an up link.
bonding: bond1: enslaving eth3 as a backup interface with a down link.
bond0: duplicate address detected!
tg3: eth3: Link is up at 100 Mbps, full duplex.
tg3: eth3: Flow control is off for TX and off for RX.
bonding: bond1: link status up for interface eth3, enabling it in 200 ms.
bonding: bond1: link status definitely up for interface eth3.

$ uname -srvmpio
Linux 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0a

Slave Interface: eth2
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:54

$ cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth1
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 0

Slave Interface: eth1
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:11:0a:5f:1e:0b

Slave Interface: eth3
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:17:a4:a7:9a:53
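
One other difference I notice, for what it's worth: I run with updelay=200 while your /proc output shows Up Delay (ms): 0, so my bonds wait out brief link transitions before re-enabling a slave.  If you want to keep an eye on whether anything is flapping, the failure counters are easy to watch -- a rough sketch (the interval and bond name are just examples):

watch -n 5 'grep -E "Slave Interface|Link Failure" /proc/net/bonding/bond0'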


On Thu, 2007-04-12 at 08:45 -0400, Scott McClanahan wrote:
I have every node in my four-node cluster set up to do active-backup
bonding, and the drivers loaded for the bonded network interfaces vary
between tg3 and e100.  All interfaces using the e100 driver report
errors much like what you see here:

bonding: bond0: link status definitely down for interface eth2, disabling it
e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
bonding: bond0: link status definitely up for interface eth2.

This happens all day long on every node.  I have configured the bonding
module to do MII link monitoring every 100 milliseconds, and it uses
basic carrier link detection to test whether the interface is alive.
None of the modules were custom-built on these nodes, and the OS is
CentOS 4.3.
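
As far as I know that carrier test just reflects the driver's own idea of the link, i.e. roughly what ethtool reports, e.g. (run as root):

ethtool eth2 | grep -E 'Speed|Duplex|Link detected'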

Some more relevant information is below (this output is consistent
across all nodes):

[smccl@tf35 ~]$uname -srvmpio
Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686 i386 GNU/Linux

[smccl@tf35 ~]$head -5 /etc/modprobe.conf
alias bond0 bonding
options bonding miimon=100 mode=1
alias eth0 tg3
alias eth1 tg3
alias eth2 e100

[smccl@tf35 ~]$cat /proc/net/bonding/bond0 
Ethernet Channel Bonding Driver: v2.6.1 (October 29, 2004)

Bonding Mode: fault-tolerance (active-backup)
Primary Slave: None
Currently Active Slave: eth0
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 0
Down Delay (ms): 0

Slave Interface: eth0
MII Status: up
Link Failure Count: 0
Permanent HW addr: 00:10:18:0c:86:a4

Slave Interface: eth2
MII Status: up
Link Failure Count: 12
Permanent HW addr: 00:02:55:ac:a2:ea

Any idea why these e100 links report failures so often?  They are
directly plugged into a Cisco Catalyst 4506.  Thanks.


Robert Hurst, Sr. Caché Administrator
Beth Israel Deaconess Medical Center
1135 Tremont Street, REN-7
Boston, Massachusetts   02120-2140
617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
Any technology distinguishable from magic is insufficiently advanced.


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
