I don't know that I'd need to increase max_bonds, since I only have one bond on each node, but I have considered resorting to the old MII or ETHTOOL ioctl method to determine link state (a rough sketch of what I mean is below the quoted thread). You are running a newer kernel, and I haven't checked the changelog to see what differences might be pertinent, but the main difference is that you are using e1000 drivers where I am using the e100 driver. I just can't seem to associate the link status failures with any other events on the box; it's really strange.

On Thu, 2007-04-12 at 09:52 -0400, rhurst@xxxxxxxxxxxxxxxxx wrote:
> I have the same hardware configuration for 11 nodes, but without any
> of the spurious failover events. The main thing different I had to do
> was to increase the bond device count to 2 (the driver defaults to
> only 1), as I have mine teamed between dual tg3/e1000 ports from the
> mobo and PCI card. bond0 is on a gigabit switch, while bond1 is on
> 100mb. In /etc/modprobe.conf:
>
> alias bond0 bonding
> alias bond1 bonding
> options bonding max_bonds=2 mode=1 miimon=100 updelay=200
> alias eth0 e1000
> alias eth1 e1000
> alias eth2 tg3
> alias eth3 tg3
>
> So eth0/eth2 are teamed, and eth1/eth3 are teamed. In dmesg:
>
> e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
> bonding: bond0: making interface eth0 the new active one 0 ms earlier.
> bonding: bond0: enslaving eth0 as an active interface with an up link.
> bonding: bond0: enslaving eth2 as a backup interface with a down link.
> tg3: eth2: Link is up at 1000 Mbps, full duplex.
> tg3: eth2: Flow control is on for TX and on for RX.
> bonding: bond0: link status up for interface eth2, enabling it in 200 ms.
> bonding: bond0: link status definitely up for interface eth2.
> e1000: eth1: e1000_watchdog_task: NIC Link is Up 100 Mbps Full Duplex
> bonding: bond1: making interface eth1 the new active one 0 ms earlier.
> bonding: bond1: enslaving eth1 as an active interface with an up link.
> bonding: bond1: enslaving eth3 as a backup interface with a down link.
> bond0: duplicate address detected!
> tg3: eth3: Link is up at 100 Mbps, full duplex.
> tg3: eth3: Flow control is off for TX and off for RX.
> bonding: bond1: link status up for interface eth3, enabling it in 200 ms.
> bonding: bond1: link status definitely up for interface eth3.
>
> $ uname -srvmpio
> Linux 2.6.9-42.0.10.ELsmp #1 SMP Fri Feb 16 17:13:42 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
>
> $ cat /proc/net/bonding/bond0
> Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)
>
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth0
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 200
> Down Delay (ms): 0
>
> Slave Interface: eth0
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:11:0a:5f:1e:0a
>
> Slave Interface: eth2
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:17:a4:a7:9a:54
>
> $ cat /proc/net/bonding/bond1
> Ethernet Channel Bonding Driver: v2.6.3 (June 8, 2005)
>
> Bonding Mode: fault-tolerance (active-backup)
> Primary Slave: None
> Currently Active Slave: eth1
> MII Status: up
> MII Polling Interval (ms): 100
> Up Delay (ms): 200
> Down Delay (ms): 0
>
> Slave Interface: eth1
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:11:0a:5f:1e:0b
>
> Slave Interface: eth3
> MII Status: up
> Link Failure Count: 0
> Permanent HW addr: 00:17:a4:a7:9a:53
>
> On Thu, 2007-04-12 at 08:45 -0400, Scott McClanahan wrote:
> > I have every node in my four node cluster setup to do active-backup
> > bonding and the drivers loaded for the bonded network interfaces vary
> > between tg3 and e100. All interfaces with the e100 driver loaded report
> > errors much like what you see here:
> >
> > bonding: bond0: link status definitely down for interface eth2, disabling it
> > e100: eth2: e100_watchdog: link up, 100Mbps, full-duplex
> > bonding: bond0: link status definitely up for interface eth2.
> >
> > This happens all day on every node. I have configured the bonding
> > module to do MII link monitoring at a frequency of 100 milliseconds and
> > it is using basic carrier link detection to test if the interface is
> > alive or not. There was no custom building of any modules on these
> > nodes and the o/s is CentOS 4.3.
> >
> > Some more relevant information is below (this display is consistent
> > across all nodes):
> >
> > [smccl@tf35 ~]$uname -srvmpio
> > Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686 i386 GNU/Linux
> >
> > [smccl@tf35 ~]$head -5 /etc/modprobe.conf
> > alias bond0 bonding
> > options bonding miimon=100 mode=1
> > alias eth0 tg3
> > alias eth1 tg3
> > alias eth2 e100
> >
> > [smccl@tf35 ~]$cat /proc/net/bonding/bond0
> > Ethernet Channel Bonding Driver: v2.6.1 (October 29, 2004)
> >
> > Bonding Mode: fault-tolerance (active-backup)
> > Primary Slave: None
> > Currently Active Slave: eth0
> > MII Status: up
> > MII Polling Interval (ms): 100
> > Up Delay (ms): 0
> > Down Delay (ms): 0
> >
> > Slave Interface: eth0
> > MII Status: up
> > Link Failure Count: 0
> > Permanent HW addr: 00:10:18:0c:86:a4
> >
> > Slave Interface: eth2
> > MII Status: up
> > Link Failure Count: 12
> > Permanent HW addr: 00:02:55:ac:a2:ea
> >
> > Any idea why these e100 links report failures so often? They are
> > directly plugged into a Cisco Catalyst 4506. Thanks.
>
> Robert Hurst, Sr. Caché Administrator
> Beth Israel Deaconess Medical Center
> 1135 Tremont Street, REN-7
> Boston, Massachusetts 02120-2140
> 617-754-8754 ∙ Fax: 617-754-8730 ∙ Cell: 401-787-3154
> Any technology distinguishable from magic is insufficiently advanced.
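For what it's worth, the ioctl fallback I mentioned would just be a module option change, nothing exotic. This is only a sketch, and it assumes the bonding driver shipped with these 2.6.9-based kernels honors the use_carrier parameter (use_carrier=1, the default, trusts the driver's carrier state; use_carrier=0 falls back to the older MII/ETHTOOL ioctl queries). In /etc/modprobe.conf it would look something like:

alias bond0 bonding
# use_carrier=0 asks bonding to poll each slave with the older
# MII/ETHTOOL ioctls instead of relying on the driver's carrier flag
options bonding mode=1 miimon=100 use_carrier=0

The same information can be pulled by hand while a flap is happening, which might show whether the PHY and the driver disagree:

ethtool eth2 | grep 'Link detected'   # driver's view of the link
mii-tool -v eth2                      # PHY status via the MII ioctls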
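Since I can't tie the flaps to anything else on the box yet, I will probably also just log the failure counter with timestamps and line it up against /var/log/messages and the switch port logs afterwards. A rough sketch of that; the interface names and the one-minute interval are only my own choices:

# append a timestamped copy of eth2's slave status once a minute
while true; do
    echo "$(date '+%F %T') $(grep -A 2 'Slave Interface: eth2' /proc/net/bonding/bond0 | tr '\n' ' ')"
    sleep 60
done >> /tmp/bond0-eth2.log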
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster