On 01/09/2012 12:12 AM, SATHYA - IT wrote:
> Hi,
>
> Thanks for your mail. I herewith attaching the bonding and eth
> configuration files. And on the /var/log/messages during the fence
> operation we can get the logs updated related to network only in the
> node which fences the other.

What IPs do the node names resolve to? I'm assuming bond1, but I would
like you to confirm.

> Server 1 Bond1: (Heartbeat)

I'm still not sure what you mean by heartbeat. Do you mean the channel
corosync is using?

> On the log messages,
>
> Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper
> Link is Down
> Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper
> Link is Down

This tells me both links dropped at the same time. These messages are
coming from below the cluster, though.

> Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status
> definitely down for interface eth3, disabling it
> Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without
> any active interface !
> Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status
> definitely down for interface eth4, disabling it

With both of the bond's NICs down, the bond itself is going to drop.

> Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper
> Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for
> interface eth3, 1000 Mbps full duplex.
> Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3
> the new active one.
> Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active
> interface up!
> Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper
> Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for
> interface eth4, 1000 Mbps full duplex.

I don't see any messages about the cluster in here, which I assume you
cropped out. In this case it doesn't matter, as the problem is well
below the cluster, but in general, please provide more data, not less.
You never know what might help. :)

Anyway, you need to sort out what is happening here. Bad drivers? A bad
card (assuming dual-port)? Something is taking the NICs down, as though
they were actually unplugged. If you can run them through a switch, it
might help isolate which node is causing the problems, as then you
would only see one node record "NIC Copper Link is Down" and can focus
on just that node.

-- 
Digimer
E-Mail:              digimer@xxxxxxxxxxx
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
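As a rough starting point for the checks suggested above, commands like
the following show what a node name resolves to, the bonding driver's
view of bond1 and its slaves, and the NIC driver details. This is only
a sketch: "filesrv1" is a guess at the peer node's name (only filesrv2
appears in the logs), and the eth3/eth4 interface names are taken from
the logs and may differ on the actual systems.

  # Which address (and so which interface) does the peer's name
  # resolve to? (filesrv1 is a guessed peer name)
  getent hosts filesrv1

  # The bonding driver's view of bond1 and its slave NICs
  cat /proc/net/bonding/bond1

  # Driver and firmware versions for a suspect NIC (bnx2 in the logs)
  ethtool -i eth3

  # Link state as the kernel currently sees it
  ethtool eth3 | grep 'Link detected'

  # Which ring address corosync is actually using for cluster traffic
  corosync-cfgtool -s

If the bond is in active-backup mode, /proc/net/bonding/bond1 also
records link failure counts per slave, which helps tell a flapping NIC
from a one-off event.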