Hi,

Thanks for your mail. I am attaching the bonding and eth configuration files herewith. During the fence operation, network-related entries appear in /var/log/messages only on the node that fences the other.

Server 1 NIC 1 (eth2): /etc/sysconfig/network-scripts/ifcfg-eth2
DEVICE="eth2"
HWADDR="3C:D9:2B:04:2D:7A"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond0
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 NIC 4 (eth5): /etc/sysconfig/network-scripts/ifcfg-eth5
DEVICE="eth5"
HWADDR="3C:D9:2B:04:2D:80"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond0
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 NIC 2 (eth3): /etc/sysconfig/network-scripts/ifcfg-eth3
DEVICE="eth3"
HWADDR="3C:D9:2B:04:2D:7C"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond1
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 NIC 3 (eth4): /etc/sysconfig/network-scripts/ifcfg-eth4
DEVICE="eth4"
HWADDR="3C:D9:2B:04:2D:7E"
NM_CONTROLLED="no"
ONBOOT="yes"
MASTER=bond1
SLAVE=yes
USERCTL=no
BOOTPROTO=none

Server 1 Bond0 (public access): /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
BOOTPROTO=static
IPADDR=192.168.129.10
NETMASK=255.255.255.0
GATEWAY=192.168.129.1
USERCTL=no
ONBOOT=yes
BONDING_OPTS="miimon=100 mode=0"

Server 1 Bond1 (heartbeat): /etc/sysconfig/network-scripts/ifcfg-bond1
DEVICE=bond1
BOOTPROTO=static
IPADDR=10.0.0.10
NETMASK=255.0.0.0
USERCTL=no
ONBOOT=yes
BONDING_OPTS="miimon=100 mode=1"

From /var/log/messages:

Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down
Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down
Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any active interface !
Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex.
Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one.
Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up!
Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex.
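For reference, the bonding driver keeps per-slave link-failure counters that show how often this flap recurs, and ethtool confirms what each NIC itself reports. A minimal check, using the bond1/eth3/eth4 names from the configs above (adjust if your layout differs):

# Bond mode, MII status, currently active slave and per-slave link failure counts
cat /proc/net/bonding/bond1

# Physical link state and negotiated speed/duplex as seen by each slave NIC
ethtool eth3
ethtool eth4

# How many times the flap has been logged on this node
grep -c "bond1: link status definitely down" /var/log/messages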
Thanks
Sathya Narayanan V
Solution Architect

-----Original Message-----
From: Digimer [mailto:linux@xxxxxxxxxxx]
Sent: Monday, January 09, 2012 10:27 AM
To: linux clustering
Cc: SATHYA - IT
Subject: Re: rhel 6.2 network bonding interface in cluster environment

On 01/08/2012 11:37 PM, SATHYA - IT wrote:
> Hi,
>
> We have configured a RHEL 6.2 two-node cluster with clvmd + gfs2 + cman +
> smb. Each server has 4 NICs: two are bonded for the heartbeat (mode=1)
> and two are bonded for public access (mode=0). The heartbeat network is
> connected directly from server to server. Once every 3-4 days the
> heartbeat goes down and comes back up automatically within 2-3 seconds.
> We are not sure why this happens, and because of it one node gets fenced
> by the other.
>
> Is there any way to increase the time the cluster waits for the
> heartbeat? That is, if the cluster can wait 5-6 seconds, then even if the
> heartbeat fails for 5-6 seconds the node won't get fenced. Kindly advise.

"mode=1" is Active/Passive and I use it extensively with no trouble. I'm not sure where "heartbeat" comes from, but I might be missing the obvious. Can you share your bond and eth configuration files here, please (as plain-text attachments)?

Secondly, make sure that you are actually using that interface/bond. Run 'gethostip -d <nodename>', where "nodename" is what you set in cluster.conf. The returned IP will be the one used by the cluster.

Back to the bond: a failed link should transfer to the backup link almost instantly. So if you are going down for 2-3 seconds on both links, something else is happening. Look at syslog on both nodes around the time the last fence happened and see what is logged just prior to the fence. That might give you a clue.

--
Digimer
E-Mail: digimer@xxxxxxxxxxx
Freenode handle: digimer
Papers and Projects: http://alteeve.com
Node Assassin: http://nodeassassin.org
"omg my singularity battery is dead again. stupid hawking radiation." - epitron
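Regarding the original question about tolerating a short heartbeat outage: on the RHEL 6 cman/corosync stack the usual knob is the totem token timeout in /etc/cluster/cluster.conf, given in milliseconds. A minimal sketch only, assuming a standard two-node cman setup; the cluster name, node names and the 30000 ms value below are illustrative, not taken from the configuration posted above:

<?xml version="1.0"?>
<cluster name="filesrvcluster" config_version="2">
  <!-- Declare a node dead only after 30 s without the totem token (value in ms) -->
  <totem token="30000"/>
  <!-- Standard two-node quorum settings -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="filesrv1" nodeid="1"/>
    <clusternode name="filesrv2" nodeid="2"/>
  </clusternodes>
  <!-- Existing fencedevices and rm sections would remain unchanged -->
</cluster>

After editing, increment config_version and propagate the file, typically with 'cman_tool version -r' on one node. Note that a longer token only makes the cluster more patient; it does not explain why both bond1 slaves lose carrier at the same instant, which is the underlying problem to chase.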