Hi,

I am herewith attaching the /var/log/messages of both servers. Yesterday (8th Jan) one of the servers was fenced by the other at around 10:48 AM. I am also attaching the cluster.conf file for your reference.

On the related note about the heartbeat - I am referring to the channel used by corosync, and the name configured in the cluster.conf file resolves to bond1 only.

Regarding the network cards, we are using two dual-port cards, with one port from each card in bond0 and the other port from each card in bond1. So it does not seem to be a network-card issue. Moreover, we are not seeing any errors related to bond0.

Thanks
Sathya Narayanan V
Solution Architect

-----Original Message-----
From: Digimer [mailto:linux@xxxxxxxxxxx]
Sent: Monday, January 09, 2012 10:54 AM
To: SATHYA - IT
Cc: 'linux clustering'
Subject: SPAM - Re: rhel 6.2 network bonding interface in cluster environment

On 01/09/2012 12:12 AM, SATHYA - IT wrote:
> Hi,
>
> Thanks for your mail. I am herewith attaching the bonding and eth
> configuration files. And in /var/log/messages, during the fence operation
> the network-related entries appear only on the node which fences the other.

What IPs do the node names resolve to? I'm assuming bond1, but I would like you to confirm.

> Server 1 Bond1: (Heartbeat)

I'm still not sure what you mean by heartbeat. Do you mean the channel corosync is using?

> On the log messages,
>
> Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down
> Jan 3 14:46:07 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down

This tells me both links dropped at the same time. These messages are coming from below the cluster, though.

> Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
> Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: now running without any active interface !
> Jan 3 14:46:07 filesrv2 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it

With both of the bond's NICs down, the bond itself is going to drop.

> Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth3, 1000 Mbps full duplex.
> Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: making interface eth3 the new active one.
> Jan 3 14:46:10 filesrv2 kernel: bonding: bond1: first active interface up!
> Jan 3 14:46:10 filesrv2 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
> Jan 3 14:46:10 filesrv2 kernel: bond1: link status definitely up for interface eth4, 1000 Mbps full duplex.

I don't see any messages about the cluster in here, which I assume you cropped out. In this case it doesn't matter, as the problem is well below the cluster, but in general, please provide more data, not less. You never know what might help. :)

Anyway, you need to sort out what is happening here. Bad drivers? A bad card (assuming dual-port)? Something is taking the NICs down, as though they were actually unplugged. If you can run them through a switch, it might help isolate which node is causing the problems, as then you would only see one node record "NIC Copper Link is Down" and could focus on just that node.
--
Digimer
E-Mail:              digimer@xxxxxxxxxxx
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again. stupid hawking radiation." - epitron
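For readers of the archive, a minimal sketch of the checks discussed above, assuming the bnx2 slaves eth3/eth4 in bond1 and the node names filesrv1/filesrv2 that appear in the logs and attachments (adjust interface and host names to match the actual configuration):

  # Confirm which address each cluster node name resolves to (expected: the bond1 IP)
  getent hosts filesrv1 filesrv2
  ip addr show bond1

  # Bond state: mode, active slave, MII status, and link failure count per slave
  cat /proc/net/bonding/bond1

  # Per-NIC link state, driver/firmware version, and error counters for the bond1 slaves
  ethtool eth3 ; ethtool eth4
  ethtool -i eth3
  ethtool -S eth3 | grep -iE 'err|drop'

  # Pull every link flap for the bond1 slaves out of the logs
  grep -E 'eth3|eth4|bond1' /var/log/messages

If both slaves show link failures at the same timestamps on only one node, the fault is most likely in that node's cabling, card, or driver rather than in the bonding or cluster configuration.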
Attachment:
cluster.conf
Description: Binary data
Attachment:
messages_filesrv1
Description: Binary data
Attachment:
messages_filesrv2
Description: Binary data
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster