On 12/30/2011 10:40 PM, SATHYA - IT wrote:
> Hi,
>
> Herewith attaching the logs and configuration files for ref. Kindly assist.
>
> Thanks

=================
Dec 25 11:11:26 filesrv1 corosync[9061]: [TOTEM ] A processor failed, forming new configuration.
Dec 25 11:11:26 filesrv1 kernel: bnx2 0000:03:00.1: eth3: NIC Copper Link is Down
Dec 25 11:11:26 filesrv1 kernel: bonding: bond1: link status definitely down for interface eth3, disabling it
Dec 25 11:11:26 filesrv1 kernel: bonding: bond1: making interface eth4 the new active one.
Dec 25 11:11:27 filesrv1 kernel: bnx2 0000:04:00.0: eth4: NIC Copper Link is Down
Dec 25 11:11:27 filesrv1 kernel: bonding: bond1: link status definitely down for interface eth4, disabling it
Dec 25 11:11:27 filesrv1 kernel: bonding: bond1: now running without any active interface !
Dec 25 11:11:28 filesrv1 corosync[9061]: [QUORUM] Members[1]: 1
Dec 25 11:11:28 filesrv1 corosync[9061]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 25 11:11:28 filesrv1 rgmanager[12538]: State change: clustsrv2 DOWN
Dec 25 11:11:28 filesrv1 corosync[9061]: [CPG ] chosen downlist: sender r(0) ip(10.0.0.10) ; members(old:2 left:1)
Dec 25 11:11:28 filesrv1 corosync[9061]: [MAIN ] Completed service synchronization, ready to provide service.
Dec 25 11:11:28 filesrv1 kernel: dlm: closing connection to node 2
Dec 25 11:11:28 filesrv1 kernel: GFS2: fsid=samba:ctdb.1: jid=0: Trying to acquire journal lock...
Dec 25 11:11:28 filesrv1 kernel: GFS2: fsid=samba:gen01.1: jid=0: Trying to acquire journal lock...
Dec 25 11:11:28 filesrv1 fenced[9120]: fencing node clustsrv2
=================

Do you have the servers directly connected to one another? I don't see the fence message until a full 2 seconds after the link dropped.

=================
Dec 25 03:30:06 filesrv2 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Dec 25 03:30:06 filesrv2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8660" x-info="http://www.rsyslog.com"] (re)start
Dec 25 11:14:56 filesrv2 kernel: imklog 4.6.2, log source = /proc/kmsg started.
Dec 25 11:14:56 filesrv2 rsyslogd: [origin software="rsyslogd" swVersion="4.6.2" x-pid="8811" x-info="http://www.rsyslog.com"] (re)start
Dec 25 11:14:56 filesrv2 kernel: Initializing cgroup subsys cpuset
Dec 25 11:14:56 filesrv2 kernel: Initializing cgroup subsys cpu
Dec 25 11:14:56 filesrv2 kernel: Linux version 2.6.32-220.el6.x86_64 (mockbuild@xxxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) (GCC) ) #1 SMP Wed Nov 9 08:03:13 EST 2011
Dec 25 11:14:56 filesrv2 kernel: Command line: ro root=/dev/mapper/vg_filesrv2-LogVol01 rd_LVM_LV=vg_filesrv2/LogVol01 rd_LVM_LV=vg_filesrv2/LogVol00 rd_NO_LUKS rd_NO_MD rd_NO_DM LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 KEYBOARDTYPE=pc KEYTABLE=us crashkernel=128M rhgb quiet acpi=off
Dec 25 11:14:56 filesrv2 kernel: KERNEL supported cpus:
=================

If this node failed, it failed hard, as nothing got written to the logs. Normally with network issues, you would expect to see "failed -> fence -> network down" on the survivor and at least some portion of this on the victim. That it just flat-out died tells me that something else took out the lost server, and what you saw from the cluster is the result of recovering from that loss.

I do see a crash on the second machine at 11:41:32 on Dec. 26, but there doesn't seem to be any corresponding data on the first node. Are the times in sync?

Lastly, I see the '[TOTEM ] Retransmit List:' bug on the first server, but not the second one. Are both nodes fully up to date? If they are, and if you have a RHEL subscription, it might be worth talking to your support contact.

In short, you seem to have multiple issues. I'm not entirely sure whether they're related; possibly not, which would make debugging tricky.
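To put a number on the "are the times in sync?" question, one rough approach is to sample both nodes' clocks at the same moment and compare. This is only a sketch: the `ssh` line in the comment uses the `filesrv1`/`filesrv2` hostnames from the logs, and the demo values at the bottom are made-up epoch timestamps, not data from the attached logs.

```shell
#!/bin/sh
# Print the absolute skew, in seconds, between two epoch timestamps.
skew() {
    if [ "$1" -ge "$2" ]; then
        echo $(( $1 - $2 ))
    else
        echo $(( $2 - $1 ))
    fi
}

# In practice, sample both nodes' clocks at (roughly) the same moment, e.g.:
#   skew "$(ssh filesrv1 date +%s)" "$(ssh filesrv2 date +%s)"
# More than a second or two of skew makes cross-node log comparison unreliable.
# Self-contained demo with two hard-coded samples, two seconds apart:
skew 1324811488 1324811486   # prints 2
```

For the "fully up to date" question, comparing `rpm -q corosync cman rgmanager kernel` output from both nodes would show whether they are running the same cluster-stack versions.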
Go through both servers' logs (the ones you attached here) and look closely at these issues. Investigate them and see where that takes you.

Cheers

--
Digimer
E-Mail:              digimer@xxxxxxxxxxx
Freenode handle:     digimer
Papers and Projects: http://alteeve.com
Node Assassin:       http://nodeassassin.org
"omg my singularity battery is dead again.
stupid hawking radiation." - epitron

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster