Thanks for the response, and sorry for the delay. An issue unexpectedly took
me away from the office, and I am just getting back to this now.

Yes, the MAC addresses were all updated after the cloning; a quick sketch of
where that update lives is below, for reference. After that are sections of
the log files from each cluster node at the time of a fence, taken from my notes.
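The update was just a matter of making the HWADDR lines in the ifcfg files
match each node's real NICs. The values here are illustrative rather than the
real ones:

# /etc/sysconfig/network-scripts/ifcfg-eth0 on the rebuilt node
# HWADDR has to match this node's actual NIC; otherwise the CentOS 5
# initscripts complain that the device has a different MAC than expected
DEVICE=eth0
HWADDR=00:11:22:33:44:55
BOOTPROTO=static
IPADDR=140.90.91.242
NETMASK=255.255.255.0
ONBOOT=yes

And the nfs2-cluster log around the time of the fence: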
Feb 10 15:17:48 nfs2-cluster clurgmgrd[4280]:<notice> Resource Group Manager Starting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice> Shutting down Cluster Service Manager...
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice> Shutdown complete, exiting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice> Cluster Service Manager is stopped.
Feb 10 15:18:23 nfs2-cluster ccsd[2989]: Stopping ccsd, SIGTERM received.
Feb 10 15:18:23 nfs2-cluster NAMC
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading all openais components
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_confdb v0 (19/10)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cpg v0 (18/8)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cfg v0 (17/7)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_msg v0 (16/6)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_lck v0 (15/5)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evt v0 (14/4)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_amf v0 (12/2)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_clm v0 (11/1)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evs v0 (10/0)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cman v0 (9/9)
Feb 10 15:18:23 nfs2-cluster gfs_controld[3077]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster dlm_controld[3071]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster fenced[3065]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 1

The corresponding section from nfs1-cluster:

Feb 10 15:17:34 nfs1-cluster ntpd[3765]: synchronized to LOCAL(0), stratum 10
Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice> Member 2 shutting down
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 2.
Feb 10 15:18:34 nfs1-cluster ntpd[3765]: synchronized to 132.236.56.250, stratum 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 0.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Creating commit token because I am the rep.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Saving state aru 230 high seq received 230
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Storing new sequence id for ring 1f80
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering COMMIT state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering RECOVERY state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] position [0] member 140.90.91.240:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] previous ring seq 8060 rep 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] aru 230 high delivered 230 received flag 1
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Did not need to originate any messages in recovery.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Sending initial ORF token
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.242)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [SYNC ] This node is within the primary component and will provide service.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering OPERATIONAL state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM ] got nodejoin message 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CPG ] got joinlist message from node 1

I was also seeing a number of these messages, but they stopped after upgrading openais:
nfs2-cluster openais[3012]: [TOTEM] Retransmit List: 1df3

Yes, the nodes are plugged into managed switches. I will try to run the tcpdump as soon as I can; something like the capture sketched below is what I have in mind. Unfortunately, that means I have to let it crash again to get what I need, and my users are already annoyed by the downtime we've had. I know this isn't the best solution for our needs, but given the lack of funding, it seemed like a good idea at the time.
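This is roughly what I'm planning to run on both nodes until the next fence,
assuming the cluster traffic is on eth0 and cman is using its stock UDP ports
(5404/5405); corrections welcome:

# run on each node; capture totem/cman traffic plus IGMP, in case the
# managed switches are dropping us from the multicast group
tcpdump -i eth0 -n -s 0 -w /tmp/totem-$(hostname -s).pcap \
    '(udp port 5404 or udp port 5405) or igmp'

That should at least show whether the multicast traffic stops on one side
before the token is declared lost, or keeps flowing the whole time.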
Thanks for the help!
Randy
On 02/14/2011 09:03 AM, Digimer wrote:
On 02/14/2011 08:53 AM, Randy Brown wrote:
Hello,
I am running a 2 node cluster being used as a NAS head for a Lefthand
Networks iSCSI SAN to provide NFS mounts out to my network. Things have
been OK for a while, but I recently lost one of the nodes as a result of
a patching problem. In an effort to recreate the failed node, I imaged
the working node and installed that image on the failed node. I set
its hostname and IP settings correctly and the machine booted and
joined the cluster just fine. Or at least it appeared so. Things ran
OK for the last few weeks, but I recently started seeing a behavior
where the nodes start fencing each other. I'm wondering if there is
something as a result of cloning the nodes that could be the problem.
Possibly something that should be different but isn't because of the
cloning?
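For what it's worth, by "joined just fine" I mean the usual checks looked
clean after the rebuilt node came back up; roughly this (a sketch from memory):

cman_tool status    # quorate, with the expected number of nodes
cman_tool nodes     # both nodes listed with status M (member)
clustat             # rgmanager sees both members and the NFS service running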
I am running CentOS 5.5 with the following package versions:
Kernel - 2.6.18-194.11.3.el5 #1 SMP
cman-2.0.115-34.el5_5.4
lvm2-cluster-2.02.56-7.el5_5.4
gfs2-utils-0.1.62-20.el5
kmod-gfs-0.1.34-12.el5.centos
rgmanager-2.0.52-6.el5.centos.8
I have a QLogic qla4062 HBA in the node, running the QLogic iSCSI HBA Driver
(f8b83000) v5.01.03.04.
I will gladly provide more information as needed.
Thank you,
Randy
Silly question, but are the NICs mapped to their MAC addresses? If so,
did you update the MAC addresses after cloning the server to reflect the
actual MAC addresses? Assuming so, do you have managed switches? If so,
can you test by swapping in a simple, unmanaged switch?
This sounds like a multicast issue at some level. Fencing happens once
the totem ring is declared failed. Do you see anything interesting in
the log files prior to the fence? Can you run tcpdump to see what is
happening on the interface(s) prior to the fence?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster