On Tue, 2006-11-07 at 12:29 -0800, aberoham@xxxxxxxxx wrote:
>
> Last night one of my five cluster nodes suffered a hardware failure
> (memory, cpu?). The other nodes properly fenced the failed machine,
> but no matter what clusvcadm command I ran, I could not get the other
> cluster members to start, stop or disable the cluster resource
> group/service that had been running on the failed node. (The resource
> group/service that was running on the failed node includes an EXT3 fs,
> an IP address, an rsyncd and a smbd init script.)
>
> The "clusvcadm -d [service]" command would just hang for minutes and
> not return. "clustat" initially reported the rg/service in an unknown
> state, then stopped reporting rgmanager status and only showed cman
> status. The cluster remained quorate the entire time. Resource
> groups/services on non-failed nodes continued to run, but no matter
> what I tried I could not get rgmanager status on any node.
>
> I had to reset the entire cluster to get things back to normal. (This
> is a heavily used operational system so I didn't have time to do
> further debugging.)
> My logs don't show any rgmanager related error
> messages, only fencing status:
>
> Nov 6 20:24:37 bamf02 kernel: CMAN: removing node bamf03 from the
> cluster : Missed too many heartbeats
> Nov 6 20:24:38 bamf02 fenced[5913]: fencing deferred to bamf01
> ---
> Nov 6 20:24:37 bamf01 kernel: CMAN: node bamf03 has been removed from
> the cluster : Missed too many heartbeats
> Nov 6 20:24:38 bamf01 fenced[5756]: bamf03 not a cluster member after
> 0 sec post_fail_delay
> Nov 6 20:24:38 bamf01 fenced[5756]: fencing node "bamf03"
> Nov 6 20:24:46 bamf01 fenced[5756]: fence "bamf03" success
> Nov 6 20:30:36 bamf01 sshd(pam_unix)[27244]: session opened for user
> root by root(uid=0)
> Nov 6 20:36:29 bamf01 kernel: CMAN: node bamf03 rejoining
> Nov 6 20:42:55 bamf01 shutdown: shutting down for system reboot
> ---
>
> I'm running RHEL4U4 (cman 1.0.11-0, cman-kernel-smp 2.6.9-45.5, dlm
> 1.0.1-1, magma 1.0.6-0 rgmanager 1.9.53) on x86_64 hardware.

What did "cman_tool status" report? Did rgmanager crash (did "service
rgmanager status" report it as dead)? Was there anything in dmesg
indicating a DLM error?

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
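
The three checks asked about in the reply (cman_tool status, service rgmanager status, and dmesg output mentioning DLM) could be captured in one pass on a surviving node before anything is reset. What follows is only a rough sketch under stated assumptions, not a procedure verified against this cluster: the output path /tmp/cluster-diag.log and the run helper are hypothetical names chosen for illustration, and a tool that is not installed is simply noted in the log.

```shell
#!/bin/sh
# Sketch: save the state asked about in the reply before resetting the cluster.
# OUT is a hypothetical location; override it in the environment if needed.
OUT="${OUT:-/tmp/cluster-diag.log}"
: > "$OUT"    # start with an empty log

# run (hypothetical helper): record the command line, then its output.
# A missing tool or a non-zero exit is noted in the log, not treated as fatal.
run() {
    echo "### $*" >> "$OUT"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" >> "$OUT" 2>&1 || echo "(exit status $?)" >> "$OUT"
    else
        echo "(command not found: $1)" >> "$OUT"
    fi
}

run cman_tool status            # membership and quorum as cman sees them
run service rgmanager status    # is rgmanager running, or reported dead?

# Kernel messages that mention DLM, if any.
echo "### dmesg | grep -i dlm" >> "$OUT"
dmesg 2>/dev/null | grep -i dlm >> "$OUT" || echo "(no DLM lines found)" >> "$OUT"

echo "diagnostics written to $OUT"
```

Running this on each remaining member and attaching the resulting files to a follow-up gives the list something concrete to compare across nodes.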