Hi Jakov

I am running Debian Lenny 64-bit. Is that going to be a problem for me?

I think you have given me enough of a pointer - i.e. I haven't configured fencing properly - to get me going again. Thanks.

regards,
Martin

====

Just out of interest, here are the logs:

Here is the syslog from clusternode28 when I suspended clusternode30:

Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: <debug> Membership Change Event
Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: <info> State change: clusternode30 DOWN
Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: <debug> Membership Change Event
Oct 26 18:29:51 clusternode28 clurgmgrd[3980]: <debug> Membership Change Event
Oct 26 18:29:51 clusternode28 fenced[16118]: fencing deferred to clusternode27

Then, on clusternode27:

Oct 26 18:29:52 clusternode27 kernel: [438082.708458] dlm: closing connection to node 30
Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: <debug> Membership Change Event
Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: <info> State change: clusternode30 DOWN
Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: <debug> Membership Change Event
Oct 26 18:29:52 clusternode27 clurgmgrd[20955]: <debug> Membership Change Event
Oct 26 18:29:52 clusternode27 fenced[12749]: clusternode30 not a cluster member after 0 sec post_fail_delay
Oct 26 18:29:52 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 26 18:29:52 clusternode27 fenced[12749]: fence "clusternode30" failed
Oct 26 18:29:57 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 26 18:29:57 clusternode27 fenced[12749]: fence "clusternode30" failed
Oct 26 18:30:02 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 26 18:30:02 clusternode27 fenced[12749]: fence "clusternode30" failed
... and so on ...

I haven't configured fencing properly, have I?

<clusternode name="clusternode30" nodeid="30">
    <multicast addr="224.0.0.1" interface="eth0:1"/>
    <fence>
        <!-- Handle fencing manually -->
        <method name="human">
            <device name="human" nodename="hostname1"/>
        </method>
    </fence>
</clusternode>

When I un-suspended clusternode30 (15 hours later), cman on clusternode27 threw an error and quit:

Oct 27 10:50:01 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 27 10:50:01 clusternode27 fenced[12749]: fence "clusternode30" failed
Oct 27 10:50:05 clusternode27 clurgmgrd[20955]: <debug> Membership Change Event
Oct 27 10:50:05 clusternode27 clurgmgrd[20955]: <debug> Membership Change Event
Oct 27 10:50:06 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 27 10:50:06 clusternode27 fenced[12749]: fence "clusternode30" failed
Oct 27 10:50:11 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 27 10:50:11 clusternode27 fenced[12749]: fence "clusternode30" failed
Oct 27 10:50:16 clusternode27 fenced[12749]: fencing node "clusternode30"
Oct 27 10:50:16 clusternode27 fenced[12749]: fence "clusternode30" failed
Oct 27 10:50:20 clusternode27 openais[12741]: CMAN: Joined a cluster with disallowed nodes. must die
Oct 27 10:50:20 clusternode27 kernel: [496910.220602] dlm: closing connection to node 28
Oct 27 10:50:20 clusternode27 kernel: [496910.220710] dlm: closing connection to node 27
Oct 27 10:50:20 clusternode27 dlm_controld[12751]: cluster is down, exiting
Oct 27 10:50:20 clusternode27 gfs_controld[12753]: groupd_dispatch error -1 errno 11
Oct 27 10:50:20 clusternode27 gfs_controld[12753]: groupd connection died
Oct 27 10:50:20 clusternode27 gfs_controld[12753]: cluster is down, exiting
Oct 27 10:50:47 clusternode27 ccsd[12736]: Unable to connect to cluster infrastructure after 30 seconds.

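For reference, here is a minimal sketch of how manual fencing is usually wired up in cluster.conf, assuming the fence_manual agent from the Red Hat cluster suite is installed. The device name "human" is taken from the fence block quoted above. The <fencedevices> section is an assumption, since that part of the file is not shown in this message, and nodename is assumed to be the name of the node to be fenced rather than a placeholder. This is something to compare against, not a confirmed diagnosis.

```xml
<!-- Sketch only: per-node fence method plus the matching device definition. -->
<clusternode name="clusternode30" nodeid="30">
    <fence>
        <!-- Handle fencing manually -->
        <method name="human">
            <!-- Assumption: nodename names the node that is to be fenced. -->
            <device name="human" nodename="clusternode30"/>
        </method>
    </fence>
</clusternode>

<!-- Assumption: declared once inside <cluster>, next to <clusternodes>.
     The name attribute must match the <device name="..."> used above. -->
<fencedevices>
    <fencedevice name="human" agent="fence_manual"/>
</fencedevices>
```

With manual fencing, fenced normally waits for an operator to confirm that the failed node is really down (for example with fence_ack_manual) before the fence operation is treated as successful, so repeated "fence failed" lines may mean that either the device definition or the acknowledgement step is missing.
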
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jakov Sosic
Sent: 27 October 2009 09:38
To: linux-cluster@xxxxxxxxxx
Subject: Re: service state unchanged when host crashes

On Mon, 26 Oct 2009 17:40:24 -0000
"Martin Waite" <Martin.Waite@xxxxxxxxxxxx> wrote:

> Hi,
>
> I have 3 VMs running in a cluster. 4 services are defined, one of
> which ("SENTINEL") is running on clusternode30.
>
> I then suspended clusternode30 in the VM console. Cman notices the
> disappearance within a few seconds. However, the SENTINEL service
> that was running is still flagged as "started".

Could you please post your /var/log/messages when one node is fenced?
Also, are you using Debian/Ubuntu by any chance?

--
| Jakov Sosic | ICQ: 28410271 | PGP: 0x965CAE2D |
=================================================================
| start fighting cancer -> http://www.worldcommunitygrid.org/ |

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster