RE: service state unchanged when host crashes

"Martin Waite" <Martin.Waite@xxxxxxxxxxxx> · Wed, 28 Oct 2009 10:21:02 -0000

Hi Jakov,

I managed to get fencing working - at least enough for my experiments.  

Sure enough, I hit the same problem:

clusternode30 is running service "SENTINEL" - and then is powered down
at ~ 18:19

Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <debug> Membership Change Event
Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <info> State change: clusternode30 DOWN
Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <debug> Membership Change Event
Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <debug> Membership Change Event
Oct 27 18:19:55 clusternode27 fenced[2760]: clusternode30 not a cluster member after 0 sec post_fail_delay
Oct 27 18:19:55 clusternode27 fenced[2760]: fencing node "clusternode30"
Oct 27 18:19:55 clusternode27 fenced[2760]: can't get node number for node p<CA>@#001
Oct 27 18:19:55 clusternode27 fenced[2760]: fence "clusternode30" success
Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <debug> 22 rules loaded

(The "can't get node number" message looks suspicious, but fenced claims
to have succeeded.)

Next morning - it still hasn't relocated the service:

Cluster Status for testcluster @ Wed Oct 28 11:13:55 2009
Member Status: Quorate

 Member Name        ID   Status
 ------ ----        ---- ------
 clusternode27        27 Online, Local, rgmanager
 clusternode28        28 Online, rgmanager
 clusternode30        30 Offline

 Service Name                    Owner (Last)     State
 ------- ----                    ----- ------     -----
 service:SCRIPT                  clusternode28    started
 service:SENTINEL                clusternode30    started
 service:VIP                     clusternode27    started
 service:mysql_authdb_service    clusternode27    started
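
The symptom in the status output above can be stated as a simple check: a
service is still reported as "started" while its owner is Offline. The
following is only a sketch of that check in Python. The dictionaries are
transcribed by hand from the clustat output, and the function name
stranded_services is made up for illustration; it is not part of any
cluster tool.

```python
# Member and service state, transcribed by hand from the clustat output above.
members = {
    "clusternode27": "Online",
    "clusternode28": "Online",
    "clusternode30": "Offline",
}
services = {
    "service:SCRIPT": ("clusternode28", "started"),
    "service:SENTINEL": ("clusternode30", "started"),
    "service:VIP": ("clusternode27", "started"),
    "service:mysql_authdb_service": ("clusternode27", "started"),
}


def stranded_services(members, services):
    """Return services reported as started on a member that is not online.

    Hypothetical helper: it only inspects the two dictionaries passed in.
    """
    return sorted(
        name
        for name, (owner, state) in services.items()
        if state == "started" and members.get(owner) != "Online"
    )


print(stranded_services(members, services))  # prints ['service:SENTINEL']
```

Only SENTINEL is flagged, which matches what I see: its owner was powered
down, yet its state has not changed.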

I am going to strip my config down later on so that SENTINEL is the only
running service.   My fencing mechanism is pretty pathetic - I have
added a new fence agent that does nothing but always succeeds (which I
hope is enough for this stage in my education) - but my understanding is
that the sequence of events should be something like this:

1. <node fails>
2. cman (and groupd) notices
3. fencing is applied to the node
4. the service is relocated - or marked as failed
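
To compare that expected sequence with what was actually logged, here is a
rough Python sketch that looks for steps 1 to 3 in order. The log text is
copied from the excerpt earlier in this message (with the wrapped lines
joined), and the function name observed_steps and the step labels are my
own invention. Step 4 is deliberately not matched, because nothing in
these logs shows what a relocation message would look like.

```python
import re

# Log lines copied from the excerpt above, with wrapped lines joined.
LOG = """\
Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <info> State change: clusternode30 DOWN
Oct 27 18:19:55 clusternode27 fenced[2760]: fencing node "clusternode30"
Oct 27 18:19:55 clusternode27 fenced[2760]: fence "clusternode30" success
Oct 27 18:19:55 clusternode27 clurgmgrd[2785]: <debug> 22 rules loaded
"""


def observed_steps(log_text, node):
    """Return the labels of the expected steps found, in order, for `node`.

    Hypothetical helper: the patterns are taken from the messages quoted in
    this thread and may not match other versions of the daemons.
    """
    name = re.escape(node)
    patterns = [
        ("membership change noticed", rf"State change: {name} DOWN"),
        ("fencing started", rf'fencing node "{name}"'),
        ("fencing reported success", rf'fence "{name}" success'),
    ]
    seen = []
    pos = 0
    for label, pattern in patterns:
        # Search only after the previous match so the order is enforced.
        match = re.compile(pattern).search(log_text, pos)
        if match is None:
            break
        seen.append(label)
        pos = match.end()
    return seen


print(observed_steps(LOG, "clusternode30"))
```

For the excerpt above, all three labels are returned, so the first three
steps did happen in the expected order; it is the fourth that never
follows.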

regards,
Martin

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx
[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jakov Sosic
Sent: 28 October 2009 10:04
To: linux-cluster@xxxxxxxxxx
Subject: Re:  service state unchanged when host crashes

On Tue, 27 Oct 2009 09:57:50 -0000
"Martin Waite" <Martin.Waite@xxxxxxxxxxxx> wrote:

> I am running Debian Lenny 64-bit.   Is that going to be a problem for
> me ?

Well, maybe. The last time I tried Red Hat Cluster Suite on Debian Lenny
was two months ago, and I stumbled upon the following bug:

http://www.mail-archive.com/linux-cluster@xxxxxxxxxx/msg06018.html

I don't know if they have fixed that bug... but it closely resembles
your problem... Node goes down, node gets fenced, service is seen as
down by rgmanager, but there is no action to relocate it to a live
cluster member. That was at the start of a project for me, so after that
I migrated to CentOS 5 (which is a free RHEL fork).

> I think you have given me enough of a pointer - ie. I haven't
> configured fencing properly - to get me going again.  Thanks.

I can see that from the logs now :) If you get to the point where the
bug that I described earlier pops up, please share that information here
so that we know the state of RHCS on Debian.

-- 
|    Jakov Sosic    |    ICQ: 28410271    |   PGP: 0x965CAE2D   |
=================================================================
| start fighting cancer -> http://www.worldcommunitygrid.org/   |

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
