Re: Problem with fenced on cluster with 2 BladeCenter machines: 1st machine is removed physically. The remaining one does not become Active (waiting for fenced)

James Parsons <jparsons@xxxxxxxxxx> · Thu, 12 Jul 2007 11:43:31 -0400

Thistle, Scott wrote:

I am having the same issue. If a blade is not present (i.e. removed for
maintenance), fence_bladecenter cannot check its state because the bay is
reported empty. I think it is something simple to fix for those versed
in Perl. Normally the fence agent only runs against a blade that is present.
If the blade is removed while running, you run into this issue.

I believe this is what you want to happen: if the state cannot be checked,
fenced keeps trying. How could you determine it was safe to stop without
persisting some value like the number of fence tries, and trying to
reason out whether it was safe to stop? This will not happen if you
remove the blade from the cluster before physically removing it. It is a
snap to do this with one of the UIs, if you are not prejudiced against
UIs :).

Also, removing the node from cluster membership before jerking it out of
the rack tells rgmanager to move any services off of it, rather than
having to depend on heartbeat failure to make this happen.

That said, if the blade catches fire and a cage IT guy notices and jerks 
it quick, (using his IT Oven Mitt, of course) it is silly for fenced to 
keep incessantly trying when the thing no longer even exists. Perhaps 
the correct solution would be to have the fence_bladecenter report 
success if the bladecenter admin unit reports that 'no status is 
available' for a particular blade - obviously if the thing is not there, 
it should be safe to say it is fenced :)
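
The idea can be sketched as a small decision function. This is only an illustration in Python; the real fence_bladecenter agent is a Perl script and its internals are not shown in this thread. The function name blade_is_safely_fenced is hypothetical, and the "Off" reply is an assumption (only "On", "OK" and "The target bay is empty." appear in the transcript quoted in this thread).

```python
from typing import Optional

# Reply printed by the management module for an empty bay,
# as quoted in this thread.
EMPTY_BAY_MESSAGE = "The target bay is empty."


def blade_is_safely_fenced(env_reply: str,
                           power_state: Optional[str] = None) -> bool:
    """Hypothetical helper: decide whether a blade can be treated as fenced.

    env_reply   -- text returned by `env -T system:blade[N]`
    power_state -- text returned by `power -state`, or None if not queried
    """
    if EMPTY_BAY_MESSAGE in env_reply:
        # No blade in the bay, so nothing is left that could
        # touch shared resources: report success.
        return True
    if power_state is not None and power_state.strip().lower() == "off":
        # Assumption: the module prints "Off" for a powered-down blade.
        return True
    # Blade is present and not confirmed off: the caller keeps trying.
    return False
```

With the replies from the transcript in this thread, the empty bay for blade[2] would count as fenced, while blade[3] ("OK", then "On") would not.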

If this addresses your situation (I think it does), now would be a
REALLY good time to file a ticket requesting this behavior - like today!
I'll post a fixed version to the ticket when it is ready.

Thanks to Lon for discussing this with me...;)

Regards,

-Jim

My case below. Blade #3 is a good node. Blade #2 was removed. The fence
does not work with the blade removed.

system> env -T system:blade[3]
OK
system:blade[3]> power -state
On
system:blade[3]> env -T system:blade[2]
The target bay is empty. 
system:blade[3]> env -T system:blade[1]
OK
system:blade[1]>

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx
[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of James Parsons
Sent: Thursday, July 12, 2007 12:33 PM
To: linux clustering
Subject: Re:  Problem with fenced on cluster with 2
BladeCentermachines: 1st machine is remove physically. The remaining one
doesnot became Active (waiting for fenced)

catalin.lupescu@xxxxxxxx wrote:

Hello!

I have a Red Hat cluster made of 2 nodes, IBM blades in a BladeCenter
chassis.
(fenced version 1.32.6)

I have done the following test:
I physically removed the node 1 machine (the Active one).
The second one never became the active one. The "clustat" command does not
print any information.
In /var/log/messages we can find the following messages (repeated):

Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" 
reports: pattern match timed-out at /sbin/fence_bladecenter line 185 
Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed

If node 1 is plugged in, node 2 becomes Active (fenced OK).

bz#240509 changed the sleep timeout in the bladecenter agent from 5 to
10. This is on or about line 193 in /sbin/fence_bladecenter. See what
yours is set to, and try pushing it out a bit. This minor change is
making its way through the distribution chain now.

-j

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
