Thistle, Scott wrote:
I am having the same issue. If a blade is not present (i.e. removed for
maintenance), the fence_bladecenter cannot check the state as it is
reported empty. I think it is something simple to fix for those versed
in Perl. Normally the fence only runs against a blade that is present.
If the blade is removed while running, you run into this issue.
I believe this is what you want to happen...if state cannot be checked,
fenced keeps trying. How could you determine it was safe to stop without
persisting some value like the number of fence tries, and trying to
reason out whether it was safe to stop? This will not happen if you
remove the blade from the cluster before physically removing it. It is a
snap to do this with one of the UIs, if you are not prejudiced against
UIs :).
Also, removing the node from cluster membership before jerking it out of
the rack tells rgmanager to move any services off of it - rather than
having to depend on heartbeat failure to make this happen.
That said, if the blade catches fire and a cage IT guy notices and jerks
it quick, (using his IT Oven Mitt, of course) it is silly for fenced to
keep incessantly trying when the thing no longer even exists. Perhaps
the correct solution would be to have the fence_bladecenter report
success if the bladecenter admin unit reports that 'no status is
available' for a particular blade - obviously if the thing is not there,
it should be safe to say it is fenced :)
If this addresses your situation (I think it does), now would be a
REALLY good time to file a ticket requesting this behavior - like today!
I'll post a fixed version to the ticket when it is ready.
Thanks to Lon for discussing this with me...;)
Regards,
-Jim
My case below. Blade #3 is a good node. Blade #2 was removed. The fence
does not work with the blade removed.
system> env -T system:blade[3]
OK
system:blade[3]> power -state
On
system:blade[3]> env -T system:blade[2]
The target bay is empty.
system:blade[3]> env -T system:blade[1]
OK
system:blade[1]>
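The session above shows the management module answering "The target bay
is empty." for the removed blade. What follows is only a minimal sketch
of the suggestion to treat that reply as a successful fence, written in
Python rather than the agent's Perl. The function name fence_outcome, its
return values, and the "Off" power reply are assumptions made for
illustration; none of them come from fence_bladecenter itself.

```python
from typing import Optional

# Reply text copied from the session above; other firmware levels may word it
# differently (assumption: the string is stable enough to match on).
EMPTY_BAY_REPLY = "The target bay is empty."


def fence_outcome(env_reply: str, power_state: Optional[str] = None) -> str:
    """Classify management-module replies for one blade (hypothetical helper).

    Returns "fenced" when it would be safe to report success, "not_fenced"
    when the blade is still powered on, and "unknown" when the state cannot
    be determined and the caller should keep trying.
    """
    if EMPTY_BAY_REPLY in env_reply:
        # Proposed behaviour: a blade that is not in the chassis cannot be
        # running cluster services, so count it as fenced.
        return "fenced"
    if env_reply.strip() != "OK":
        # Anything other than "OK" from "env -T" is treated as undetermined.
        return "unknown"
    state = (power_state or "").strip().lower()
    if state == "off":  # "Off" is assumed; the session only shows "On"
        return "fenced"
    if state == "on":
        return "not_fenced"
    return "unknown"
```

With the replies from the session, fence_outcome("The target bay is
empty.") yields "fenced" for blade 2, while fence_outcome("OK", "On")
yields "not_fenced" for blade 3.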
-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx
[mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of James Parsons
Sent: Thursday, July 12, 2007 12:33 PM
To: linux clustering
Subject: Re: Problem with fenced on cluster with 2
BladeCenter machines: 1st machine is remove physically. The remaining one
does not became Active (waiting for fenced)
catalin.lupescu@xxxxxxxx wrote:
Hello!
I have a Red Hat cluster made up of 2 nodes, both IBM blades in a
BladeCenter chassis.
(fenced version 1.32.6)
I ran the following test:
I physically removed the node 1 machine (the active one).
The second node never became the active one. The "clustat" command does
not print any information.
In /var/log/messages we find the following messages (repeated):
Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter"
reports: pattern match timed-out at /sbin/fence_bladecenter line 185
Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed
If node 1 is plugged in, node 2 becomes active (fenced OK).
bz#240509 changed the sleep timeout in the bladecenter agent from 5 to
10...this is on or about line 193 in /sbin/fence_bladecenter. See what
yours is set at, and try pushing it out a bit. This minor change is
making its way through the distribution chain now.
-j
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster