We ran into the same problem and ended up writing a new fence_ilo method that sends a power reset via ILO. With a power reset, nothing happens if the node is powered down; if it is powered up, it is powered down and then up again. HP has a PDF document describing the ILO interface; take a look at it.
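To make the reset semantics concrete, here is a minimal sketch in Python. It is not the method we actually wrote and it does not talk to a real ILO card: FakeIlo, fence_with_reset, and the can_power_on and reachable flags are hypothetical names used only to model the behaviour described above. The assumption is that the agent reports success as soon as the reset request is accepted, without checking that the node comes back up.

```python
class FakeIlo:
    """Hypothetical stand-in for an ILO card (not the real HP interface)."""

    def __init__(self, powered_on, can_power_on=True, reachable=True):
        self.powered_on = powered_on
        self.can_power_on = can_power_on  # False models dead power supplies
        self.reachable = reachable        # False models an unreachable card
        self.restarts = 0

    def power_status(self):
        return "on" if self.powered_on else "off"

    def reset(self):
        if not self.reachable:
            raise ConnectionError("cannot reach ILO card")
        if not self.powered_on:
            return  # node is already down: the reset does nothing
        self.powered_on = False  # node is up: power it down ...
        if self.can_power_on:
            self.powered_on = True  # ... and back up again
            self.restarts += 1


def fence_with_reset(ilo):
    """Return True if the node can be treated as fenced (assumed policy)."""
    try:
        ilo.reset()
    except ConnectionError:
        return False  # reset never reached the card, fencing not confirmed
    # Either the node was already off or it has just been power-cycled.
    # Whether it manages to power on again is not the fence agent's concern.
    return True
```

In this model a node whose power supplies have died (powered_on=False) is reported as fenced right away, because the agent never waits for a successful power-on.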
Coman
----- Original Message ----
From: Miroslav Zubcic <mvz+rhcluster@xxxxxxxxx>
To: linux-cluster@xxxxxxxxxx
Sent: Thursday, December 13, 2007 4:55:11 AM
Subject: fence_ilo confused if both power supplies die
Hi all,
Is this a bug? Should we report it on the official RHN? (I hate that slow,
buggy, Oracle-based portal!)
Summary:
We have a 2-node cluster on HP ProLiant DL 380 G5 servers.
3 services in cluster:
- FreeRADIUS + IP addr
- Apache + IP addr + storage LUN
- Postgres + IP addr + storage LUN
Fencing is done via HP ILO cards.
A couple of days ago, both power supplies on one node died within a short
time (well, obviously it can happen). The fenced daemon, ccsd, and the
cluster in general didn't react well to that, despite surviving
non-real-life acceptance tests where we pulled both power supplies out.
For the HP ILO card, a faulty power supply is something different from a
missing power supply. The ILO card continued to work on its internal
battery, but the "POWER ON" action did not succeed (the POWER command kept
returning that power is off).
This situation confused the fence_ilo agent. The agent saw that the other
server was down, but it never returned success to the cluster because it
FAILED TO POWER ON the other server.
I think this is buggy behaviour. Who cares if the fence agent cannot power
the fenced node on again? Why didn't it just give up? Here is the relevant
part of the log on the healthy node, which tried to fence the other node.
Dec 10 03:37:14 aoc01 kernel: CMAN: removing node aoc02 from the cluster
: Missed too many heartbeats
Dec 10 03:37:14 aoc01 fenced[3012]: aoc02 not a cluster member after 0
sec post_fail_delay
Dec 10 03:37:14 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:37:50 aoc01 fenced[3012]: agent "fence_ilo" reports: failed to
turn on
Dec 10 03:37:50 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 03:37:55 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:37:55 aoc01 ccsd[2896]: process_get: Invalid connection
descriptor received.
Dec 10 03:37:55 aoc01 ccsd[2896]: Error while processing get: Invalid
request descriptor
Dec 10 03:37:55 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 03:38:00 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 03:38:00 aoc01 ccsd[2896]: process_get: Invalid connection
descriptor received.
Dec 10 03:38:00 aoc01 ccsd[2896]: Error while processing get: Invalid
request descriptor
Dec 10 05:42:13 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:18 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:18 aoc01 ccsd[2896]: process_get: Invalid connection
descriptor received.
Dec 10 05:42:18 aoc01 ccsd[2896]: Error while processing get: Invalid
request descriptor
Dec 10 05:42:18 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:23 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:23 aoc01 ccsd[2896]: process_get: Invalid connection
descriptor received.
Dec 10 05:42:23 aoc01 ccsd[2896]: Error while processing get: Invalid
request descriptor
Dec 10 05:42:23 aoc01 fenced[3012]: fence "aoc02" failed
Dec 10 05:42:28 aoc01 fenced[3012]: fencing node "aoc02"
Dec 10 05:42:28 aoc01 ccsd[2896]: process_get: Invalid connection
descriptor received.
Dec 10 05:42:28 aoc01 ccsd[2896]: Error while processing get: Invalid
request descriptor
--
Miroslav
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster