Hi, I'm doing failover tests on a CS4 cluster.
I have two HP DL380 servers plus an HP MSA1000 (aka the DL380
packaged cluster).
I have already read this post:
https://www.redhat.com/archives/linux-cluster/2006-January/msg00195.html
I'm clustering a single Oracle instance in active/passive mode. I
don't use GFS.
I use fence_ilo for fencing.
I have a fully working clustered Oracle. I tried migrating the Oracle
instance from one node to the other using system-config-cluster and
everything works perfectly.
I then tried some "ruder" failover tests with this setup:
node1 = active node
node2 = passive node
These are the results:
Situation 1:
I abruptly disconnect the power cable(s) from node1, so that node1 is
_completely_ powered off and no current flows into it. iLO is down.
I have redundant power supplies, but I wanted to simulate a short
circuit or a motherboard failure.
Node2, using fence, tries to power off node1.
fence_ilo tries to connect to node1_ilo_ip_address, but iLO is down
because of the power failure, so fencing fails and keeps retrying
forever.
Result: one node is perfectly up, but the cluster service is stalled.
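To make the stall in Situation 1 concrete, here is a minimal Python
sketch of a reachability probe with a timeout. It is not fence_ilo's
actual code: the function name ilo_reachable, the default port 443 and
the 3 second timeout are all assumptions chosen for illustration.

```python
import socket


def ilo_reachable(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds in time.

    Hypothetical helper: the port and timeout defaults are assumptions,
    not values taken from fence_ilo.
    """
    try:
        # create_connection raises OSError (including timeouts and
        # "connection refused") when the peer cannot be reached.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With node1 unplugged, a probe like this against node1_ilo_ip_address
would return False on every attempt, which is why retrying forever
never makes progress.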
Situation 2:
I push the on/off button on node1. It shuts down in 4 seconds, but
power is still connected, so iLO is up and working.
Node2, using fence, tries to power off node1.
iLO is working, so fence_ilo correctly connects to
node1_ilo_ip_address. It tries for some time to power off the already
powered-off server, then finally decides that the server is off.
Oracle is STILL down: no virtual IP, no storage mounted, and so on.
Now node2 tries to wake up the turned-off-but-still-powered node1.
Node1 wakes up and boots (the cluster is still stalled), then joins
fence_domain. Fence on node2 completes successfully and unblocks the
cluster, and everything is up again.
Switch time: 55 seconds (+ Oracle startup time).
Situation 3:
This is not a real failover test.
Everything is off. I turn on the MSA1000 and wait for it to boot.
Then I turn on node1, but node2 is still disconnected from power.
Node1 tries to turn on node2 to complete the fence_domain, but node2
has no power, so it will never wake up.
The cluster is stalled.
Can you change the fence behaviour to be less "radical"?
If iLO is unreachable, that means the machine is already off and
cannot be powered on, so fence should print a warning and let the
failover happen.
If iLO is reachable, fence should check the server's power status
first, to avoid a pointless power off/power on.
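The two rules above can be sketched as a small decision function. This
is only an illustration of the behaviour I am asking for, not how
fence_ilo works today; the names FenceDecision and decide_fence_action
are hypothetical.

```python
from enum import Enum


class FenceDecision(Enum):
    """Hypothetical outcomes for the proposed, less radical fencing."""
    PROCEED_WITH_WARNING = "proceed_with_warning"  # iLO unreachable
    ALREADY_OFF = "already_off"                    # nothing to power off
    POWER_OFF = "power_off"                        # normal fencing


def decide_fence_action(ilo_reachable, power_is_on):
    """Map iLO reachability and reported power state to a decision.

    power_is_on is ignored when the iLO cannot be reached, because no
    status can be read in that case.
    """
    if not ilo_reachable:
        # Assumption from the proposal: unreachable iLO means no power.
        return FenceDecision.PROCEED_WITH_WARNING
    if not power_is_on:
        # Server is already off: skip the power off/power on cycle.
        return FenceDecision.ALREADY_OFF
    return FenceDecision.POWER_OFF
```

Under this sketch Situation 1 would give PROCEED_WITH_WARNING and
Situation 2 would give ALREADY_OFF. It simply encodes the assumption
that an unreachable iLO means the node has no power, which would not
hold if only the iLO network link were down.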
As of today fence is really dangerous in a production environment, so
for now I will turn it off.
Matteo