More CS4 fencing fun

Matteo Catanese <m.catanese@xxxxxxxxxxxxx> · Tue, 7 Mar 2006 17:04:06 +0100

Hi, I'm doing failover tests on a CS4 cluster.

I have 2 HP DL380 servers + an HP MSA1000 (aka DL380 packaged cluster).

I have already read this post:
https://www.redhat.com/archives/linux-cluster/2006-January/msg00195.html

I'm clustering a single Oracle instance in active/passive mode. I don't
use GFS.

I use fence_ilo as the fence agent.

I have a fully working clustered Oracle. I tried migrating the Oracle
instance from one node to the other using system-config-cluster, and
everything works perfectly.

I tried some more "rude" failover tests with this setup:

node1 = active node
node2 = passive node

and these are the results:

Situation 1:

I rudely disconnect the power cable(s) from node1, so that node1 is
_completely_ turned off and no current flows in it. ILO is down.

I have redundant power supplies, but I wanted to simulate a short
circuit or motherboard failure.

Node2, using fence, tries to power off node1.

fence_ilo tries to connect to node1_ilo_ip_address, but ILO is down
because of the power failure, so fencing fails and keeps retrying forever.

Result: one node is perfectly up, but the cluster service is stalled.
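
To make the stall concrete, here is a rough model of what I observed. It is only an illustrative sketch in Python: fence_with_retry and ilo_without_power are made-up names, not part of fence_ilo or the cluster suite, and the attempt limit exists only so the example terminates (in my test the retries never stop).

```python
def fence_with_retry(try_fence, max_attempts):
    """Call try_fence() until it succeeds or max_attempts is used up.

    Returns the attempt number that succeeded, or None if none did.
    Hypothetical model: the limit is only here so the sketch ends.
    """
    for attempt in range(1, max_attempts + 1):
        if try_fence():
            return attempt
    return None


def ilo_without_power():
    """Stands in for an ILO with no power: every connection attempt fails."""
    return False


# Fencing never succeeds, so the service on node2 is never started.
service_stalled = fence_with_retry(ilo_without_power, max_attempts=5) is None
```

As long as the fence attempt cannot succeed, nothing after it (virtual IP, storage, Oracle) gets a chance to run on node2.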

Situation 2:

I push the on/off button on node1. It stops in 4 seconds, but power
is still connected, so ILO is up and working.

Node2, using fence, tries to power off node1.

ILO is working, so fence_ilo correctly connects to
node1_ilo_ip_address. It tries for some time to power off the
already powered-off server, then finally decides that the server is off.

Oracle is STILL down: no virtual IP, no storage mounted, and so on.

Now node2 tries to wake up the turned-off-but-still-powered node1.

Node1 wakes up, boots (the cluster is still stalled), and then joins
the fence_domain. Fencing on node2 completes successfully, the cluster
is unlocked, and everything is up again.

Switch time: 55 seconds (+ Oracle startup time).

Situation 3:

This is not a real failover test.

Everything is off. I turn on the MSA1000 and wait for it to boot.
Then I turn on node1, but node2 is still electrically disconnected.

Node1 tries to turn on node2 to complete the fence_domain, but node2 is
disconnected from power, so it will never wake up.

The cluster is stalled.

Can you change the fence behaviour to be less "radical"?

If ILO is unreachable, it means that the machine is already off and
cannot be powered on, so fence should print a warning and let the
failover happen.

If ILO is reachable, then check its power status first to avoid a
pointless power off/power on cycle.
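
The behaviour I am asking for could be sketched roughly like this. It is only an illustration in Python of my proposal, not how fence_ilo works today; FenceAction and decide_fence_action are hypothetical names, and the two inputs are assumed to come from whatever check the agent would do against ILO.

```python
from enum import Enum


class FenceAction(Enum):
    WARN_AND_FAILOVER = "warn_and_failover"  # ILO unreachable: warn, let failover happen
    ALREADY_OFF = "already_off"              # ILO says power is off: nothing to do
    POWER_OFF = "power_off"                  # ILO says power is on: fence as usual


def decide_fence_action(ilo_reachable, power_is_on):
    """Hypothetical decision logic for the proposal above.

    ilo_reachable: whether the agent could connect to the node's ILO.
    power_is_on:   power status reported by ILO (ignored if unreachable).
    """
    if not ilo_reachable:
        # Proposal: treat an unreachable ILO as "machine is off".
        return FenceAction.WARN_AND_FAILOVER
    if not power_is_on:
        # Avoids the pointless power off/power on cycle from situation 2.
        return FenceAction.ALREADY_OFF
    return FenceAction.POWER_OFF
```

With this logic, situation 1 would map to WARN_AND_FAILOVER and situation 2 to ALREADY_OFF, so in both cases node2 could take over the service without waiting for node1.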

As of today, fence is really dangerous in a production environment;
for now I will turn it off.

Matteo
