On 07/02/14 11:13 AM, Benjamin Budts wrote:
Gents,
We're not all gents. ;)
I have a 2-node setup (with quorum disk), Red Hat 6.5 & a Luci mgmt console.
Everything has been configured and we’re doing failover tests now.
Couple of questions I have:
·When I simulate a complete power failure of a server's PDUs (no more
access to iDRAC fencing or APC PDU fencing), I can see that fencing of
the node that was running the application fails. I noticed that unless
fencing returns an OK, I'm stuck and my application won't start on my
2nd node. Which is OK I guess, because no fencing could mean there is
still I/O on my SAN.
This is expected. If a lost node can't be put into a known state, there
is no safe way to proceed. To do so would be to risk a split brain at
least, and data loss/corruption at worst.
The way I deal with this is to have nodes with redundant power supplies
and use two PDUs and two UPSes. This way, the failure of one circuit /
UPS / PDU doesn't knock out the power to the mainboard of the nodes, so
you don't lose IPMI.
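As a rough sketch, the per-node fencing section in cluster.conf can
chain IPMI as the first method and the two switched PDUs as a fallback;
with dual PDUs, both feeds get cut before power comes back on. All of
the names, IPs, ports and passwords below are placeholders for your own
gear:

  <clusternode name="node1.example.com" nodeid="1">
    <fence>
      <!-- Try IPMI (iDRAC) first; it can confirm the node is really down. -->
      <method name="ipmi">
        <device name="ipmi_node1" action="reboot"/>
      </method>
      <!-- Fall back to the PDUs; turn both feeds off before turning them back on. -->
      <method name="pdu">
        <device name="pdu_a" port="1" action="off"/>
        <device name="pdu_b" port="1" action="off"/>
        <device name="pdu_a" port="1" action="on"/>
        <device name="pdu_b" port="1" action="on"/>
      </method>
    </fence>
  </clusternode>

  <fencedevices>
    <fencedevice name="ipmi_node1" agent="fence_ipmilan" ipaddr="10.20.0.1" login="admin" passwd="secret"/>
    <fencedevice name="pdu_a" agent="fence_apc_snmp" ipaddr="10.20.0.2"/>
    <fencedevice name="pdu_b" agent="fence_apc_snmp" ipaddr="10.20.0.3"/>
  </fencedevices>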
Clustat also shows on the active node that the 1st node is still
running the application.
That's likely because rgmanager uses DLM, and DLM blocks until the fence
succeeds, so it can't update its view.
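If you want to confirm that's what's going on, a couple of things to
check on the surviving node (stock RHEL 6 tooling; adjust as needed):

  # Show node status as cman sees it; the lost node should be listed as dead.
  cman_tool nodes

  # Show the fence domain; a pending fence shows up here (victim / wait state).
  fence_tool ls

  # fenced logs why each fence attempt failed.
  grep -i fence /var/log/messages | tail -20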
How can I intervene manually, so as to force a start of the application
on the node that is still alive ?
If you are *100% ABSOLUTELY SURE* that the lost node has been powered
off, then you can run 'fence_ack_manual'. Please be super careful about
this though. If you do this, in the heat of the moment with clients or
bosses yelling at you, and the peer isn't really off (ie: it's only
hung), you risk serious problems.
I cannot emphasize strongly enough the caution needed when using this
command.
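For what it's worth, the override looks roughly like this on RHEL 6;
the exact syntax has changed between cluster versions, so check 'man
fence_ack_manual' on your build. The node name is just an example:

  # ONLY after you have physically confirmed the lost node is powered off:
  fence_ack_manual node1.example.com

  # Older cluster2-era builds wanted a -n flag instead:
  # fence_ack_manual -n node1.example.com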
Is there a way to tell the cluster, don’t take into account node 1
anymore and don’t try to fence anymore, just start the application on
the node that is still ok ?
No. That would risk a split brain and data corruption. The only safe
option for the cluster, in the face of a failed fence, is to hang. As
bad as it is to hang, it's better than risking corruption.
I can’t possibly wait until power returns to that server. Downtime could
be too long.
See the solution I mentioned earlier.
·If I tell a node to leave the cluster in Luci, I would like it to
remain a non-cluster member after the reboot of that node. It rejoins
the cluster automatically after a reboot. Any way to prevent this ?
Thx
Don't let cman and rgmanager start on boot. This is always my policy. If
a node failed and got fenced, I want it to reboot, so that I can log
into it and figure out what happened, but I do _not_ want it back in the
cluster until I've determined it is healthy.
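Concretely, something along these lines with the standard RHEL 6 init
tools, then join the node back in by hand once you're happy with it:

  # Keep the cluster stack from starting on boot.
  chkconfig cman off
  chkconfig rgmanager off

  # Later, after you've checked the node over, start it manually.
  service cman start
  service rgmanager start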
hth
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster