On 4/27/2015 1:28 PM, Vasil Valchev wrote:
Hi,
I would advise you to use a quorum disk _only_ as a last resort -
it's better to first get a solid understanding of the clustering
solution before adding complexity.
[Jatin] Thank you very much for sharing this tutorial. I will surely
go through it and gain more understanding.
Especially useful are the first chapters - the theory.
What I suspect is happening in your case is that your
cluster communication and fencing are over the same network,
which is not fault tolerant.
[Jatin]
My cluster communication happens over one network while fencing
happens over another network. I use two separate VLANs for this
purpose. Secondly, when cluster communication fails due to a
network outage, fencing happens over the other VLAN and both
nodes get fenced.
So what happens if this network fails? Your 2 nodes can't
see each other, so they send fence requests, but the fence
devices are unreachable too, so those requests fail.
They are retried a few times, I think, but if all fail, the
fence agent returns failed and your cluster is stuck in a
"recovering" or stopped state.
Other times the network outage is shorter and the fence
succeeds, resulting in both nodes going down - this is solved
with the delay parameter.
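The two outcomes described above (fencing fails entirely, or both
nodes fence each other) can be illustrated with a toy model. This is
only a minimal sketch of the reasoning, not how any fence daemon or
fence agent is implemented; the function name fence_race and its
parameters are hypothetical and exist only for this illustration.

```python
def fence_race(delay_a, delay_b, fence_network_up=True):
    """Toy model of a two-node fence race (illustration only).

    delay_a: seconds node A waits before its fence request hits node B.
    delay_b: seconds node B waits before its fence request hits node A.
    Returns the set of surviving nodes, or None when no fence request
    can succeed and the cluster stays blocked.
    """
    if not fence_network_up:
        # Fence devices unreachable: every request fails, nobody is
        # fenced, and the cluster cannot safely recover resources.
        return None
    if delay_a == delay_b:
        # Both requests land at the same moment: both nodes go down.
        return set()
    # The node with the shorter wait fences its peer before the peer
    # gets the chance to act.
    return {"A"} if delay_a < delay_b else {"B"}


print(fence_race(0, 0))                          # set()
print(fence_race(0, 15))                         # {'A'}
print(fence_race(0, 0, fence_network_up=False))  # None
```

With equal delays both nodes are lost; giving one side a head start
leaves exactly one survivor, which is the point of the delay
parameter.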
The first issue is an architectural one: it is the expected
behavior of the cluster to stop (or "freeze") all resources if
it can't guarantee the state of all members.
Read the article above, it's really very useful.
Cheers!
Thanks
Jatin
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster