Re: RHEL 6 two-node cluster - nodes killing each other's cman

Digimer <lists@xxxxxxxxxx> · Thu, 26 Jul 2012 13:43:39 -0400

That's a non-standard use of the cluster stack. Its design (and 
safeguards) assume a configuration where the nodes work together fully 
and are redundant. So though it works, it's not going to work perfectly 
in your use case.

And yes, you do need to restart cman (one way or the other).

Digimer

PS - Please reply to the mailing list. These replies can help others 
later by being in the archives.

On 07/26/2012 01:18 PM, DIMITROV, TANIO wrote:
The reason I don't want to reboot/fence the node is that my nodes are actually semi-independent - each one writes to its local file system which is then backed up on the other node when it becomes available.

So, the only way to rejoin the cluster is to start the CPG sequence from 0 (clean state) by either rebooting the node or restarting CMAN?

-----Original Message-----
From: Digimer [mailto:lists@xxxxxxxxxx]
Sent: Thursday, July 26, 2012 12:47 PM
To: DIMITROV, TANIO
Cc: linux clustering
Subject: Re:  RHEL 6 two-node cluster - nodes killing each other's cman

For automatic recovery, you have to use power fencing. Fabric fencing
(like fencing at a SAN switch) is perfectly safe, but it requires human
intervention.

The problem is that the messages passed around the cluster in the closed
process group (CPG) are sequenced. Once a node falls out of sequence, it
needs to be restarted. To automate this, power fence the node. When it
boots back up, it should automatically rejoin the cluster with a clean
state.
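
For the node to rejoin on its own after a power fence, cman has to be
enabled at boot and the fence agent has to actually work. A minimal
sketch of checking both on RHEL 6, assuming the stock `chkconfig` and
`fence_node` tools. The peer name `node2` and the `DRY_RUN` switch are
placeholders introduced for this example, not something from the thread.
Note that `fence_node` really does fence the target when run live.

```shell
#!/bin/sh
# Sketch only. DRY_RUN and PEER are hypothetical names for this example.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them
PEER="${PEER:-node2}"     # node name as it appears in cluster.conf (assumed)

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

check_auto_rejoin() {
    # Start cman at boot so a power-fenced node rejoins with a clean state.
    run chkconfig cman on || return 1
    run chkconfig --list cman || return 1
    # Fire the configured fence method at the peer. Run live, this power
    # cycles the peer, so only do it in a maintenance window.
    run fence_node "$PEER"
}

check_auto_rejoin
```

By default this only prints what it would do. Set `DRY_RUN=0` to execute
the commands for real.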

May I ask why you're so careful to avoid a restart? The whole idea of
clustering is to have no/minimal interruption of service during a node
failure.

Digimer

On 07/26/2012 12:04 PM, DIMITROV, TANIO wrote:
Thanks Digimer,

Yes, this works but it cannot be done automatically - and that's my problem.
I'm trying to figure out why CMAN gets killed. What if I use a SAN switch as a fencing device to block access to the SAN? My node won't be rebooted, so won't I run into the same situation?
Is it at all possible for the node to rejoin the cluster without rebooting or restarting CMAN?
And if it is not, what about the SAN switch fencing scenario?

-----Original Message-----
From: Digimer [mailto:lists@xxxxxxxxxx]
Sent: Thursday, July 26, 2012 11:48 AM
To: linux clustering
Cc: DIMITROV, TANIO
Subject: Re:  RHEL 6 two-node cluster - nodes killing each other's cman

On 07/26/2012 11:44 AM, DIMITROV, TANIO wrote:
Hello,
I'm testing a RHEL 6.2 cluster using CMAN.
It is a two-node cluster, no shared data. The problem is that if there is a connectivity problem between the nodes, each of them continues working as stand-alone, which is OK (no shared data, manual fencing). But when the connection comes back up, the nodes kill each other's cman instances:

Jul 26 13:58:05.000 node1 corosync[15771]: cman killed by node 2 because we were killed by cman_tool or other application
Jul 26 13:58:05.000 node1 gfs_controld[15900]: cluster is down, exiting
Jul 26 13:58:05.000 node1 gfs_controld[15900]: daemon cpg_dispatch error 2
Jul 26 13:58:05.000 node1 dlm_controld[15848]: cluster is down, exiting

Can this be avoided somehow?

Thanks in advance!

Use real fencing.

The problem is, I believe, that the CPG messages fall out of sync. You
could try stopping cman on one node, reconnecting the network and then
starting cman on that same node again.
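
A minimal sketch of that sequence, run on the one node being rejoined,
assuming the stock RHEL 6 init script (`service cman ...`) and
`cman_tool`. The `PEER` name, the `DRY_RUN` switch and the ping-based
wait are assumptions made for this example. If rgmanager, clvmd or GFS2
mounts are in use, they would have to be stopped before cman, which this
sketch does not handle.

```shell
#!/bin/sh
# Sketch only. DRY_RUN and PEER are hypothetical names for this example.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = execute them
PEER="${PEER:-node2}"     # peer's name/address on the cluster interconnect

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

wait_for_peer() {
    # Block until the interconnect is back (the operator repairs it meanwhile).
    if [ "$DRY_RUN" = "1" ]; then
        echo "would wait for: ping -c 1 $PEER"
        return 0
    fi
    until ping -c 1 "$PEER" >/dev/null 2>&1; do
        sleep 5
    done
}

rejoin_cluster() {
    run service cman stop || return 1    # leave the cluster on this node only
    wait_for_peer                        # network must be back before rejoining
    run service cman start || return 1   # rejoin with a clean state
    run cman_tool nodes                  # list membership to confirm the rejoin
}

rejoin_cluster
```

By default this only prints the steps. Set `DRY_RUN=0` to execute them.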

--
Digimer
Papers and Projects: https://alteeve.com

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster