Re: CMAN: got WAIT barrier not in phase 1 TRANSITION.96 (2)

On Oct 16, 2006, at 8:34 AM, Patrick Caulfield wrote:

Tom Mornini wrote:
We're running into problems when adding new nodes to our cluster.

snip...

Oct 13 04:09:04 ey00-s00017 kernel: CMAN: Waiting to join or form a
Linux-cluster
Oct 13 04:09:05 ey00-s00017 kernel: CMAN: sending membership request
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00025
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00019
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00030
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00024
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00010
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00016
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00004
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00011
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00005
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00009
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00002
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00015
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00014
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00008
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00003
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00006
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00012
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00013
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00007
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00001
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-s00000
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-04
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-05
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-03
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-00
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-01
Oct 13 04:09:06 ey00-s00017 kernel: CMAN: got node ey00-02
Oct 13 04:09:06 ey00-s00017 kernel: dlm: no version for
"kcl_register_service" found: kernel tainted.
Oct 13 04:09:06 ey00-s00017 kernel: DLM 1.03.00 (built Sep  8 2006
03:50:23) installed
Oct 13 04:09:57 ey00-s00017 kernel: CMAN: node ey00-s00018 rejoining
Oct 13 04:17:18 ey00-s00017 kernel: CMAN: got WAIT barrier not in phase
1 TRANSITION.96 (2)

That message should be harmless. Does it prevent the cluster from reaching quorum?

Hello Patrick / list, I've been working with Tom on this problem.

It doesn't prevent quorum, but after this point the new nodes
mysteriously can't seem to join the fence domain.  I've checked, and it
doesn't appear that any node is trying to fence any other, so I'm at a
bit of a loss to explain what's going on.
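
For anyone reproducing this, the following is roughly what we watch
while a node tries to join (this assumes the cman 1.x /proc interface;
paths and output differ on newer versions):

    # membership and votes as this node sees them
    cat /proc/cluster/nodes
    cman_tool status          # quorum / expected votes summary

    # service (fence domain, DLM) states; on the old nodes the fence
    # domain shows running, on a stuck new node it stays joining
    cat /proc/cluster/services
    cman_tool services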

The really bizarre thing is that the old nodes don't seem to interact
with the new ones even though the new ones have joined the cluster
(i.e. the fence domain on the old nodes shows running, while on a new
node it stays joining indefinitely).  If you prod it enough (start
enough new nodes), the existing cluster eventually blows apart (nodes
start kicking each other out for inconsistency and the like).

Let me explain a few things about our cluster (a rough sketch of the
relevant cluster.conf entries follows this list):

We are running Xen.

The control VM for each node is in the cluster with 1 vote.

The application VMs are dynamically spawned and are entered into the
cluster.

The application VMs have 0 votes (so as to prevent one physical machine
from accidentally grabbing a quorum of votes if it has too many
application VMs running on it).

We are currently using fence_manual for debugging purposes (we have an
APC MasterSwitch to eventually use for fencing).
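
To make that concrete, the relevant cluster.conf entries look roughly
like this (node names are taken from the logs above; the real file
lists every node, and the exact layout here is from memory, so treat it
as a sketch rather than a copy of our config):

    <clusternodes>
      <!-- control VM on a physical node: carries the vote -->
      <clusternode name="ey00-00" votes="1">
        <fence>
          <method name="single">
            <device name="human" nodename="ey00-00"/>
          </method>
        </fence>
      </clusternode>

      <!-- dynamically spawned application VM: zero votes, so one
           physical box can't accumulate a quorum through its guests -->
      <clusternode name="ey00-s00017" votes="0">
        <fence>
          <method name="single">
            <device name="human" nodename="ey00-s00017"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>

    <fencedevices>
      <!-- manual fencing for now; the plan is to switch to fence_apc
           once we wire up the APC MasterSwitch -->
      <fencedevice name="human" agent="fence_manual"/>
    </fencedevices>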

We are experiencing the following problems:

Once the cluster grows past a certain size (around 20 members), we
start having serious trouble keeping it together.  Nodes are sometimes
kicked out for having an inconsistent view, and there are frequent
complaints that the member count doesn't match between nodes.  Right
now we have the 1.03 version of everything installed (it came packaged,
and we're trying to avoid building too much from scratch).

When a node starts up with an old cluster.conf, it never seems to pick
up the newer version automatically.  If the file is updated while a
node is down, does it have to be manually synced before that node
rejoins?
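
(For reference, the way I understand a new cluster.conf is meant to be
pushed from a running node, assuming ccsd is up everywhere, is to bump
config_version and then do something like the following; whether that
ever reaches a node that was down at the time is the part I can't work
out:)

    # on a node that already has the updated cluster.conf
    ccs_tool update /etc/cluster/cluster.conf    # distribute the file to the other ccsd daemons
    cman_tool version -r <new_config_version>    # tell cman the new config version number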

Finally, a random question: when I'm debugging this stuff, I use
"cman_tool services" to keep tabs on things.  What do the values in the
Code column mean?

--

Jayson Vantuyl
Quality Humans, Inc.


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
