Re: cman startup issue


On Wed, 7 Nov 2007, Patrick Caulfield wrote:

I'm having a weird problem. I am using a shared GFS root file system, and the same initrd image on all the machines. The cluster has 3 machines on it at the moment, and 1 refuses to join the cluster, regardless of which order I bring them up in.

When the cman service is started, it fails with:

cman not started: Can't find local node name in cluster.conf
/usr/local/sbin/cman_tool: aisexec daemon didn't start

If I try to run aisexec, I get:
aisexec: totemsrp.c:2867: memb_ring_id_store: Assertion `0' failed.

Where should I be looking for causes of this? I double-checked my cluster.conf, and the MAC addresses, IP addresses and interface names are correct in each node's config.

Check that the new node can write into /tmp - where it is trying to store the current ring-id. It could be SELinux or perhaps the permissions on the file it is trying to create.
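
A quick way to check (generic diagnostics, nothing cman-specific; the ringid_* file name below is an assumption based on my openais build, so adjust it to whatever yours actually creates):

touch /tmp/write-test && rm /tmp/write-test   # can we write to /tmp at all?
getenforce                                    # is SELinux enforcing?
grep avc: /var/log/audit/audit.log | tail     # any recent SELinux denials?
ls -l /tmp/ringid_* 2>/dev/null               # stale ring-id file left over? (name pattern assumed)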

That fixed the aisexec problem, but the "Can't find local node name in cluster.conf" problem remains, and cman still won't start. :-(

Well, it won't start if it can't find the local node name in cluster.conf ...

Have you double-checked that the name(s) in cluster.conf match those on the ethernet interfaces?

You mean as in:

<eth name="eth1" mac="my:ma:ca:dd:re:ss" ip="10.1.2.3" mask="255.255.255.0"/>

?

If so, then yes, I checked it about 10 times. That was the first thing I
thought was wrong. :-(

As I don't have your cluster.conf or access to your DNS server it's hard to say from here, but that message does mean what it says. If you have older software it might not detect anything other than the node's main hostname, but later versions will check all the interfaces on the system for something that matches anything in cluster.conf.
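
For reference, the names being matched are the clusternode names, so a minimal cluster.conf sketch along these lines (node names and IDs made up for illustration) is what has to agree with the node's hostname and interface addresses:

<?xml version="1.0"?>
<cluster name="mycluster" config_version="1">
  <clusternodes>
    <!-- each name must resolve to an address configured on one of
         that machine's interfaces -->
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
    <clusternode name="node3" nodeid="3"/>
  </clusternodes>
</cluster>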

Well, the thing that really puzzles me is that the same cluster used to work before. All I effectively did was move it to a different IP range and change cluster.conf. I can't figure out what could have changed in the meantime to break it, other than cluster.conf. The only other thing that's different is that some of the machines have eth1 and eth0 reversed. Before they all used eth1 for cluster communication, and now one of them uses eth0 (slightly different model, and the manufacturer mislabeled the ports on them). But I have two identical machines, and one connects while the other doesn't. It really has me stumped.
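
For what it's worth, this is how I've been comparing what the failing node thinks it is against what's actually configured (generic commands, nothing cman-specific):

uname -n                   # the name cman will try to match
getent hosts $(uname -n)   # what that name resolves to (DNS or /etc/hosts)
ip addr show               # addresses actually configured on the interfaces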

I see you're using eth1, so make sure you have an up-to-date cman.

I'm running the latest that is available for RHEL5.

If that's what came with 5.0 then there's a bug in the name matching. I can't figure out from the CVS tags in which package this was fixed, unfortunately.

"revision 1.26
 date: 2007/03/15 11:12:33;  author: pcaulfield;  state: Exp;  lines: +16 -13
 If the machine is multi-homed, then using a truncated name in uname but not in cluster.conf would fail to match them up."

Well, I can tell you that the fix is NOT in cman-2.0.61, and it IS in cman-2.0.73. Sorry I can't be more specific!

Assuming that's what's causing my problem, it's not in 2.0.64, as that is what I have.

Is there a workaround? What triggers the bug? Can I make it go away by using different node names? Is it affected by DNS?
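
From that log message, my guess (only a guess) is that the trigger is the short name from uname not matching a fully-qualified name in cluster.conf, or vice versa, on a multi-homed box. So the first thing I'll try is making the two agree exactly (node1.example.com is just a placeholder name):

uname -n                                     # e.g. "node1"
grep clusternode /etc/cluster/cluster.conf   # e.g. name="node1.example.com"
# if one is short and the other fully qualified, make them identical,
# e.g. set the hostname to the exact name used in cluster.conf:
hostname node1.example.com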

Gordan

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
