vahram wrote:
Rick Stevens wrote:
I had a similar issue. The problem was with the multicast routing.
I was using two NICs on each node...one public (eth0) and one private
(eth1), with the default gateway going out eth0.
The route for the multicast (224.x.x.x) was going out the default
gateway and not reaching the other machine. Once I put in a fixed route
for multicast:
route add -net 224.0.0.0/8 dev eth1
it all started working. This was my fix; it may not work for you.
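If you want to see what the kernel would do with multicast traffic before and after adding such a route, here is a rough Python sketch. It is not part of the cluster tools. The group address 224.0.0.1 and the port number are only example values, and the function names are made up for illustration. The first helper checks that a route's CIDR actually covers a given multicast group; the second asks the kernel which local address it would pick as the source for that group, which tells you which NIC the traffic would leave on.

```python
import ipaddress
import socket

# The whole IPv4 multicast block; 224.0.0.0/8 is only one slice of it.
MULTICAST_RANGE = ipaddress.ip_network("224.0.0.0/4")


def route_covers(route_cidr, group):
    """True if `group` is a multicast address that falls inside `route_cidr`."""
    addr = ipaddress.ip_address(group)
    return addr in MULTICAST_RANGE and addr in ipaddress.ip_network(route_cidr)


def source_address_for(group, port=9):
    """Return the local address the kernel would use to reach `group`.

    connect() on a UDP socket sends no packets; it only makes the kernel
    do its route lookup. Returns None if there is no route to the group.
    The port is an arbitrary example value.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            sock.connect((group, port))
        except OSError:
            return None
        return sock.getsockname()[0]


if __name__ == "__main__":
    print(route_covers("224.0.0.0/8", "224.0.0.1"))    # True
    print(route_covers("224.0.0.0/8", "239.192.0.1"))  # False
    print(source_address_for("224.0.0.1"))
```

If the last line prints the eth0 address rather than the eth1 one, multicast is still following the default gateway.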
Also, I use the CVS code from http://sources.redhat.com/cluster and
not the source RPMs from the location you specified.
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer rstevens@xxxxxxxxxxxxxxx -
- VitalStream, Inc. http://www.vitalstream.com -
- -
- Veni, Vidi, VISA: I came, I saw, I did a little shopping. -
----------------------------------------------------------------------
Yep, both boxes have two NICs. eth0 is public, and eth1 is private
(192.168.2.x). I tried adding the route, and that didn't fix it. I've
also tried disabling the private NIC before and running with one public
NIC, and that didn't fix it either. One other interesting thing I
noticed...when I run cman_tool join on nodeA, netstat shows ccsd trying
to do this:
tcp 0 0 127.0.0.1:50006 127.0.0.1:739 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:738 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:737 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:736 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:743 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:742 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:741 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:740 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:727 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:731 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:730 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:729 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:728 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:735 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:734 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:733 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:732 TIME_WAIT -
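To put a number on output like the above rather than eyeballing it, something like the following Python sketch could tally connections per local endpoint and state. It assumes each connection sits on one line in the usual netstat column order (proto, recv-q, send-q, local, foreign, state, program); the exact netstat flags used above are not known, and the function name is just illustrative.

```python
from collections import Counter

# Three entries copied from the output above, one connection per line.
SAMPLE = """\
tcp 0 0 127.0.0.1:50006 127.0.0.1:739 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:738 TIME_WAIT -
tcp 0 0 127.0.0.1:50006 127.0.0.1:737 TIME_WAIT -
"""


def tally(netstat_text):
    """Count TCP connections grouped by (local endpoint, state)."""
    counts = Counter()
    for line in netstat_text.splitlines():
        fields = line.split()
        # Skip headers, blank lines and anything that is not a TCP entry.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, state = fields[3], fields[5]
        counts[(local, state)] += 1
    return counts


if __name__ == "__main__":
    for (local, state), n in tally(SAMPLE).most_common():
        print(f"{local} {state} {n}")
```

A count for 127.0.0.1:50006 that keeps growing while cman_tool join is running would suggest something on the local box is repeatedly connecting to that port and getting nowhere.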
Looking back at your cluster.conf, I see you're using broadcast. I used
multicast because, in the first CVS checkout I did, broadcast didn't
work properly. It's possible your SRPMs also have that flaw. Why not
try multicast and see if that works? Add the route I mentioned; here's
my cluster.conf, which you can crib:
<?xml version="1.0"?>
<cluster name="test" config_version="1">
  <cman two-node="1" expected_votes="1">
    <multicast addr="224.0.0.1"/>
  </cman>
  <nodes>
    <node name="gfs-01-001" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
      <fence>
        <method name="single">
          <device name="human" ipaddr="gfs-01-001"/>
        </method>
      </fence>
    </node>
    <node name="gfs-01-002" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
      <fence>
        <method name="single">
          <device name="human" ipaddr="gfs-01-002"/>
        </method>
      </fence>
    </node>
  </nodes>
  <fence_devices>
    <device name="human" agent="fence_manual"/>
  </fence_devices>
</cluster>
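If you adapt that file, a quick sanity check before starting the cluster can catch typos. The Python sketch below is only an illustration, not something from the cluster tools: it parses a trimmed copy of the config (fence sections left out for brevity) and reports any node whose multicast address differs from the one under cman, or which names no interface. Treating those as problems is my assumption, based on how the file above is laid out; the function name is made up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above (fence sections omitted).
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster name="test" config_version="1">
  <cman two-node="1" expected_votes="1">
    <multicast addr="224.0.0.1"/>
  </cman>
  <nodes>
    <node name="gfs-01-001" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
    </node>
    <node name="gfs-01-002" votes="1">
      <multicast addr="224.0.0.1" interface="eth1"/>
    </node>
  </nodes>
</cluster>
"""


def check_multicast(conf_text):
    """Return a list of human-readable problems; empty means none found."""
    root = ET.fromstring(conf_text)
    cluster_mcast = root.find("cman/multicast")
    cluster_addr = cluster_mcast.get("addr") if cluster_mcast is not None else None
    problems = []
    for node in root.findall("nodes/node"):
        name = node.get("name")
        mcast = node.find("multicast")
        if mcast is None:
            problems.append(f"{name}: no <multicast> element")
            continue
        if mcast.get("addr") != cluster_addr:
            problems.append(
                f"{name}: addr {mcast.get('addr')} differs from cman addr {cluster_addr}"
            )
        if not mcast.get("interface"):
            problems.append(f"{name}: no interface given")
    return problems


if __name__ == "__main__":
    for problem in check_multicast(CLUSTER_CONF):
        print(problem)
```

To check a real file, read it with open("/etc/cluster/cluster.conf").read() (adjust the path to wherever yours lives) and pass the text in.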
----------------------------------------------------------------------
- Rick Stevens, Senior Systems Engineer rstevens@xxxxxxxxxxxxxxx -
- VitalStream, Inc. http://www.vitalstream.com -
- -
- What's small, yellow and very, VERY dangerous? The root canary! -
----------------------------------------------------------------------