Hello folks,
We've been having two nasty problems with a GFS cluster, currently
running version 2.03.03 of the cluster suite and 0.80.3 of OpenAIS.
The first is that for some time now, logging has been broken. We're
getting kernel log messages from the DLM and GFS modules, but the
userland utilities (i.e. OpenAIS) refuse to log at all when used with
the cluster suite. Logging is fine when OpenAIS is started without it
(i.e. with the default OpenAIS config file), so I'm pretty sure the
logging setup itself is not the problem. Somehow, it seems that
OpenAIS is not being given the correct logging parameters by CMAN, and
I really don't know why. I've tried including extra logging directives
in cluster.conf, in various different forms, but to no avail. The
cluster.conf we're using now is as follows:
<?xml version="1.0"?>
<cluster name="gfscluster" config_version="6">
  <clusternodes>
    <clusternode name="smb1-cluster" nodeid="1">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="1"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="smb1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="smb2-cluster" nodeid="2">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="2"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="smb2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="mail-cluster" nodeid="3">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="3"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="mail"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="backup-cluster" nodeid="4">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="4"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="backup"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="powerswitch" agent="fence_epc"
                 host="192.168.10.xx" passwd="xxx" action="4"/>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>
  <fence_daemon post_join_delay="30"/>
  <logging to_syslog="yes" syslog_facility="local3">
    <logger ident="CPG" to_syslog="yes"/>
    <logger ident="CMAN" to_syslog="yes"/>
    <logger ident="CLM" to_syslog="yes"/>
  </logging>
</cluster>
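
For comparison, when aisexec is started standalone (without CMAN),
logging works with the stock config. From memory, the relevant stanza
in the default /etc/ais/openais.conf looks roughly like this; I may be
misremembering individual key names in 0.80.3:

logging {
        to_stderr: yes
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
}

So the logging subsystem itself clearly works; it's only the
CMAN-driven configuration path that produces nothing.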
Any idea why this might not be working?
The second problem is that once quorum is reached, any additional node
joining makes the existing quorate cluster break apart. We've seen
this behaviour in a three-node config with the third node joining, and
in a four-node config with the fourth node joining; WHICH node joins
last doesn't seem to make a difference. "Breaking apart" means that
the newly joined node dies ("joining cluster with disallowed nodes,
must die"), one of the existing nodes dies, and two of the other
existing nodes keep running, but desynced - both show differing
cluster membership and differing disallowed nodes. This happens after
a fresh reboot, so there is NO state on any node before joining. The
failure occurs at the cman_tool join stage.
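
In case the exact procedure matters, the join sequence on each node is
essentially the standard one, and I've been comparing state on the
survivors with cman_tool; roughly:

# on each node, after a clean boot
ccsd                    # config daemon, reads cluster.conf
cman_tool join -w       # join the cluster; this is where things fall apart

# afterwards, on each surviving node
cman_tool status        # quorum / membership summary
cman_tool nodes         # per-node state; the two survivors disagree here

(That's from memory rather than copy-pasted, but the point is that
nothing unusual happens between boot and the join.)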
I have a gut feeling it might have something to do with our network
config: three of the nodes have four ethernet interfaces each (two for
iSCSI, one for cluster traffic, one for regular LAN access), and the
fourth has two (one for iSCSI, one for cluster traffic, with no LAN
access for now). Routing tables etc. should be set up properly; as you
can see above, cluster.conf uses dedicated hostnames for the cluster
interfaces, which are resolved to IPs via hosts files that are
identical on all four machines (see the excerpt below). I have yet to
do any packet sniffing, and thanks to the first problem I have very
little log information, so I know this isn't much to go on; but I
thought I'd include it anyway, in case someone can immediately point
out the problem.
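
For completeness, the hosts entries for the cluster interfaces follow
this pattern on all four machines (the actual addresses differ; these
are made up):

# /etc/hosts excerpt, identical on all nodes
192.168.20.1    smb1-cluster
192.168.20.2    smb2-cluster
192.168.20.3    mail-cluster
192.168.20.4    backup-cluster

When I do get around to packet sniffing, my plan is to capture the
totem traffic on the cluster interface, something like:

tcpdump -n -i eth2 udp port 5405

(eth2 stands in for whatever the cluster interface is on each node,
and 5405 is the usual default totem mcastport, assuming CMAN doesn't
pick a different one.)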
Thanks in advance,
Daniel