Problems with logging and cluster instability

Hello folks,

We've been having two nasty problems with a GFS cluster, currently running cluster suite 2.03.03 and OpenAIS 0.80.3.

The first is that for some time now, logging has been broken. We're getting kernel log messages from the DLM and GFS modules, but the userland utilities (i.e. OpenAIS) refuse to log at all when run under the cluster suite. Logging is fine when OpenAIS is started on its own (i.e. with the default OpenAIS config file), so I'm pretty sure the logging setup itself is not the problem. Somehow, OpenAIS does not seem to be given the correct logging parameters by CMAN, and I really don't know why. I've tried adding extra logging directives to cluster.conf, in various different forms, but to no avail. The cluster.conf we're using now is as follows:

<?xml version="1.0"?>
<cluster name="gfscluster" config_version="6">

  <clusternodes>
    <clusternode name="smb1-cluster" nodeid="1">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="1"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="smb1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="smb2-cluster" nodeid="2">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="2"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="smb2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="mail-cluster" nodeid="3">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="3"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="mail"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="backup-cluster" nodeid="4">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="4"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="backup"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice name="powerswitch" agent="fence_epc" host="192.168.10.xx" passwd="xxx" action="4"/>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>

  <fence_daemon post_join_delay="30">
  </fence_daemon>

  <logging to_syslog="yes" syslog_facility="local3">
    <logger ident="CPG" to_syslog="yes">
    </logger>
    <logger ident="CMAN" to_syslog="yes">
    </logger>
    <logger ident="CLM" to_syslog="yes">
    </logger>
  </logging>

</cluster>

Any idea why this might not be working?
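
For comparison, this is roughly the openais.conf-style logging stanza (whitetank / 0.80 syntax) that works when OpenAIS is started standalone. The values below are only a sketch, not our literal config, but it's what I would expect CMAN to hand OpenAIS an equivalent of when it reads the <logging> block above:

logging {
        # Global output options; syslog output does appear when aisexec
        # is started on its own with a stanza like this.
        to_stderr: no
        to_syslog: yes
        syslog_facility: local3
        timestamp: on
        # Per-subsystem logger sections, analogous to the <logger ident="..."/>
        # elements in cluster.conf.
        logger {
                ident: CMAN
                debug: off
        }
        logger {
                ident: CPG
                debug: off
        }
}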

The second problem is that once quorum is reached, any additional node that joins makes the existing quorate cluster break apart. We've seen this in a three-node config when the third node joins, and in a four-node config when the fourth node joins; WHICH node joins last doesn't seem to make a difference. "Breaking apart" means that the newly joined node dies ("joining cluster with disallowed nodes, must die"), one of the existing nodes dies, and the remaining two existing nodes keep running but out of sync: each shows a different cluster membership and a different set of disallowed nodes. This happens after a fresh reboot, so there is NO state on any node before joining. The crash occurs at the cman_tool join stage.
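
Next time it happens I plan to capture the cluster state on each surviving node right after the split, with something along these lines (cluster 2 tools; output omitted here):

# Run on every node that is still up, immediately after the split,
# and compare the views side by side
cman_tool status   # quorum state, config version, cluster/node addresses
cman_tool nodes    # membership as seen from this node
group_tool         # fence/dlm/gfs group membership and state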

I have a gut feeling it might have something to do with our network config: three of the nodes have four ethernet interfaces each (two for iSCSI, one for cluster traffic, one for regular LAN access), while the fourth has only two (one for iSCSI, one for cluster traffic) and no LAN access for now. Routing tables etc. should be set up properly; as you can see above, cluster.conf uses dedicated hostnames for the cluster interfaces, which are resolved to IPs via hosts files that are identical on all four machines. I have yet to do any packet sniffing, and I have very little log information because of the first problem, so I know this isn't a lot to go on; but I thought I'd include it anyway, in case someone can immediately point out the problem.
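
For reference, the cluster names map onto the cluster interfaces roughly like this (the addresses below are made up; the real hosts files just map the *-cluster names from cluster.conf to the cluster-interface IPs), and when I do get to packet sniffing, this is the sort of capture I have in mind:

# /etc/hosts fragment, identical on all four machines (addresses are examples)
192.168.20.1   smb1-cluster
192.168.20.2   smb2-cluster
192.168.20.3   mail-cluster
192.168.20.4   backup-cluster

# Capture OpenAIS/CMAN traffic on the cluster interface (interface name is an
# example; the actual multicast address and port in use are shown by
# cman_tool status, the totem default port being 5405)
tcpdump -ni eth1 udp port 5405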

Thanks in advance,
Daniel
