Problems with logging and cluster instability

Hello folks,

We've been having two nasty problems with a GFS cluster, currently running cluster suite 2.03.03 and OpenAIS 0.80.3.

The first is that for some time now, logging has been broken. We're getting kernel log messages from the DLM and GFS modules, but the userland utilities (i.e. OpenAIS) refuse to log at all when run under the cluster suite. Logging is fine when OpenAIS is started on its own (i.e. with the default OpenAIS config file), so I'm pretty sure the logging setup itself is not the problem. Somehow, OpenAIS does not seem to be given the correct logging parameters by CMAN, and I really don't know why. I've tried adding extra logging directives to cluster.conf, in various different forms, but to no avail. The cluster.conf we're using now is as follows:

<?xml version="1.0"?>
<cluster name="gfscluster" config_version="6">

  <clusternodes>
    <clusternode name="smb1-cluster" nodeid="1">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="1"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="smb1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="smb2-cluster" nodeid="2">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="2"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="smb2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="mail-cluster" nodeid="3">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="3"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="mail"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="backup-cluster" nodeid="4">
      <fence>
        <method name="powerswitch">
          <device name="powerswitch" port="4"/>
        </method>
        <method name="last_resort">
          <device name="manual" nodename="backup"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>

  <fencedevices>
    <fencedevice name="powerswitch" agent="fence_epc" host="192.168.10.xx" passwd="xxx" action="4"/>
    <fencedevice name="manual" agent="fence_manual"/>
  </fencedevices>

  <fence_daemon post_join_delay="30">
  </fence_daemon>

  <logging to_syslog="yes" syslog_facility="local3">
    <logger ident="CPG" to_syslog="yes">
    </logger>
    <logger ident="CMAN" to_syslog="yes">
    </logger>
    <logger ident="CLM" to_syslog="yes">
    </logger>
  </logging>

</cluster>

Any idea why this might not be working?
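
For comparison, this is roughly the openais.conf-style logging stanza (whitetank / 0.80 syntax) that works when OpenAIS is started standalone. The values below are only a sketch, not our literal config, but it's what I would expect CMAN to hand OpenAIS an equivalent of when it reads the <logging> block above:

logging {
        # Global output options; syslog output does appear when aisexec
        # is started on its own with a stanza like this.
        to_stderr: no
        to_syslog: yes
        syslog_facility: local3
        timestamp: on
        # Per-subsystem logger sections, analogous to the <logger ident="..."/>
        # elements in cluster.conf.
        logger {
                ident: CMAN
                debug: off
        }
        logger {
                ident: CPG
                debug: off
        }
}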

The second problem is that once quorum is reached, any additional node that joins makes the existing quorate cluster break apart. We've seen this in a three-node config when the third node joins, and in a four-node config when the fourth node joins; WHICH node joins last doesn't seem to make a difference. "Breaking apart" means that the newly joined node dies ("joining cluster with disallowed nodes, must die"), one of the existing nodes dies, and the remaining two existing nodes keep running but out of sync: each shows a different cluster membership and a different set of disallowed nodes. This happens after a fresh reboot, so there is NO state on any node before joining. The crash occurs at the cman_tool join stage.
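
Next time it happens I plan to capture the cluster state on each surviving node right after the split, with something along these lines (cluster 2 tools; output omitted here):

# Run on every node that is still up, immediately after the split,
# and compare the views side by side
cman_tool status   # quorum state, config version, cluster/node addresses
cman_tool nodes    # membership as seen from this node
group_tool         # fence/dlm/gfs group membership and state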

I have a gut feeling it might have something to do with our network config: three of the nodes have four ethernet interfaces each (two for iSCSI, one for cluster traffic, one for regular LAN access), while the fourth has only two (one for iSCSI, one for cluster traffic) and no LAN access for now. Routing tables etc. should be set up properly; as you can see above, cluster.conf uses dedicated hostnames for the cluster interfaces, which are resolved to IPs via hosts files that are identical on all four machines. I have yet to do any packet sniffing, and I have very little log information because of the first problem, so I know this isn't a lot to go on; but I thought I'd include it anyway, in case someone can immediately point out the problem.
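
For reference, the cluster names map onto the cluster interfaces roughly like this (the addresses below are made up; the real hosts files just map the *-cluster names from cluster.conf to the cluster-interface IPs), and when I do get to packet sniffing, this is the sort of capture I have in mind:

# /etc/hosts fragment, identical on all four machines (addresses are examples)
192.168.20.1   smb1-cluster
192.168.20.2   smb2-cluster
192.168.20.3   mail-cluster
192.168.20.4   backup-cluster

# Capture OpenAIS/CMAN traffic on the cluster interface (interface name is an
# example; the actual multicast address and port in use are shown by
# cman_tool status, the totem default port being 5405)
tcpdump -ni eth1 udp port 5405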

Thanks in advance,
Daniel
