Re: Nodes leaving and re-joining intermittently

Please find below the cluster.conf Matt mentioned.

Regarding logs, I have verified that the two SNMP trap notifications Matt posted in his first message are the only ones our script processed anywhere near this event window (the previous one was days earlier, and there have been none since). Tomorrow I will go through the on-disk logging on each of the cluster nodes and see whether there is anything of note over that period.
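
In case it is useful, the sort of check I have in mind on each node is roughly the following (the log path and date pattern assume the default RHEL syslog layout, so treat it as a sketch rather than exactly what I will run):

        # Pull cluster-stack entries from around the event window on each node;
        # adjust the date pattern to the actual window.
        grep -E 'corosync|openais|cman|fenced|rgmanager' /var/log/messages* | grep 'Dec 1[01]'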

Thanks,

Chris

<?xml version="1.0"?>
<cluster config_version="30" name="camra">

        <fence_daemon clean_start="1" post_fail_delay="30" post_join_delay="30" override_time="30"/>

        <clusternodes>
                <clusternode name="xxx.xxx.xxx.1" nodeid="1">
                        <fence>
                                <method name="ilo">
                                        <device name="ilo1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xxx.xxx.xxx.2" nodeid="2">
                        <fence>
                                <method name="ilo">
                                        <device name="ilo2"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="xxx.xxx.xxx.3" nodeid="3">
                        <fence>
                                <method name="ilo">
                                        <device name="ilo3"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>

        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="mysql" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.1" priority="1"/>
                                <failoverdomainnode name="xxx.xxx.xxx.2" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="solr" nofailback="0" ordered="1" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.2" priority="2"/>
                                <failoverdomainnode name="xxx.xxx.xxx.1" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="cluster1" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.1"/>
                        </failoverdomain>
                        <failoverdomain name="cluster2" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.2"/>
                        </failoverdomain>
                        <failoverdomain name="cluster3" nofailback="1" ordered="0" restricted="1">
                                <failoverdomainnode name="xxx.xxx.xxx.3"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/fstest" name="fstest"/>
                        <script file="/etc/init.d/postfixstatus" name="postfixstatus"/>
                        <script file="/etc/init.d/snmpdstatus" name="snmpdstatus"/>
                        <script file="/etc/init.d/snmptrapdstatus" name="snmptrapdstatus"/>
                        <script file="/etc/init.d/foghornstatus" name="foghornstatus"/>
                </resources>
                <service domain="cluster1" max_restarts="50" name="snmptrap1" recovery="restart-disable" restart_expire_time="900">
                        <script ref="fstest"/>
                        <script ref="foghornstatus"/>
                        <script ref="snmpdstatus"/>
                        <script ref="postfixstatus"/>
                        <script ref="snmptrapdstatus"/>
                </service>
                <service domain="cluster2" max_restarts="50" name="snmptrap2" recovery="restart-disable" restart_expire_time="900">
                        <script ref="fstest"/>
                        <script ref="foghornstatus"/>
                        <script ref="snmpdstatus"/>
                        <script ref="postfixstatus"/>
                        <script ref="snmptrapdstatus"/>
                </service>
                <service domain="cluster3" max_restarts="50" name="snmptrap3" recovery="restart-disable" restart_expire_time="900">
                        <script ref="fstest"/>
                        <script ref="foghornstatus"/>
                        <script ref="snmpdstatus"/>
                        <script ref="postfixstatus"/>
                        <script ref="snmptrapdstatus"/>
                </service>
        </rm>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.101" login="x" name="ilo1" passwd="x"/>
                <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.102" login="x" name="ilo2" passwd="x"/>
                <fencedevice agent="fence_ipmilan" ipaddr="xxx.xxx.xxx.103" login="x" name="ilo3" passwd="x"/>
        </fencedevices>
</cluster>

On 11 December 2011 11:12, Matthew Painter <matthew.painter@xxxxxxxxxx> wrote:
Thank you for your input :)

The nodes are synced using NTP, although I am unsure about the respective runlevels and start ordering.

I will look into this, thank you.


On Sun, Dec 11, 2011 at 7:16 AM, Dukhan, Meir <Mdukhan@xxxxxxx> wrote:

Are your nodes time-synced, and how?

We ran into problems with nodes being fenced because of an NTP problem.

The solution (AFAIR, from the Red Hat knowledge base) was to start ntpd _before_ cman.
I'm not sure, but there may have been an update to openais or ntpd regarding this issue.

For those of you who have a Red Hat account, see the Red Hat KB article:

       Does cman need to have the time of nodes in sync?
       https://access.redhat.com/kb/docs/DOC-42471
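
For example, on a stock SysV-init setup you can check the relative start order with something like this (purely illustrative; the service names assume the standard ntpd and cman init scripts):

       # Is ntpd enabled in the same runlevels as cman?
       chkconfig --list ntpd
       chkconfig --list cman

       # The S numbers show the actual start order; ntpd should start before cman
       ls /etc/rc3.d/ | grep -E 'ntpd|cman'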

Hope this helps,

Regards,
-- Meir R. Dukhan

|-----Original Message-----
|From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-
|bounces@xxxxxxxxxx] On Behalf Of Digimer
|Sent: Sunday, December 11, 2011 0:23 AM
|To: Matthew Painter
|Cc: linux clustering
|Subject: Re: Nodes leaving and re-joining intermittently
|
|On 12/10/2011 05:00 PM, Matthew Painter wrote:
|> The switch was our first thought, but that has been swapped, and while
|> we are not having nodes fenced anymore (we were daily), this anomaly
|> remains.
|>
|> I will ask for those logs and conf on Monday.
|>
|> I think it might be worth reinstalling corosync on this box anyway?
|> It can't be healthy if it is exiting uncleanly. I have had reports of
|> rgmanager dying on this box (PID file present but process not running).
|> Could that be related?
|>
|> Thanks :)
|
|It's impossible to say without knowing your configuration. Please share the
|cluster.conf (only obfuscate passwords, please) along with the log files.
|The more detail, the better. Versions, distros, network config, etc.
|
|Reinstalling corosync is not likely to help. RGManager sits fairly high up
|in the stack, so it's not likely to be the cause either.
|
|Did you configure the timeouts to be very high, by chance? I'm finding it
|difficult to fathom how the node can withdraw without being fenced, short
|of cleanly stopping the cluster stack. I suspect there is something
|important not being said, which the configuration information, versions and
|logs will hopefully expose.
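|
|For what it's worth, raised timeouts would normally show up as a totem
|element in cluster.conf, so a quick check is something like this (the
|token value shown is purely illustrative, not a recommendation):
|
|       grep -i '<totem' /etc/cluster/cluster.conf
|       # e.g. <totem token="30000"/> would set a much longer token
|       # timeout than the default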
|
|--
|Digimer
|E-Mail:              digimer@xxxxxxxxxxx
|Freenode handle:     digimer
|Papers and Projects: http://alteeve.com
|Node Assassin:       http://nodeassassin.org
|"omg my singularity battery is dead again.
|stupid hawking radiation." - epitron
|
|--
|Linux-cluster mailing list
|Linux-cluster@xxxxxxxxxx
|https://www.redhat.com/mailman/listinfo/linux-cluster

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
