RE: Problems W/GFS after rebooting 1 node on the other nodes

Hello all,
         Well, after fighting this problem for the last few months, I finally figured it out the day after I posted the first email below. The problem is caused by the number of GFS file systems being mounted. I had 200 GFS file systems mounted on each of my 3 nodes, the same 200 on each node. I learned about a month ago that I could not exceed 250 file systems on each node: mounting the 251st caused CMAN to stop and GFS to hang on all nodes. If 250 is the limit for 3 nodes, that means you cannot exceed ~83 DLM processes per node (250 / 3). When I rebooted one of the existing nodes, the DLM count on each of the 2 remaining nodes rose to 100 (200 / 2), which is just too many.
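
        For anyone who wants to check their own nodes, here is a minimal sketch of that arithmetic. It assumes GFS1 mounts show up with type "gfs" in /proc/mounts, and it uses the even-split estimate from this post, which is my own rule of thumb and not a documented formula. The function names count_gfs_mounts and per_node_share are made up for illustration.

```shell
#!/bin/sh
# Count GFS mounts listed in a mounts table (default: /proc/mounts).
# Field 3 of each line is the file system type.
count_gfs_mounts() {
    awk '$3 == "gfs" { n++ } END { print n + 0 }' "${1:-/proc/mounts}"
}

# Per-node share under the estimate from this post (an assumption, not an
# official formula): a total divided evenly across the nodes currently in
# the cluster, using integer division.
per_node_share() {
    echo $(( $1 / $2 ))
}

# Only report when a mounts table is actually available on this host.
if [ -r /proc/mounts ]; then
    mounts=$(count_gfs_mounts)
    echo "GFS mounts on this node: $mounts"
    echo "share with 3 nodes up:   $(per_node_share "$mounts" 3)"
    echo "share with 2 nodes up:   $(per_node_share "$mounts" 2)"
fi
```

        With the numbers above, per_node_share 250 3 gives 83 and per_node_share 200 2 gives 100, which is the jump I hit when one node went down.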

        What is the official maximum number of GFS file systems mounted per node? I can not find this information anywhere.

        Jon

_____________________________________________
From:   Swift, Jon S              PWR 
Sent:   Thursday, April 17, 2008 4:42 PM
To:     'linux-cluster@xxxxxxxxxx'
Subject:        Problems W/GFS after rebooting 1 node on the other nodes

Hi,
        My name is Jon Swift and I have been trying on and off for a few months to get NFS v3 using GFS1 and RHEL5.1 to work dependably. I have a 3 node cluster made up of Dell 1850s, each using 2 HBAs to connect to an EMC CX3-40 SAN. I use PowerPath 5.1 to manage the multiple paths to the SAN. I have tried both Dell DRAC and IPMI fencing, and the cluster is currently configured to use IPMI fencing. My number one dependability issue is that, following a reboot of 1 node in my cluster, I have problems with the remaining nodes. I use the following steps before rebooting the node.

/etc/init.d/rgmanager stop
exportfs -ua
/etc/init.d/nfs stop
/etc/init.d/gfs stop
/etc/init.d/clvmd stop
/sbin/fence_tool leave
sync; sync
sleep 5
/etc/init.d/cman stop leave
reboot
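
        The steps above can be wrapped in a small script so that a failed step is noticed before the node goes down. This is only a sketch: the commands are copied as listed, the final reboot is left to the operator, and the names STEPS, run_steps and DRY_RUN are made up for illustration. By default it only prints the steps.

```shell
#!/bin/sh
# Steps copied from the list above, one per line, in the same order.
# The reboot itself is deliberately left out.
STEPS='/etc/init.d/rgmanager stop
exportfs -ua
/etc/init.d/nfs stop
/etc/init.d/gfs stop
/etc/init.d/clvmd stop
/sbin/fence_tool leave
sync; sync
sleep 5
/etc/init.d/cman stop leave'

# run_steps: with DRY_RUN=1 (the default here) only print each step;
# with DRY_RUN=0 run each step and stop at the first one that fails.
run_steps() {
    echo "$STEPS" | while IFS= read -r step; do
        echo "step: $step"
        if [ "${DRY_RUN:-1}" = "0" ]; then
            # stdin is redirected so a step cannot swallow the step list
            sh -c "$step" </dev/null || { echo "failed: $step" >&2; exit 1; }
        fi
    done
}

run_steps
```

        Running it with DRY_RUN=0 executes the steps for real and returns a non-zero status at the first failure, so the reboot can be held back until every service has stopped cleanly.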

        Then, most of the time, just after the reboot step completes on the rebooted node, CMAN stops and GFS hangs on the remaining nodes in the cluster. The following is typical of the entries in the /var/log/messages file on one of those nodes.

Apr 17 15:07:46 cnfs03 openais[5179]: [TOTEM] entering GATHER state from 12.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering GATHER state from 11.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] Saving state aru 1016 high seq received 1016
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] Storing new sequence id for ring 6ffe0
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering COMMIT state.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering RECOVERY state.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] position [0] member 192.168.15.2:
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] previous ring seq 458716 rep 192.168.15.1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] aru 1016 high delivered 1016 received flag 1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] position [1] member 192.168.15.3:
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] previous ring seq 458716 rep 192.168.15.1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] aru 1016 high delivered 1016 received flag 1
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] Did not need to originate any messages in recovery.
Apr 17 15:07:51 cnfs03 kernel: dlm: closing connection to node 1
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] CLM CONFIGURATION CHANGE
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] New Configuration:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ]   r(0) ip(192.168.15.2)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ]   r(0) ip(192.168.15.3)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] Members Left:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ]   r(0) ip(192.168.15.1)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] Members Joined:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] CLM CONFIGURATION CHANGE
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] New Configuration:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ]   r(0) ip(192.168.15.2)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ]   r(0) ip(192.168.15.3)
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] Members Left:
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] Members Joined:
Apr 17 15:07:51 cnfs03 openais[5179]: [SYNC ] This node is within the primary component and will provide service.
Apr 17 15:07:51 cnfs03 openais[5179]: [TOTEM] entering OPERATIONAL state.
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] got nodejoin message 192.168.15.2
Apr 17 15:07:51 cnfs03 openais[5179]: [CLM  ] got nodejoin message 192.168.15.3
Apr 17 15:07:57 cnfs03 gfs_controld[5207]: cluster is down, exiting
Apr 17 15:07:57 cnfs03 dlm_controld[5201]: groupd is down, exiting
Apr 17 15:07:57 cnfs03 kernel: dlm: connecting to 2
Apr 17 15:07:57 cnfs03 fenced[5195]: cluster is down, exiting
Apr 17 15:08:01 cnfs03 kernel: dlm: closing connection to node 3
Apr 17 15:08:10 cnfs03 kernel: dlm: closing connection to node 2
Apr 17 15:08:23 cnfs03 ccsd[5159]: Unable to connect to cluster infrastructure after 30 seconds.
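
        The two things worth pulling out of a log like this are which member left and which daemons exited afterwards. Here is a rough sketch that does so, assuming the syslog layout shown above (daemon name in the fifth field, member addresses as "ip(...)"). The function names members_left and exited_daemons are made up for illustration.

```shell
#!/bin/sh
# members_left: print the addresses listed between "Members Left:" and
# "Members Joined:" in the openais CLM entries of a log file.
members_left() {
    awk '
        /Members Left:/   { in_left = 1; next }
        /Members Joined:/ { in_left = 0 }
        in_left && match($0, /ip\([0-9.]+\)/) {
            # skip the leading "ip(" and the trailing ")"
            print substr($0, RSTART + 3, RLENGTH - 4)
        }
    ' "$1"
}

# exited_daemons: print the name of each daemon that logged
# "... is down, exiting", with the "[pid]:" suffix stripped.
exited_daemons() {
    grep 'is down, exiting' "$1" |
        awk '{ sub(/\[[0-9]+\]:$/, "", $5); print $5 }'
}

# Default to /var/log/messages; do nothing if it is not readable.
log=${1:-/var/log/messages}
if [ -r "$log" ]; then
    echo "members that left:"; members_left "$log"
    echo "daemons that exited:"; exited_daemons "$log"
fi
```

        Against the excerpt above this would list 192.168.15.1 as the member that left, followed by gfs_controld, dlm_controld and fenced as the daemons that exited.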

        When the rebooted node comes back up, it starts CMAN but does not start clvmd and does not join the fence domain. As a result, none of the GFS file systems are mounted.

Below is a copy of the /etc/cluster/cluster.conf file.

<?xml version="1.0"?>
<cluster alias="cnfs_cluster" config_version="40" name="cnfs">
        <fence_daemon clean_start="0" post_fail_delay="2" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="ic-cnfs01" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="IPMI_LAN_CNFS01"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="ic-cnfs02" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="" name="IPMI_LAN_CNFS02"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="ic-cnfs03" nodeid="3" votes="1">
                        <fence>
                                <method name="1">
                                        <device lanplus="" name="IPMI_LAN_CNFS03"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_ipmilan" auth="none" ipaddr="ipmi-cnfs01" login="root" name="IPMI_LAN_CNFS01" passwd="Rocknro11"/>

                <fencedevice agent="fence_ipmilan" auth="md5" ipaddr="ipmi-cnfs02" login="root" name="IPMI_LAN_CNFS02" passwd="Rocknro11"/>

                <fencedevice agent="fence_ipmilan" auth="md5" ipaddr="ipmi-cnfs03" login="root" name="IPMI_LAN_CNFS03" passwd="Rocknro11"/>

        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="cnfs03-down-failover" ordered="1">
                                <failoverdomainnode name="ic-cnfs01" priority="3"/>
                                <failoverdomainnode name="ic-cnfs02" priority="2"/>
                                <failoverdomainnode name="ic-cnfs03" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="cnfs02-down-failover" ordered="1">
                                <failoverdomainnode name="ic-cnfs01" priority="2"/>
                                <failoverdomainnode name="ic-cnfs02" priority="1"/>
                                <failoverdomainnode name="ic-cnfs03" priority="3"/>
                        </failoverdomain>
                        <failoverdomain name="cnfs01-down-failover" ordered="1">
                                <failoverdomainnode name="ic-cnfs01" priority="1"/>
                                <failoverdomainnode name="ic-cnfs02" priority="3"/>
                                <failoverdomainnode name="ic-cnfs03" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="cnfs01-up-failover" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="1"/>
                                <failoverdomainnode name="ic-cnfs02" priority="2"/>
                                <failoverdomainnode name="ic-cnfs03" priority="3"/>
                        </failoverdomain>
                        <failoverdomain name="cnfs02-up-failover" ordered="1">
                                <failoverdomainnode name="ic-cnfs01" priority="3"/>
                                <failoverdomainnode name="ic-cnfs02" priority="1"/>
                                <failoverdomainnode name="ic-cnfs03" priority="2"/>
                        </failoverdomain>
                        <failoverdomain name="cnfs03-up-failover" ordered="1" restricted="0">
                                <failoverdomainnode name="ic-cnfs01" priority="2"/>
                                <failoverdomainnode name="ic-cnfs02" priority="3"/>
                                <failoverdomainnode name="ic-cnfs03" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <service autostart="1" domain="cnfs01-up-failover" name="cnfs01-vip1" recovery="restart">
                        <ip address="XXXXXXX" monitor_link="1"/>
                </service>
               <service autostart="1" domain="cnfs02-up-failover" name="cnfs02-vip1" recovery="restart">
                       <ip address="XXXXXXX" monitor_link="1"/>
               </service>
               <service autostart="1" domain="cnfs03-up-failover" name="cnfs03-vip1" recovery="restart">
                       <ip address="XXXXXXX" monitor_link="1"/>
               </service>
               <service autostart="1" domain="cnfs01-down-failover" name="cnfs01-vip2" recovery="restart">
                       <ip address="XXXXXXX" monitor_link="1"/>
               </service>
               <service autostart="1" domain="cnfs02-down-failover" name="cnfs02-vip2" recovery="restart">
                       <ip address="XXXXXXX" monitor_link="1"/>
               </service>
               <service autostart="1" domain="cnfs03-down-failover" name="cnfs03-vip2" recovery="restart">
                       <ip address="XXXXXXX" monitor_link="1"/>
               </service>
       </rm>
</cluster>

        How can I safely reboot one node, without affecting CMAN and GFS on the remaining nodes in the cluster?

        Thanks in advance.


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 Jon Swift                   Pratt & Whitney Rocketdyne
                             Unix Team Technical Lead
                             email  : jon.swift@xxxxxxxxxxx
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-


