Re: [cman] can't join cluster after reboot

Yuriy Demchenko <demchenko.ya@xxxxxxxxx> · Thu, 07 Nov 2013 17:27:48 +0400



    Nope, nothing in the logs suggests that the node is fenced during
      the reboot. Moreover, the same behaviour persists with pacemaker
      started, and I explicitly put the node into standby in pacemaker
      before the reboot.

      The same behaviour also persists with stonith-enabled=false, and
      with a manual node fence via "stonith_admin --reboot
      node-1.spb.stone.local". So I suppose fencing isn't the issue here.

      
      Yuriy Demchenko
      On 11/07/2013 05:11 PM, Vishesh kumar wrote:

    
        My understanding is that the node is fenced while rebooting. I
          suggest you look into the fencing logs as well. If your fencing
          logs are not detailed enough, use the following in cluster.conf
          to enable debug logging:

          
          <logging>
             <logging_daemon name="fenced" debug="on"/>
          </logging>
          

        Thanks

      
        On Thu, Nov 7, 2013 at 5:34 PM, Yuriy
          Demchenko <demchenko.ya@xxxxxxxxx>
          wrote:

          Hi,

            
            I'm trying to set up a 3-node cluster (2 nodes + 1 standby
            node for quorum) with the cman+pacemaker stack, following
            this quickstart article: http://clusterlabs.org/quickstart-redhat.html

            
            The cluster starts, all nodes see each other, quorum is
            gained, and stonith works, but I've run into a problem with
            cman: a node can't rejoin the cluster after a reboot. cman
            starts, but "cman_tool nodes" on that node reports only
            itself as a cluster member, while the other 2 nodes report
            2 nodes as cluster members and the 3rd as offline. cman
            stop/start/restart on the problem node has no effect: it
            still sees only itself. But if I restart cman on one of the
            working nodes, everything goes back to normal: all 3 nodes
            join the cluster, and subsequent cman service restarts on
            any node work fine (the node leaves the cluster and rejoins
            successfully). But again, only until the next OS reboot.

            
            For example:

            [1] Working cluster:

            
              [root@node-1 ~]# cman_tool nodes
              Node  Sts   Inc   Joined               Name
                 1   M    592   2013-11-07 15:20:54  node-1.spb.stone.local
                 2   M    760   2013-11-07 15:20:54  node-2.spb.stone.local
                 3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
              [root@node-1 ~]# cman_tool status
              Version: 6.2.0
              Config Version: 10
              Cluster Name: ocluster
              Cluster Id: 2059
              Cluster Member: Yes
              Cluster Generation: 760
              Membership state: Cluster-Member
              Nodes: 3
              Expected votes: 3
              Total votes: 3
              Node votes: 1
              Quorum: 2
              Active subsystems: 7
              Flags:
              Ports Bound: 0
              Node name: node-1.spb.stone.local
              Node ID: 1
              Multicast addresses: 239.192.8.19
              Node addresses: 192.168.220.21

            
            The picture is the same on all 3 nodes (except for node name
            and ID): same cluster name, cluster ID, and multicast address.
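
The "Quorum: 2" value in the status output is consistent with a simple majority rule. A minimal sketch, assuming cman derives the threshold as expected_votes / 2 + 1 with integer division; the function names are illustrative and not part of any cman API:

```python
def quorum_threshold(expected_votes):
    """Majority threshold: strictly more than half of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True when the votes currently visible reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Expected votes: 3 and Total votes: 3, as in the status output above.
    print("threshold:", quorum_threshold(3))
    print("3 of 3 votes quorate:", is_quorate(3, 3))
    # A node that sees only its own single vote.
    print("1 of 3 votes quorate:", is_quorate(1, 3))
```

Under this assumption a partition holding 2 of the 3 votes stays quorate, while a node that sees only itself does not.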

            
            [2] I rebooted node-1. After the reboot completed,
            "cman_tool nodes" on node-2 and vnode-3 shows this:

            
              Node  Sts   Inc   Joined               Name
                 1   X    760                        node-1.spb.stone.local
                 2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
                 3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
              [root@node-2 ~]# cman_tool status
              Version: 6.2.0
              Config Version: 10
              Cluster Name: ocluster
              Cluster Id: 2059
              Cluster Member: Yes
              Cluster Generation: 764
              Membership state: Cluster-Member
              Nodes: 2
              Expected votes: 3
              Total votes: 2
              Node votes: 1
              Quorum: 2
              Active subsystems: 7
              Flags:
              Ports Bound: 0
              Node name: node-2.spb.stone.local
              Node ID: 2
              Multicast addresses: 239.192.8.19
              Node addresses: 192.168.220.22

            
            But on the rebooted node-1 it shows this:

            
              Node  Sts   Inc   Joined               Name
                 1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
                 2   X      0                        node-2.spb.stone.local
                 3   X      0                        vnode-3.spb.stone.local
              [root@node-1 ~]# cman_tool status
              Version: 6.2.0
              Config Version: 10
              Cluster Name: ocluster
              Cluster Id: 2059
              Cluster Member: Yes
              Cluster Generation: 776
              Membership state: Cluster-Member
              Nodes: 1
              Expected votes: 3
              Total votes: 1
              Node votes: 1
              Quorum: 2 Activity blocked
              Active subsystems: 7
              Flags:
              Ports Bound: 0
              Node name: node-1.spb.stone.local
              Node ID: 1
              Multicast addresses: 239.192.8.19
              Node addresses: 192.168.220.21

            
            So: same cluster name, cluster ID, and multicast address, but
            it can't see the other nodes. And there is nothing in
            /var/log/messages or /var/log/cluster/corosync.log on the
            other two nodes. They don't seem to notice node-1 coming back
            online at all; the last records are about node-1 leaving the
            cluster.
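
The disagreement between the two views can be checked mechanically. A minimal sketch, assuming each row of "cman_tool nodes" is on a single line and that status "M" means member and "X" means not a member; the helper names (parse_nodes, members, split_views) are hypothetical:

```python
import re

# One row of `cman_tool nodes`: node id, status letter, incarnation,
# an optional join timestamp, and the node name.
ROW = re.compile(
    r"^\s*(\d+)\s+([A-Za-z])\s+(\d+)\s+"
    r"(?:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+)?(\S+)\s*$"
)

# Outputs copied from step [2] of this message.
NODE1_VIEW = """\
Node  Sts   Inc   Joined               Name
   1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
   2   X      0                        node-2.spb.stone.local
   3   X      0                        vnode-3.spb.stone.local
"""
NODE2_VIEW = """\
Node  Sts   Inc   Joined               Name
   1   X    760                        node-1.spb.stone.local
   2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
   3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
"""


def parse_nodes(output):
    """Map node name -> status letter; the header row does not match."""
    view = {}
    for line in output.splitlines():
        match = ROW.match(line)
        if match:
            view[match.group(5)] = match.group(2)
    return view


def members(view):
    """Names of the nodes this view reports as cluster members."""
    return {name for name, status in view.items() if status == "M"}


def split_views(view_a, view_b):
    """True when both views have members but share none of them."""
    a, b = members(view_a), members(view_b)
    return bool(a) and bool(b) and a.isdisjoint(b)


if __name__ == "__main__":
    v1, v2 = parse_nodes(NODE1_VIEW), parse_nodes(NODE2_VIEW)
    print("node-1 sees:", sorted(members(v1)))
    print("node-2 sees:", sorted(members(v2)))
    print("disjoint memberships:", split_views(v1, v2))
```

With the outputs from [2], the two member sets share no node: node-1 has formed a membership of its own, while node-2 and vnode-3 keep theirs.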

            
            [3] If I now do "service cman restart" on node-2 or vnode-3,
            everything goes back to normal operation as in [1].

            In the logs this shows up as node-2 leaving the cluster
            (service stop), followed by node-2 and node-1 joining
            simultaneously (service start):

            
              Nov  7 11:47:06 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 2 3
              Nov  7 11:47:06 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
              Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
              Nov  7 11:47:06 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.22) ; members(old:3 left:1)
              Nov  7 11:47:06 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.
              Nov  7 11:53:28 vnode-3 corosync[26692]:   [QUORUM] Members[1]: 3
              Nov  7 11:53:28 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
              Nov  7 11:53:28 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.14) ; members(old:2 left:1)
              Nov  7 11:53:28 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.
              Nov  7 11:53:28 vnode-3 kernel: dlm: closing connection to node 2
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [CPG   ] chosen downlist: sender r(0) ip(192.168.220.21) ; members(old:1 left:0)
              Nov  7 11:53:30 vnode-3 corosync[26692]:   [MAIN  ] Completed service synchronization, ready to provide service.
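
The membership changes in the log above are easier to follow as a timeline. A minimal sketch, assuming one syslog entry per line in the format shown; membership_timeline is a hypothetical helper and the sample keeps only a few of the lines from the excerpt:

```python
import re

# Matches corosync "[QUORUM] Members[n]: ids..." syslog entries.
QUORUM = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ corosync\[\d+\]:\s+"
    r"\[QUORUM\] Members\[(\d+)\]:((?: \d+)+)\s*$"
)

# Subset of the vnode-3 log excerpt from this message.
LOG = """\
Nov  7 11:47:06 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 2 3
Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
Nov  7 11:53:28 vnode-3 corosync[26692]:   [QUORUM] Members[1]: 3
Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
"""


def membership_timeline(log_text):
    """Return (timestamp, member ids) pairs, skipping consecutive repeats."""
    timeline = []
    for line in log_text.splitlines():
        match = QUORUM.match(line.strip())
        if not match:
            continue  # not a QUORUM membership entry
        entry = (match.group(1), tuple(int(n) for n in match.group(3).split()))
        if not timeline or timeline[-1] != entry:
            timeline.append(entry)
    return timeline


if __name__ == "__main__":
    for stamp, ids in membership_timeline(LOG):
        print(stamp, "members:", ids)
```

Read this way, vnode-3 sees the membership go from nodes 2 and 3, to node 3 alone, and then to all three nodes within about two seconds of node-2's restart.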

            
            I've set up a cluster like this before in much the same
            configuration and never had any problems, but now I'm
            completely stuck.

            So, what is wrong with my cluster, and how do I fix it?

            
            OS is CentOS 6.4 with the latest updates, firewall disabled,
            SELinux permissive, all 3 nodes on the same network.
            Multicast is working (checked with omping).

            cman.x86_64                   3.0.12.1-49.el6_4.2   @centos6-updates
            corosync.x86_64               1.4.1-15.el6_4.1      @centos6-updates
            pacemaker.x86_64              1.1.10-1.el6_4.4      @centos6-updates

            
            cluster.conf is attached.

                
                -- 

                Yuriy Demchenko

                
            --

            Linux-cluster mailing list

            Linux-cluster@xxxxxxxxxx

            https://www.redhat.com/mailman/listinfo/linux-cluster

          
        -- 

        http://linuxmantra.com
      
      