Re: Issue starting the CMAP service

Patrick Hemmer <corosync@xxxxxxxxxxxxxxx> · Thu, 10 Oct 2013 08:11:38 -0400

    It wasn't.

      I still haven't fully tracked the issue down, but it was because of
      another node in the cluster. Node B which I had just started was
      trying to send traffic to node A. Node A was in a weird state.
      Node B would not start successfully until corosync on node A was
      restarted.

      I've had this happen a few times now in the last few days. The
      ability for one node to cause a start failure on another node is a
      significant problem.

      -Patrick

      From: Jan Friesse <jfriesse@xxxxxxxxxx>
      Sent:  2013-10-10 03:58:31 E
      To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>,
        Steven Dake <sdake@xxxxxxxxxx>
      CC: discuss@xxxxxxxxxxxx
      Subject: Re: Issue starting the CMAP service

      Patrick,
I'm sure it's really a firewall/switch problem. Please make sure that port
and port - 1 are not blocked. For testing purposes, you can just
disable the firewall completely and see if corosync works or not.

Regards,
  Honza
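
For reference, the config later in this thread does not set mcastport, so corosync falls back to its default of 5405, and "port and port - 1" works out to UDP 5405 and 5404. A minimal Python sketch of that arithmetic follows. The helper names corosync_udp_ports and iptables_rule are made up for illustration, and the rule string is only printed, never applied.

```python
from typing import Tuple

# corosync's default when mcastport is not set in corosync.conf
DEFAULT_MCASTPORT = 5405


def corosync_udp_ports(mcastport: int = DEFAULT_MCASTPORT) -> Tuple[int, int]:
    """Return (mcastport - 1, mcastport), the two UDP ports one ring needs."""
    if not 2 <= mcastport <= 65535:
        raise ValueError(f"mcastport out of range: {mcastport}")
    return (mcastport - 1, mcastport)


def iptables_rule(mcastport: int = DEFAULT_MCASTPORT) -> str:
    """Build an illustrative iptables rule accepting both ports (not applied)."""
    low, high = corosync_udp_ports(mcastport)
    return f"iptables -A INPUT -p udp --dport {low}:{high} -j ACCEPT"


if __name__ == "__main__":
    print(corosync_udp_ports())
    print(iptables_rule())
```

The same check has to hold on every node and on anything between them (security groups, switches), since each node both sends and receives on these ports.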

Patrick Hemmer wrote:

        From: Steven Dake <sdake@xxxxxxxxxx>
        Sent: 2013-09-30 18:12:25 E
        To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
        CC: discuss@xxxxxxxxxxxx
        Subject: Re: Issue starting the CMAP service

          On 09/30/2013 02:43 PM, Patrick Hemmer wrote:

            From: Steven Dake <sdake@xxxxxxxxxx>
            Sent: 2013-09-30 16:50:26 E
            To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
            CC: discuss@xxxxxxxxxxxx
            Subject: Re: Issue starting the CMAP service

              On 09/30/2013 01:45 PM, Patrick Hemmer wrote:

                I'm running corosync 2.3.2 on Ubuntu Precise. I'm playing with a 3
node cluster, and whenever I try to start corosync on one of the
nodes, it fails to start properly.
I just do a simple start with `corosync -f`, and whenever I try to 
use any of the tools, they error:

# corosync-cmapctl
Failed to initialize the cmap API. Error CS_ERR_TRY_AGAIN
# corosync-quorumtool
Cannot initialize CMAP service

If I wait long enough (about 9 minutes or 530 seconds), it does end
up starting, and the tools work, but corosync-quorumtool shows the
only member is itself.

However if I start corosync with `strace -f corosync -f` the tools
work fine immediately upon start (though it still doesn't show the
other nodes). Smells like race condition, but dunno where to begin.

              My guess is something is wrong with your network relating to
multicast.  Try using udpu mode - it is very stable now and removes
multicast from the list of things that can go wrong.

            I am using udpu, see the config :-)

          I assume you have the same config on all nodes?  If so, try using IP
addresses for the ring id.  Possibly a DNS resolution problem?

Other than that, I'm stumped.

        Yes, exact same config on all nodes. All hosts are present in
/etc/hosts. Also when I do a tcpdump on the other nodes, I see traffic
on port 5405 coming from the node in question.
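
One way to test the DNS theory is to pull the ring0_addr values out of the config and see what each one resolves to on every node. The Python sketch below is only an illustration: ring0_addrs, is_ip_literal and resolve are hypothetical helper names, and the regex assumes one "ring0_addr: value" pair per line, as in the config posted in this thread.

```python
import ipaddress
import re
import socket

# Assumes one "ring0_addr: <value>" per line, as in the posted config.
RING0_RE = re.compile(r"^[ \t]*ring0_addr:[ \t]*(\S+)[ \t]*$", re.MULTILINE)

# nodelist section copied from the config in this thread
EXAMPLE_CONF = """
nodelist {
  node {
        ring0_addr: i-74eb9c2f
        nodeid: 1
       }
  node {
        ring0_addr: i-a3bf0df9
        nodeid: 2
       }
  node {
        ring0_addr: i-ebcfcbb0
        nodeid: 3
       }
}
"""


def ring0_addrs(conf_text):
    """Return every ring0_addr value in the order it appears."""
    return RING0_RE.findall(conf_text)


def is_ip_literal(addr):
    """True if addr is already an IPv4/IPv6 literal and needs no lookup."""
    try:
        ipaddress.ip_address(addr)
        return True
    except ValueError:
        return False


def resolve(addr):
    """Resolve addr with the system resolver; return [] if the lookup fails."""
    try:
        infos = socket.getaddrinfo(addr, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    for addr in ring0_addrs(EXAMPLE_CONF):
        kind = "IP literal" if is_ip_literal(addr) else "hostname"
        print(f"{addr}: {kind}")
```

Calling resolve(addr) for each hostname on each of the three nodes and comparing the results would show whether /etc/hosts really agrees everywhere. If the results differ, switching ring0_addr to the IP addresses takes name resolution out of the picture.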

          Regards
-steve

              Regards
-steve

                This is the output from `corosync -f` (this node is 10.20.0.212):
notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
notice  [TOTEM ] Initializing transmit/receive security (NSS)
crypto: none hash: none
notice  [TOTEM ] The network interface [10.20.0.212] is now up.
notice  [TOTEM ] adding new UDPU member {10.20.0.127}
notice  [TOTEM ] adding new UDPU member {10.20.0.212}
notice  [TOTEM ] adding new UDPU member {10.20.2.124}
notice  [TOTEM ] A new membership (10.20.0.212:1122820) was formed.
Members joined: 2
notice  [TOTEM ] A new membership (10.20.0.127:1122824) was formed.
Members joined: 1 3
### here is where it pauses for almost 9 minutes ###
error   [TOTEM ] FAILED TO RECEIVE
notice  [TOTEM ] A new membership (10.20.0.212:1122876) was formed.
Members left: 1 3
notice  [TOTEM ] A new membership (10.20.0.212:1122936) was formed.
Members
notice  [TOTEM ] A new membership (10.20.0.212:1123008) was formed.
Members
notice  [TOTEM ] A new membership (10.20.0.212:1123064) was formed.
Members
notice  [TOTEM ] A new membership (10.20.0.212:1123124) was formed.
Members
notice  [TOTEM ] A new membership (10.20.0.212:1123180) was formed.
Members
notice  [TOTEM ] A new membership (10.20.0.212:1123248) was formed.
Members
notice  [TOTEM ] A new membership (10.20.0.127:1123256) was formed.
Members joined: 1 3

This is the config (created by `pcs` utility), it's exactly the
same on all 3 nodes, and the other 2 nodes work fine:
----
totem {
version: 2
secauth: off
cluster_name: hapi-server
transport: udpu
}

nodelist {
  node {
        ring0_addr: i-74eb9c2f
        nodeid: 1
       }
  node {
        ring0_addr: i-a3bf0df9
        nodeid: 2
       }
  node {
        ring0_addr: i-ebcfcbb0
        nodeid: 3
       }
}

quorum {
provider: corosync_votequorum
}

logging {
to_syslog: yes
}
----

-Patrick

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

Here's some additional info from the command line utils after waiting 9
minutes for it to come up:

# corosync-quorumtool
Quorum information
------------------
Date:             Mon Sep 30 22:16:24 2013
Quorum provider:  corosync_votequorum
Nodes:            1
Node ID:          2
Ring ID:          1124320
Quorate:          No

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      1
Quorum:           2 Activity blocked
Flags:           

Membership information
----------------------
    Nodeid      Votes Name
         2          1 i-a3bf0df9 (local)
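
The "Quorum: 2 Activity blocked" line follows from the vote counts shown above: with 3 expected votes a majority is 2, and this node only sees its own vote. A small Python sketch of that arithmetic, assuming plain majority quorum with none of the votequorum special options (two_node, last_man_standing and so on) enabled. The function names are made up for illustration.

```python
def quorum_threshold(expected_votes):
    """Votes needed for a simple majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(total_votes, expected_votes):
    """True if the votes currently visible reach the majority threshold."""
    return total_votes >= quorum_threshold(expected_votes)


if __name__ == "__main__":
    # Values from the corosync-quorumtool output above.
    expected, total = 3, 1
    print("quorum:", quorum_threshold(expected))
    print("quorate:", is_quorate(total, expected))
```

So the node stays blocked until it can see at least one of the other two members.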

# corosync-cmapctl |grep member
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.3.status (str) = joined
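
The join_count values are worth a second look: nodes 1 and 3 show 15 while the local node shows 1, which appears consistent with the repeated membership changes in the log. A hedged Python sketch for pulling these keys into a per-node table follows. parse_members is a hypothetical helper, and the regex only assumes the "key (type) = value" layout seen in the output above.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+) \((\w+)\) = (.*)$"
)

# Output copied from `corosync-cmapctl | grep member` in this thread
SAMPLE = """\
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.3.status (str) = joined
"""


def parse_members(text):
    """Group member keys by node id; integer-typed values become ints."""
    members = defaultdict(dict)
    for line in text.splitlines():
        match = MEMBER_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a member key
        nodeid, field, value_type, value = match.groups()
        if re.fullmatch(r"[iu]\d+", value_type):
            value = int(value)
        members[int(nodeid)][field] = value
    return dict(members)


if __name__ == "__main__":
    for nodeid, fields in sorted(parse_members(SAMPLE).items()):
        print(nodeid, fields["ip"], fields["status"], fields["join_count"])
```

Running the same thing against the output from all three nodes and comparing join_count over time would show which nodes keep dropping out and rejoining.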

-Patrick
