Patrick,
do you have a reproducer for that issue? If so, we can try to find out
what the real problem is and fix it.

Regards,
  Honza

Patrick Hemmer wrote:
> It wasn't.
>
> I still haven't fully tracked the issue down, but it was because of
> another node in the cluster. Node B which I had just started was trying
> to send traffic to node A. Node A was in a weird state. Node B would not
> start successfully until corosync on node A was restarted.
>
> I've had this happen a few times now in the last few days. The ability
> for one node to cause a start failure on another node is a significant
> problem.
>
> -Patrick
>
> ------------------------------------------------------------------------
> *From: *Jan Friesse <jfriesse@xxxxxxxxxx>
> *Sent: * 2013-10-10 03:58:31 E
> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, Steven Dake
> <sdake@xxxxxxxxxx>
> *CC: *discuss@xxxxxxxxxxxx
> *Subject: *Re: Issue starting the CMAP service
>
>> Patrick,
>> I'm sure it's really a firewall/switch problem. Please make sure that port
>> and port - 1 are not blocked. For testing purposes, you can just
>> disable the firewall completely and see if corosync works or not.
>>
>> Regards,
>> Honza
>>
>> Patrick Hemmer wrote:
>>> *From: *Steven Dake <sdake@xxxxxxxxxx>
>>> *Sent: * 2013-09-30 18:12:25 E
>>> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
>>> *CC: *discuss@xxxxxxxxxxxx
>>> *Subject: *Re: Issue starting the CMAP service
>>>
>>>> On 09/30/2013 02:43 PM, Patrick Hemmer wrote:
>>>>> *From: *Steven Dake <sdake@xxxxxxxxxx>
>>>>> *Sent: * 2013-09-30 16:50:26 E
>>>>> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
>>>>> *CC: *discuss@xxxxxxxxxxxx
>>>>> *Subject: *Re: Issue starting the CMAP service
>>>>>
>>>>>> On 09/30/2013 01:45 PM, Patrick Hemmer wrote:
>>>>>>> I'm running corosync 2.3.2 on ubuntu precise. I'm playing with a 3
>>>>>>> node cluster, and whenever I try to start corosync on one of the
>>>>>>> nodes, it fails to start properly.
>>>>>>> I just do a simple start with `corosync -f`, and whenever I try to
>>>>>>> use any of the tools, they error:
>>>>>>>
>>>>>>> # corosync-cmapctl
>>>>>>> Failed to initialize the cmap API. Error CS_ERR_TRY_AGAIN
>>>>>>> # corosync-quorumtool
>>>>>>> Cannot initialize CMAP service
>>>>>>>
>>>>>>> If I wait long enough (about 9 minutes or 530 seconds), it does end
>>>>>>> up starting, and the tools work, but corosync-quorumtool shows the
>>>>>>> only member is itself.
>>>>>>>
>>>>>>> However if I start corosync with `strace -f corosync -f` the tools
>>>>>>> work fine immediately upon start (though it still doesn't show the
>>>>>>> other nodes). Smells like a race condition, but dunno where to begin.
>>>>>>>
>>>>>>>
>>>>>> My guess is something is wrong with your network relating to
>>>>>> multicast. Try using udpu mode - it is very stable now and removes
>>>>>> multicast from the list of things that can go wrong.
>>>>>>
>>>>> I am using udpu, see the config :-)
>>>>>
>>>>>
>>>> I assume you have the same config on all nodes? If so, try using ip
>>>> addresses for the ring id. Possibly a DNS resolution problem?
>>>>
>>>> Other than that, I'm stumped
>>> Yes, exact same config on all nodes. All hosts are present in
>>> /etc/hosts. Also when I do a tcpdump on the other nodes, I see traffic
>>> on port 5405 coming from the node in question.
>>>
>>>> Regards
>>>> -steve
>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>>> This is the output from `corosync -f` (this node is 10.20.0.212):
>>>>>>> notice [TOTEM ] Initializing transport (UDP/IP Unicast).
>>>>>>> notice [TOTEM ] Initializing transmit/receive security (NSS)
>>>>>>> crypto: none hash: none
>>>>>>> notice [TOTEM ] The network interface [10.20.0.212] is now up.
>>>>>>> notice [TOTEM ] adding new UDPU member {10.20.0.127}
>>>>>>> notice [TOTEM ] adding new UDPU member {10.20.0.212}
>>>>>>> notice [TOTEM ] adding new UDPU member {10.20.2.124}
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1122820) was formed.
>>>>>>> Members joined: 2
>>>>>>> notice [TOTEM ] A new membership (10.20.0.127:1122824) was formed.
>>>>>>> Members joined: 1 3
>>>>>>> ### here is where it pauses for almost 9 minutes ###
>>>>>>> error [TOTEM ] FAILED TO RECEIVE
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1122876) was formed.
>>>>>>> Members left: 1 3
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1122936) was formed.
>>>>>>> Members
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123008) was formed.
>>>>>>> Members
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123064) was formed.
>>>>>>> Members
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123124) was formed.
>>>>>>> Members
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123180) was formed.
>>>>>>> Members
>>>>>>> notice [TOTEM ] A new membership (10.20.0.212:1123248) was formed.
>>>>>>> Members
>>>>>>> notice [TOTEM ] A new membership (10.20.0.127:1123256) was formed.
>>>>>>> Members joined: 1 3
>>>>>>>
>>>>>>>
>>>>>>> This is the config (created by `pcs` utility), it's exactly the
>>>>>>> same on all 3 nodes, and the other 2 nodes work fine:
>>>>>>> ----
>>>>>>> totem {
>>>>>>>     version: 2
>>>>>>>     secauth: off
>>>>>>>     cluster_name: hapi-server
>>>>>>>     transport: udpu
>>>>>>> }
>>>>>>>
>>>>>>> nodelist {
>>>>>>>     node {
>>>>>>>         ring0_addr: i-74eb9c2f
>>>>>>>         nodeid: 1
>>>>>>>     }
>>>>>>>     node {
>>>>>>>         ring0_addr: i-a3bf0df9
>>>>>>>         nodeid: 2
>>>>>>>     }
>>>>>>>     node {
>>>>>>>         ring0_addr: i-ebcfcbb0
>>>>>>>         nodeid: 3
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> quorum {
>>>>>>>     provider: corosync_votequorum
>>>>>>> }
>>>>>>>
>>>>>>> logging {
>>>>>>>     to_syslog: yes
>>>>>>> }
>>>>>>> ----
>>>>>>>
>>>>>>>
>>>>>>> -Patrick
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> discuss mailing list
>>>>>>> discuss@xxxxxxxxxxxx
>>>>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
>>>
>>> Here's some additional info from the command line utils after
>>> waiting 9 minutes for it to come up:
>>>
>>> # corosync-quorumtool
>>> Quorum information
>>> ------------------
>>> Date:             Mon Sep 30 22:16:24 2013
>>> Quorum provider:  corosync_votequorum
>>> Nodes:            1
>>> Node ID:          2
>>> Ring ID:          1124320
>>> Quorate:          No
>>>
>>> Votequorum information
>>> ----------------------
>>> Expected votes:   3
>>> Highest expected: 3
>>> Total votes:      1
>>> Quorum:           2 Activity blocked
>>> Flags:
>>>
>>> Membership information
>>> ----------------------
>>>     Nodeid      Votes Name
>>>          2          1 i-a3bf0df9 (local)
>>>
>>>
>>> # corosync-cmapctl |grep member
>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>> runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
>>> runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
>>> runtime.totem.pg.mrp.srp.members.3.status (str) = joined
>>>
>>>
>>> -Patrick
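
For anyone comparing nodes while chasing a similar problem, the
`corosync-cmapctl |grep member` output quoted above can be grouped per
node id. The following is only a rough sketch, not part of corosync:
the helper names (`parse_members`, `repeated_joiners`) are made up for
illustration, the sample text is copied from the output above, and
treating a `join_count` greater than 1 as "this member has joined more
than once" is an assumption based on the key name.

```python
import re

# Sample copied from the `corosync-cmapctl |grep member` output quoted above.
SAMPLE = """\
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.3.status (str) = joined
"""

# One "key (type) = value" line per member attribute, as in the sample.
LINE_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+) \((\w+)\) = (.*)$"
)


def parse_members(text):
    """Group member attributes by node id: {nodeid: {key: value}}."""
    members = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a member attribute line
        nodeid, key, _type, value = match.groups()
        members.setdefault(int(nodeid), {})[key] = value
    return members


def repeated_joiners(members, threshold=1):
    """Return node ids whose join_count exceeds the threshold.

    Assumption: a join_count above 1 means the member was seen joining
    more than once; this is inferred from the key name only.
    """
    return sorted(
        nodeid
        for nodeid, fields in members.items()
        if int(fields.get("join_count", 0)) > threshold
    )


if __name__ == "__main__":
    members = parse_members(SAMPLE)
    for nodeid in sorted(members):
        fields = members[nodeid]
        print(nodeid, fields.get("ip"), fields.get("join_count"))
    print("joined more than once:", repeated_joiners(members))
```

With the sample above, nodes 1 and 3 are reported (join_count 15) while
the local node 2 is not (join_count 1). On a live node the same
functions could be fed the real output of `corosync-cmapctl` in place
of SAMPLE.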