Re: Issue starting the CMAP service

Patrick,
do you have a reproducer for that issue? If so, we can try to find
out what the real problem is and fix it.

Regards,
  Honza

Patrick Hemmer wrote:
> It wasn't.
> 
> I still haven't fully tracked the issue down, but it was because of
> another node in the cluster. Node B which I had just started was trying
> to send traffic to node A. Node A was in a weird state. Node B would not
> start successfully until corosync on node A was restarted.
> 
> I've had this happen a few times now in the last few days. The ability
> for one node to cause a start failure on another node is a significant
> problem.
> 
> 
> 
> -Patrick
> 
> ------------------------------------------------------------------------
> *From: *Jan Friesse <jfriesse@xxxxxxxxxx>
> *Sent: * 2013-10-10 03:58:31 E
> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, Steven Dake
> <sdake@xxxxxxxxxx>
> *CC: *discuss@xxxxxxxxxxxx
> *Subject: *Re:  Issue starting the CMAP service
> 
>> Patrick,
>> I'm sure it's really a firewall/switch problem. Please make sure that port
>> and port - 1 are not blocked. For testing purposes, you can just
>> disable the firewall completely and see if corosync works or not.
>>
>> Regards,
>>   Honza
>>
>> Patrick Hemmer wrote:
>>> *From: *Steven Dake <sdake@xxxxxxxxxx>
>>> *Sent: * 2013-09-30 18:12:25 E
>>> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
>>> *CC: *discuss@xxxxxxxxxxxx
>>> *Subject: *Re:  Issue starting the CMAP service
>>>
>>>> On 09/30/2013 02:43 PM, Patrick Hemmer wrote:
>>>>> *From: *Steven Dake <sdake@xxxxxxxxxx>
>>>>> *Sent: * 2013-09-30 16:50:26 E
>>>>> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>
>>>>> *CC: *discuss@xxxxxxxxxxxx
>>>>> *Subject: *Re:  Issue starting the CMAP service
>>>>>
>>>>>> On 09/30/2013 01:45 PM, Patrick Hemmer wrote:
>>>>>>> I'm running corosync 2.3.2 on ubuntu precise. I'm playing with a 3
>>>>>>> node cluster, and whenever I try to start corosync on one of the
>>>>>>> nodes, it fails to start properly.
>>>>>>> I just do a simple start with `corosync -f`, and whenever I try to 
>>>>>>> use any of the tools, they error:
>>>>>>>
>>>>>>> # corosync-cmapctl
>>>>>>> Failed to initialize the cmap API. Error CS_ERR_TRY_AGAIN
>>>>>>> # corosync-quorumtool
>>>>>>> Cannot initialize CMAP service
>>>>>>>
>>>>>>> If I wait long enough (about 9 minutes or 530 seconds), it does end
>>>>>>> up starting, and the tools work, but corosync-quorumtool shows the
>>>>>>> only member is itself.
>>>>>>>
>>>>>>> However if I start corosync with `strace -f corosync -f` the tools
>>>>>>> work fine immediately upon start (though it still doesn't show the
>>>>>>> other nodes). Smells like a race condition, but dunno where to begin.
>>>>>>>
>>>>>>>
>>>>>> My guess is something is wrong with your network relating to
>>>>>> multicast.  Try using udpu mode - it is very stable now and removes
>>>>>> multicast from the list of things that can go wrong.
>>>>>>
>>>>> I am using udpu, see the config :-)
>>>>>
>>>>>
>>>> I assume you have the same config on all nodes?  If so, try using ip
>>>> addresses for the ring id.  possibly a DNS resolution problem?
>>>>
>>>> Other than that, I'm stumped.
>>> Yes, exact same config on all nodes. All hosts are present in
>>> /etc/hosts. Also when I do a tcpdump on the other nodes, I see traffic
>>> on port 5405 coming from the node in question.
>>>
>>>> Regards
>>>> -steve
>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>>> This is the output from `corosync -f` (this node is 10.20.0.212):
>>>>>>> notice  [TOTEM ] Initializing transport (UDP/IP Unicast).
>>>>>>> notice  [TOTEM ] Initializing transmit/receive security (NSS)
>>>>>>> crypto: none hash: none
>>>>>>> notice  [TOTEM ] The network interface [10.20.0.212] is now up.
>>>>>>> notice  [TOTEM ] adding new UDPU member {10.20.0.127}
>>>>>>> notice  [TOTEM ] adding new UDPU member {10.20.0.212}
>>>>>>> notice  [TOTEM ] adding new UDPU member {10.20.2.124}
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1122820) was formed.
>>>>>>> Members joined: 2
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.127:1122824) was formed.
>>>>>>> Members joined: 1 3
>>>>>>> ### here is where it pauses for almost 9 minutes ###
>>>>>>> error   [TOTEM ] FAILED TO RECEIVE
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1122876) was formed.
>>>>>>> Members left: 1 3
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1122936) was formed.
>>>>>>> Members
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123008) was formed.
>>>>>>> Members
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123064) was formed.
>>>>>>> Members
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123124) was formed.
>>>>>>> Members
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123180) was formed.
>>>>>>> Members
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.212:1123248) was formed.
>>>>>>> Members
>>>>>>> notice  [TOTEM ] A new membership (10.20.0.127:1123256) was formed.
>>>>>>> Members joined: 1 3
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> This is the config (created by `pcs` utility), it's exactly the
>>>>>>> same on all 3 nodes, and the other 2 nodes work fine:
>>>>>>> ----
>>>>>>> totem {
>>>>>>> version: 2
>>>>>>> secauth: off
>>>>>>> cluster_name: hapi-server
>>>>>>> transport: udpu
>>>>>>> }
>>>>>>>
>>>>>>> nodelist {
>>>>>>>   node {
>>>>>>>         ring0_addr: i-74eb9c2f
>>>>>>>         nodeid: 1
>>>>>>>        }
>>>>>>>   node {
>>>>>>>         ring0_addr: i-a3bf0df9
>>>>>>>         nodeid: 2
>>>>>>>        }
>>>>>>>   node {
>>>>>>>         ring0_addr: i-ebcfcbb0
>>>>>>>         nodeid: 3
>>>>>>>        }
>>>>>>> }
>>>>>>>
>>>>>>> quorum {
>>>>>>> provider: corosync_votequorum
>>>>>>> }
>>>>>>>
>>>>>>> logging {
>>>>>>> to_syslog: yes
>>>>>>> }
>>>>>>> ----
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -Patrick
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> discuss mailing list
>>>>>>> discuss@xxxxxxxxxxxx
>>>>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
>>>
>>> Here's some additional info from the command line utils after waiting 9
>>> minutes for it to come up:
>>>
>>> # corosync-quorumtool
>>> Quorum information
>>> ------------------
>>> Date:             Mon Sep 30 22:16:24 2013
>>> Quorum provider:  corosync_votequorum
>>> Nodes:            1
>>> Node ID:          2
>>> Ring ID:          1124320
>>> Quorate:          No
>>>
>>> Votequorum information
>>> ----------------------
>>> Expected votes:   3
>>> Highest expected: 3
>>> Total votes:      1
>>> Quorum:           2 Activity blocked
>>> Flags:           
>>>
>>> Membership information
>>> ----------------------
>>>     Nodeid      Votes Name
>>>          2          1 i-a3bf0df9 (local)
>>>
>>>
>>> # corosync-cmapctl |grep member
>>> runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
>>> runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
>>> runtime.totem.pg.mrp.srp.members.1.status (str) = joined
>>> runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
>>> runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
>>> runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
>>> runtime.totem.pg.mrp.srp.members.2.status (str) = joined
>>> runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
>>> runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
>>> runtime.totem.pg.mrp.srp.members.3.status (str) = joined
>>>
>>>
>>>
>>> -Patrick
> 
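On the "port and port - 1" advice quoted above: as far as I know, corosync.conf(5) says corosync uses two UDP ports per ring, mcastport for receives and mcastport - 1 for sends, with 5405 as the default. That would match the tcpdump observation of traffic on port 5405. A minimal Python sketch of that arithmetic, assuming the default applies here (the quoted config does not set mcastport); the function name `corosync_udp_ports` is made up for illustration and is not part of any corosync tool:

```python
def corosync_udp_ports(mcastport=5405):
    """Return the two UDP ports corosync is assumed to use for one ring.

    Assumption (taken from corosync.conf(5)): receives use mcastport and
    sends use mcastport - 1; 5405 is the documented default.
    """
    if not 2 <= mcastport <= 65535:
        raise ValueError("mcastport must be between 2 and 65535")
    return (mcastport - 1, mcastport)


if __name__ == "__main__":
    # With the default this prints ports 5404 and 5405.
    for port in corosync_udp_ports():
        print(f"UDP port {port} must be allowed between all cluster nodes")
```

Both ports need to pass in both directions between every pair of nodes, so the firewall rules on all three nodes (and any security groups in between) are worth checking, not just those on the node that fails to start.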
> 
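On Steve's question about DNS resolution: one way to rule it out without editing the config is to check what each ring0_addr name resolves to on every node. The sketch below uses Python's standard `socket.getaddrinfo`; the helper name `resolve_ipv4` is hypothetical, and restricting the lookup to IPv4 is an assumption based on the 10.20.x.x addresses in the log output.

```python
import socket


def resolve_ipv4(name):
    """Return the sorted IPv4 addresses `name` resolves to, or [] on failure.

    Uses the system resolver, so /etc/hosts entries are honoured in the
    same way they would be for any other program on the node.
    """
    try:
        infos = socket.getaddrinfo(
            name, None, family=socket.AF_INET, type=socket.SOCK_DGRAM
        )
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})
```

Running `resolve_ipv4("i-a3bf0df9")` on each of the three nodes should give the same single address everywhere; going by the output quoted above, that would be 10.20.0.212 for node 2. An empty list, or different answers on different nodes, would point at name resolution rather than at corosync.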

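One detail in the `corosync-cmapctl | grep member` output quoted above may be worth tracking while looking for a reproducer: members 1 and 3 show a join_count of 15, while the local node shows 1. That is at least consistent with the repeated "A new membership ... was formed" lines in the log, though I would not read more into it than that. A small Python sketch for pulling those fields out of saved cmapctl output follows; the regular expression only covers the key layout seen in this thread, and `parse_members` is a made-up helper name.

```python
import re

# Matches lines such as:
#   runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+) \((\w+)\) = (.*)$"
)


def parse_members(cmapctl_output):
    """Group per-member runtime keys by node id; values are kept as strings."""
    members = {}
    for line in cmapctl_output.splitlines():
        match = MEMBER_RE.match(line.strip())
        if match is None:
            continue  # not a member key, skip it
        nodeid, key, _value_type, value = match.groups()
        members.setdefault(int(nodeid), {})[key] = value
    return members


# Sample copied from the output quoted in this thread.
SAMPLE = """\
runtime.totem.pg.mrp.srp.members.1.ip (str) = r(0) ip(10.20.0.127)
runtime.totem.pg.mrp.srp.members.1.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.1.status (str) = joined
runtime.totem.pg.mrp.srp.members.2.config_version (u64) = 0
runtime.totem.pg.mrp.srp.members.2.ip (str) = r(0) ip(10.20.0.212)
runtime.totem.pg.mrp.srp.members.2.join_count (u32) = 1
runtime.totem.pg.mrp.srp.members.2.status (str) = joined
runtime.totem.pg.mrp.srp.members.3.ip (str) = r(0) ip(10.20.2.124)
runtime.totem.pg.mrp.srp.members.3.join_count (u32) = 15
runtime.totem.pg.mrp.srp.members.3.status (str) = joined
"""

if __name__ == "__main__":
    for nodeid, fields in sorted(parse_members(SAMPLE).items()):
        print(nodeid, fields.get("ip"), fields.get("status"), fields.get("join_count"))
```

Capturing this on all three nodes when the problem next occurs, together with the corosync log from node A, would give a before/after comparison to work from.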