Re: CPG reporting group member that doesn't exist

Jan Friesse <jfriesse@xxxxxxxxxx> · Mon, 10 Feb 2014 11:27:44 +0100

Patrick,
thanks for the files. I was able to find a logic bug in the
corosync source code. An easy way to reproduce the problem:
- Have 3 nodes running
- Pause one node
- On a different node, stop the cpg client and start it again
- Unpause the node
- The previously paused node will now have two entries for that client
  (one old, one new)
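
After the last step, the extra entry can be spotted by comparing the
corosync-cpgtool output of the paused node with that of any other node.
Below is a minimal sketch of such a comparison. It assumes the output
layout quoted later in this thread (a group name on its own line, followed
by "PID  Node ID" rows); the helper names parse_cpgtool and extra_members
are made up for illustration and are not part of corosync.

```python
from collections import defaultdict


def parse_cpgtool(text):
    """Parse corosync-cpgtool output into {group: {(pid, nodeid), ...}}.

    Assumes the layout seen in this thread: a header line, then a group
    name on its own line followed by rows of "PID  NodeID (address)".
    """
    groups = defaultdict(set)
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or line.startswith("Group Name"):
            continue
        # A member row starts with a numeric PID and has a node ID after it.
        if current is not None and len(fields) >= 2 and fields[0].isdigit():
            groups[current].add((int(fields[0]), int(fields[1])))
        else:
            current = fields[0]
    return groups


def extra_members(suspect, reference):
    """Return members that the suspect node lists but the reference does not."""
    result = {}
    for group, members in suspect.items():
        stale = members - reference.get(group, set())
        if stale:
            result[group] = sorted(stale)
    return result


# Sample data: the "hapi" group as reported in this thread.
PAUSED_NODE = """Group Name           PID       Node ID
hapi
              7036     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
             17837     169083092 (10.20.0.212)
"""
OTHER_NODE = """Group Name           PID       Node ID
hapi
             17837     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
"""

print(extra_members(parse_cpgtool(PAUSED_NODE), parse_cpgtool(OTHER_NODE)))
# prints {'hapi': [(7036, 169083092)]}
```

In practice the two strings would come from running corosync-cpgtool on
each node; how the output is collected is left out here.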

I will try to come up with a patch.

Thanks,
  Honza

Patrick Hemmer wrote:
> Ok, I've uploaded data to S3. Links below.
> There shouldn't have been any splits. We haven't had any network
> interruption that I am aware of.
> I bounced corosync on the 10.20.0.127 node and everything cleared up.
> 
> As this occurred in our development environment, there is a ton of
> background noise, so I'm unable to pinpoint exactly when the issue
> started. But I noticed it around 2014-02-07 01:00 GMT.
> 
> blackbox:
> https://s3.amazonaws.com/cloudcom-cliff-misc/corosync-blackbox.10.20.0.127.gz
> core:
> https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.core.26622.gz
> log:
> https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.log.gz
> 
> Thanks
> 
> -Patrick
> 
> ------------------------------------------------------------------------
> *From: *Jan Friesse <jfriesse@xxxxxxxxxx>
> *Sent: * 2014-02-07 03:24:36 E
> *To: *Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, discuss@xxxxxxxxxxxx
> *Subject: *Re:  CPG reporting group member that doesn't exist
> 
>> Patrick,
>> the blackbox may be useful. The log may also help us trace what happened.
>> This looks like some kind of problem when corosync nodes split and then
>> join again. In any case, it's weird and looks like a bug. Another helpful
>> thing would be a coredump of corosync from the affected node
>> (10.20.0.127) to rule out a memory corruption problem.
>>
>> Regards,
>>   Honza
>>
>>
>> Patrick Hemmer wrote:
>>> I've currently got a 3 node cluster with several processes on each box
>>> using CPG. CPG on one of the boxes is reporting a member of a group that
>>> isn't there.
>>>
>>> # 10.20.2.124 # corosync-cpgtool
>>> Group Name           PID       Node ID
>>> r53clip
>>>              17891     169083092 (10.20.0.212)
>>>              21792     169083516 (10.20.2.124)
>>> hapi
>>>              17837     169083092 (10.20.0.212)
>>>              21717     169083516 (10.20.2.124)
>>> arbiter
>>>              21590     169083007 (10.20.0.127)
>>>              31886     169083516 (10.20.2.124)
>>>               3137     169083092 (10.20.0.212)
>>>
>>>
>>> # 10.20.0.212 # corosync-cpgtool
>>> Group Name           PID       Node ID
>>> r53clip
>>>              17891     169083092 (10.20.0.212)
>>>              21792     169083516 (10.20.2.124)
>>> hapi
>>>              17837     169083092 (10.20.0.212)
>>>              21717     169083516 (10.20.2.124)
>>> arbiter
>>>              21590     169083007 (10.20.0.127)
>>>              31886     169083516 (10.20.2.124)
>>>               3137     169083092 (10.20.0.212)
>>>
>>>
>>> # 10.20.0.127 # corosync-cpgtool
>>> Group Name           PID       Node ID
>>> r53clip
>>>              17891     169083092 (10.20.0.212)
>>>              21792     169083516 (10.20.2.124)
>>> hapi
>>>               7036     169083092 (10.20.0.212)
>>>              21717     169083516 (10.20.2.124)
>>>              17837     169083092 (10.20.0.212)
>>> arbiter
>>>              21590     169083007 (10.20.0.127)
>>>              31886     169083516 (10.20.2.124)
>>>               3137     169083092 (10.20.0.212)
>>>
>>> Notice the first 2 nodes report the same info, but the third node is
>>> reporting PID 7036 on 169083092. Logging into that box, there is no such
>>> process running.
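
The node IDs in the listings above are consistent with the node's IPv4
address read as a 32-bit integer, which is what one would expect if no
nodeid is set explicitly in the configuration (that part is an assumption;
the configuration is not shown in this thread). A small sketch for decoding
them, using only the Python standard library; nodeid_to_ip is a made-up
helper name:

```python
import ipaddress


def nodeid_to_ip(nodeid):
    """Interpret a node ID as a 32-bit IPv4 address (assumes an auto-derived ID)."""
    return str(ipaddress.IPv4Address(nodeid))


for nodeid in (169083092, 169083516, 169083007):
    print(nodeid, nodeid_to_ip(nodeid))
# prints:
# 169083092 10.20.0.212
# 169083516 10.20.2.124
# 169083007 10.20.0.127
```

This confirms that the unexpected PID 7036 is attributed to 10.20.0.212,
matching the address corosync-cpgtool prints next to it.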
>>>
>>> I have a capture of the corosync-blackbox data from all 3 nodes. I can
>>> provide it if needed.
>>>
>>> corosync 2.3.2
>>> libqb 0.16.0
>>>
>>> I'll leave the nodes like this for a few hours if anyone responds and
>>> wants additional information. After that I'm going to bounce corosync to
>>> get everything running again.
>>>
>>> -Patrick
>>>
>>>
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
> 
> 
