Patrick,
thanks for the files. Actually, I was able to find a logical bug in the
corosync source code.

An easy way to reproduce the problem:
- 3 nodes running
- Pause one node
- On a different node, stop the cpg client and exec it again
- Unpause the node
- The paused node will now have two entries for that client (one old,
  one new)

I will try to come up with a patch. A minimal test client for the
reproduction is sketched below the quoted thread.

Thanks,
  Honza

Patrick Hemmer wrote:
> Ok, I've uploaded the data to S3. Links below.
> There shouldn't have been any splits. We haven't had any network
> interruption that I am aware of.
> I bounced corosync on the 10.20.0.127 node and everything cleared up.
>
> As this occurred in our development environment, there is a ton of
> background noise, so I'm unable to pinpoint exactly when the issue
> started. But I noticed it around 2014-02-07 01:00 GMT.
>
> blackbox:
> https://s3.amazonaws.com/cloudcom-cliff-misc/corosync-blackbox.10.20.0.127.gz
> core:
> https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.core.26622.gz
> log:
> https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.log.gz
>
> Thanks
>
> -Patrick
>
> ------------------------------------------------------------------------
> From: Jan Friesse <jfriesse@xxxxxxxxxx>
> Sent: 2014-02-07 03:24:36 E
> To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, discuss@xxxxxxxxxxxx
> Subject: Re: CPG reporting group member that doesn't exist
>
>> Patrick,
>> the blackbox may be useful. The log may also help us trace what
>> happened. This looks like some kind of problem when corosync nodes
>> split and then join again... anyway, it's weird and looks like a bug.
>> Another helpful thing would be a coredump of corosync from the
>> affected node (so 10.20.0.127), to make sure it is not a memory
>> corruption problem.
>>
>> Regards,
>>   Honza
>>
>> Patrick Hemmer wrote:
>>> I've currently got a 3-node cluster with several processes on each
>>> box using CPG. CPG on one of the boxes is reporting a member of a
>>> group that isn't there.
>>>
>>> # 10.20.2.124 # corosync-cpgtool
>>> Group Name            PID      Node ID
>>> r53clip
>>>                     17891    169083092 (10.20.0.212)
>>>                     21792    169083516 (10.20.2.124)
>>> hapi
>>>                     17837    169083092 (10.20.0.212)
>>>                     21717    169083516 (10.20.2.124)
>>> arbiter
>>>                     21590    169083007 (10.20.0.127)
>>>                     31886    169083516 (10.20.2.124)
>>>                      3137    169083092 (10.20.0.212)
>>>
>>> # 10.20.0.212 # corosync-cpgtool
>>> Group Name            PID      Node ID
>>> r53clip
>>>                     17891    169083092 (10.20.0.212)
>>>                     21792    169083516 (10.20.2.124)
>>> hapi
>>>                     17837    169083092 (10.20.0.212)
>>>                     21717    169083516 (10.20.2.124)
>>> arbiter
>>>                     21590    169083007 (10.20.0.127)
>>>                     31886    169083516 (10.20.2.124)
>>>                      3137    169083092 (10.20.0.212)
>>>
>>> # 10.20.0.127 # corosync-cpgtool
>>> Group Name            PID      Node ID
>>> r53clip
>>>                     17891    169083092 (10.20.0.212)
>>>                     21792    169083516 (10.20.2.124)
>>> hapi
>>>                      7036    169083092 (10.20.0.212)
>>>                     21717    169083516 (10.20.2.124)
>>>                     17837    169083092 (10.20.0.212)
>>> arbiter
>>>                     21590    169083007 (10.20.0.127)
>>>                     31886    169083516 (10.20.2.124)
>>>                      3137    169083092 (10.20.0.212)
>>>
>>> Notice that the first two nodes report the same info, but the third
>>> node is reporting PID 7036 on 169083092. Logging into that box,
>>> there is no such process running.
>>>
>>> I have a capture of the corosync-blackbox data from all 3 nodes. I
>>> can provide it if needed.
>>>
>>> corosync 2.3.2
>>> libqb 0.16.0
>>>
>>> I'll leave the nodes like this for a few hours in case anyone
>>> responds and wants additional information. After that I'm going to
>>> bounce corosync to get everything running again.
>>>
>>> -Patrick
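
For reference, the "cpg client" in the reproduction steps above can be
any process that joins a group and then sits in dispatch. Below is a
minimal sketch of such a client; the group name "test", the error
handling and the build line are only illustrative, not taken from
Patrick's setup (his groups are r53clip, hapi and arbiter).

/*
 * Minimal cpg test client: join a group, print membership changes,
 * then block in dispatch until killed.
 *
 * Build (illustrative): gcc -o cpgtest cpgtest.c -lcpg
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <corosync/cpg.h>

static void confchg_fn(cpg_handle_t handle,
    const struct cpg_name *group_name,
    const struct cpg_address *member_list, size_t member_list_entries,
    const struct cpg_address *left_list, size_t left_list_entries,
    const struct cpg_address *joined_list, size_t joined_list_entries)
{
    size_t i;

    printf("confchg for group %.*s, %zu members:\n",
        (int)group_name->length, group_name->value,
        member_list_entries);
    for (i = 0; i < member_list_entries; i++) {
        /* A duplicated (nodeid, pid) pair here is the symptom. */
        printf("  nodeid %u pid %u\n",
            member_list[i].nodeid, member_list[i].pid);
    }
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = NULL,
    .cpg_confchg_fn = confchg_fn,
};

int main(void)
{
    cpg_handle_t handle;
    struct cpg_name group;
    cs_error_t err;

    err = cpg_initialize(&handle, &callbacks);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_initialize failed: %d\n", err);
        exit(1);
    }

    /* "test" is an arbitrary group name for the reproduction. */
    strcpy(group.value, "test");
    group.length = strlen(group.value);

    err = cpg_join(handle, &group);
    if (err != CS_OK) {
        fprintf(stderr, "cpg_join failed: %d\n", err);
        exit(1);
    }

    /* Stop this process and exec it again for the reproduction. */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

    cpg_finalize(handle);
    return 0;
}

Run one instance on each node, pause one node (SIGSTOP/SIGCONT on its
corosync process is one way to simulate the pause), stop and re-exec
the client on a different node, unpause, and compare corosync-cpgtool
output across the nodes. The previously paused node should then list
both the old and the new pid for the restarted client.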
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss