Ok, I've uploaded data to S3. Links below.

There shouldn't have been any splits. We haven't had any network interruption that I am aware of. I bounced corosync on the 10.20.0.127 node and everything cleared up.

As this occurred in our development environment, there is a ton of background noise, so I'm unable to pinpoint exactly when the issue started. But I noticed it around 2014-02-07 01:00 GMT.

blackbox: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync-blackbox.10.20.0.127.gz
core: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.core.26622.gz
log: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.log.gz

Thanks

-Patrick

From: Jan Friesse <jfriesse@xxxxxxxxxx>
Sent: 2014-02-07 03:24:36 E
To: Patrick Hemmer <corosync@xxxxxxxxxxxxxxx>, discuss@xxxxxxxxxxxx
Subject: Re: CPG reporting group member that doesn't exist

Patrick,

The blackbox may be useful. The log may also help us trace what happened. This looks like some kind of problem when corosync nodes split and then join again... anyway, it's weird and looks like a bug.

Another helpful thing may be a coredump of corosync from the affected node (so 10.20.0.127) to ensure it is not a memory corruption problem.

Regards,
Honza

Patrick Hemmer wrote:

I've currently got a 3 node cluster with several processes on each box using CPG. CPG on one of the boxes is reporting a member of a group that isn't there.

# 10.20.2.124
# corosync-cpgtool
Group Name             PID         Node ID
r53clip
                     17891       169083092 (10.20.0.212)
                     21792       169083516 (10.20.2.124)
hapi
                     17837       169083092 (10.20.0.212)
                     21717       169083516 (10.20.2.124)
arbiter
                     21590       169083007 (10.20.0.127)
                     31886       169083516 (10.20.2.124)
                      3137       169083092 (10.20.0.212)

# 10.20.0.212
# corosync-cpgtool
Group Name             PID         Node ID
r53clip
                     17891       169083092 (10.20.0.212)
                     21792       169083516 (10.20.2.124)
hapi
                     17837       169083092 (10.20.0.212)
                     21717       169083516 (10.20.2.124)
arbiter
                     21590       169083007 (10.20.0.127)
                     31886       169083516 (10.20.2.124)
                      3137       169083092 (10.20.0.212)

# 10.20.0.127
# corosync-cpgtool
Group Name             PID         Node ID
r53clip
                     17891       169083092 (10.20.0.212)
                     21792       169083516 (10.20.2.124)
hapi
                      7036       169083092 (10.20.0.212)
                     21717       169083516 (10.20.2.124)
                     17837       169083092 (10.20.0.212)
arbiter
                     21590       169083007 (10.20.0.127)
                     31886       169083516 (10.20.2.124)
                      3137       169083092 (10.20.0.212)

Notice the first 2 nodes report the same info, but the third node is reporting PID 7036 on 169083092. Logging into that box, there is no such process running.

I have a capture of the corosync-blackbox data from all 3 nodes. Can provide if needed.

corosync 2.3.2
libqb 0.16.0

I'll leave the nodes like this for a few hours if anyone responds and wants additional information. After that I'm going to bounce corosync to get everything running again.

-Patrick

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
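
For anyone comparing views like the three corosync-cpgtool listings quoted in this thread, here is a minimal Python sketch of the comparison done by eye there. The membership data is copied by hand from those listings; the names `views`, `nodeid_to_ip` and `disputed_members` are made up for illustration and are not part of corosync. It also assumes the node ID is simply the IPv4 address read as a 32-bit integer, which matches the addresses cpgtool prints in parentheses in this thread but may not hold if node IDs are set explicitly in the configuration.

```python
import ipaddress

# Group membership as each node reported it, hand-copied from the listings
# in this thread: group name -> set of (pid, nodeid).
COMMON = {
    "r53clip": {(17891, 169083092), (21792, 169083516)},
    "hapi": {(17837, 169083092), (21717, 169083516)},
    "arbiter": {(21590, 169083007), (31886, 169083516), (3137, 169083092)},
}

# Reporting node -> its view. Only 10.20.0.127 has the extra hapi member.
views = {
    "10.20.2.124": COMMON,
    "10.20.0.212": COMMON,
    "10.20.0.127": {**COMMON, "hapi": COMMON["hapi"] | {(7036, 169083092)}},
}


def nodeid_to_ip(nodeid):
    """Decode a node ID as an IPv4 address (assumption: ID is derived from the IP)."""
    return str(ipaddress.IPv4Address(nodeid))


def disputed_members(views):
    """Return {(group, pid, nodeid): [reporters]} for members not seen by every node."""
    seen = {}
    for reporter, groups in views.items():
        for group, members in groups.items():
            for pid, nodeid in members:
                seen.setdefault((group, pid, nodeid), set()).add(reporter)
    return {key: sorted(who) for key, who in seen.items() if len(who) < len(views)}


for (group, pid, nodeid), reporters in disputed_members(views).items():
    print(f"{group}: PID {pid} on {nodeid_to_ip(nodeid)} "
          f"reported only by {', '.join(reporters)}")
```

With the data above this prints a single line, `hapi: PID 7036 on 10.20.0.212 reported only by 10.20.0.127`, which is the stale member described in the original report.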