Re: CPG reporting group member that doesn't exist

Ok, I've uploaded data to S3. Links below.
There shouldn't have been any splits. We haven't had any network interruption that I am aware of.
I bounced corosync on the 10.20.0.127 node and everything cleared up.

As this occurred in our development environment, there is a ton of background noise, so I'm unable to pinpoint exactly when the issue started. But I noticed it around 2014-02-07 01:00 GMT.

blackbox: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync-blackbox.10.20.0.127.gz
core: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.core.26622.gz
log: https://s3.amazonaws.com/cloudcom-cliff-misc/corosync.10.20.0.127.log.gz

Thanks

-Patrick


From: Jan Friesse <jfriesse@xxxxxxxxxx>
Sent: 2014-02-07 03:24:36 E
Subject: Re: CPG reporting group member that doesn't exist

Patrick,
The blackbox may be useful, and the log may also help us trace what
happened. This looks like some kind of problem when corosync nodes split
and then join again. Anyway, it's weird and looks like a bug. A coredump
of corosync from the affected node (10.20.0.127) would also help, to
make sure it is not a memory corruption problem.

Regards,
  Honza


Patrick Hemmer wrote:
I've currently got a 3 node cluster with several processes on each box
using CPG. CPG on one of the boxes is reporting a member of a group that
isn't there.

# 10.20.2.124 # corosync-cpgtool
Group Name           PID       Node ID
r53clip
             17891     169083092 (10.20.0.212)
             21792     169083516 (10.20.2.124)
hapi
             17837     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
arbiter
             21590     169083007 (10.20.0.127)
             31886     169083516 (10.20.2.124)
              3137     169083092 (10.20.0.212)


# 10.20.0.212 # corosync-cpgtool
Group Name           PID       Node ID
r53clip
             17891     169083092 (10.20.0.212)
             21792     169083516 (10.20.2.124)
hapi
             17837     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
arbiter
             21590     169083007 (10.20.0.127)
             31886     169083516 (10.20.2.124)
              3137     169083092 (10.20.0.212)


# 10.20.0.127 # corosync-cpgtool
Group Name           PID       Node ID
r53clip
             17891     169083092 (10.20.0.212)
             21792     169083516 (10.20.2.124)
hapi
              7036     169083092 (10.20.0.212)
             21717     169083516 (10.20.2.124)
             17837     169083092 (10.20.0.212)
arbiter
             21590     169083007 (10.20.0.127)
             31886     169083516 (10.20.2.124)
              3137     169083092 (10.20.0.212)

Notice the first 2 nodes report the same info, but the third node is
reporting PID 7036 on 169083092. Logging into that box, there is no such
process running.
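
For anyone hitting something similar, the comparison above can be automated. What follows is only a rough sketch, not anything shipped with corosync: it assumes the corosync-cpgtool output has been saved as text per node and that it looks like the listings pasted above (group name at column 0, members indented as "PID  NodeID (IP)"). The function names parse_cpgtool and odd_members are made up for this example.

```python
def parse_cpgtool(text):
    """Parse saved corosync-cpgtool output into a set of (group, pid, nodeid).

    Assumes the layout shown in the listings above: a header line, group
    names starting at column 0, and indented member lines.
    """
    members = set()
    group = None
    for line in text.splitlines():
        if not line.strip() or line.startswith(("Group Name", "#")):
            continue  # skip blanks, the header, and pasted shell prompts
        if not line[0].isspace():
            group = line.strip()  # unindented line: a new group begins
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            members.add((group, int(fields[0]), int(fields[1])))
    return members


def odd_members(outputs):
    """Return {node: members that only this node reports}."""
    parsed = {node: parse_cpgtool(text) for node, text in outputs.items()}
    result = {}
    for node, mine in parsed.items():
        others = set()
        for other, theirs in parsed.items():
            if other != node:
                others |= theirs
        extra = mine - others
        if extra:
            result[node] = extra
    return result


# Excerpt of the "hapi" group from the listings above.
HEADER = "Group Name           PID       Node ID"
GOOD = "\n".join([
    HEADER,
    "hapi",
    "             17837     169083092 (10.20.0.212)",
    "             21717     169083516 (10.20.2.124)",
])
BAD = "\n".join([
    HEADER,
    "hapi",
    "              7036     169083092 (10.20.0.212)",
    "             21717     169083516 (10.20.2.124)",
    "             17837     169083092 (10.20.0.212)",
])

if __name__ == "__main__":
    views = {"10.20.2.124": GOOD, "10.20.0.212": GOOD, "10.20.0.127": BAD}
    for node, extra in odd_members(views).items():
        for group, pid, nodeid in sorted(extra):
            print(f"{node} alone reports {group} pid={pid} nodeid={nodeid}")
```

This only flags entries that a single node reports; it cannot tell whether the extra entry is stale or whether the other nodes are missing a real member, so the PID still has to be checked on the owning box.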

I have a capture of the corosync-blackbox data from all 3 nodes. I can
provide it if needed.

corosync 2.3.2
libqb 0.16.0

I'll leave the nodes like this for a few hours if anyone responds and
wants additional information. After that I'm going to bounce corosync to
get everything running again.

-Patrick



_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


    
