Re: CPG reporting group member that doesn't exist

Patrick,
the blackbox data may be useful, and the logs may also help us trace what
happened. This looks like some kind of problem that occurs when corosync
nodes split and then rejoin. Either way, it's weird and looks like a bug.
A coredump of corosync from the affected node (10.20.0.127) would also
help, so that we can rule out memory corruption.
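
In the meantime, to see exactly which member entries differ between
nodes, the corosync-cpgtool listings quoted below could be diffed with
something like the following. This is only a rough sketch: the parser
assumes the output layout shown in this thread (a group name on its own
line, followed by "PID  NodeID (IP)" rows), and the function names
parse_cpgtool and odd_members are made up for illustration. They are not
part of corosync.

```python
import re
from collections import defaultdict

# Member rows in the listings look like: "17891     169083092 (10.20.0.212)"
# (layout assumed from the output quoted in this thread).
MEMBER_RE = re.compile(r"^\s*(\d+)\s+(\d+)\s+\((\S+)\)\s*$")


def parse_cpgtool(text):
    """Parse one listing into {group: {(pid, nodeid, ip), ...}}."""
    groups = defaultdict(set)
    current = None
    for raw in text.splitlines():
        # Tolerate mail quoting ("> ") and surrounding whitespace.
        line = raw.lstrip("> ").rstrip()
        if not line or line.startswith("Group Name") or line.startswith("#"):
            continue
        match = MEMBER_RE.match(line)
        if match:
            if current is None:
                continue  # member row before any group name; skip it
            pid, nodeid, ip = match.groups()
            groups[current].add((int(pid), int(nodeid), ip))
        else:
            # Anything that is not a member row is treated as a group name.
            current = line.strip()
    return dict(groups)


def odd_members(listings):
    """Return entries that are not reported by every node.

    listings maps a node label to that node's listing text. The result
    maps (group, (pid, nodeid, ip)) to the sorted labels reporting it.
    """
    parsed = {node: parse_cpgtool(text) for node, text in listings.items()}
    seen = defaultdict(set)
    for node, groups in parsed.items():
        for group, members in groups.items():
            for member in members:
                seen[(group, member)].add(node)
    return {
        key: sorted(nodes)
        for key, nodes in seen.items()
        if len(nodes) < len(parsed)
    }
```

Feeding it the three listings, keyed by the address of the node each was
taken on, should leave only the entries on which the nodes disagree, such
as the PID 7036 row in group hapi.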

Regards,
  Honza


Patrick Hemmer wrote:
> I've currently got a 3 node cluster with several processes on each box
> using CPG. CPG on one of the boxes is reporting a member of a group that
> isn't there.
> 
> # 10.20.2.124 # corosync-cpgtool
> Group Name           PID       Node ID
> r53clip
>              17891     169083092 (10.20.0.212)
>              21792     169083516 (10.20.2.124)
> hapi
>              17837     169083092 (10.20.0.212)
>              21717     169083516 (10.20.2.124)
> arbiter
>              21590     169083007 (10.20.0.127)
>              31886     169083516 (10.20.2.124)
>               3137     169083092 (10.20.0.212)
> 
> 
> # 10.20.0.212 # corosync-cpgtool
> Group Name           PID       Node ID
> r53clip
>              17891     169083092 (10.20.0.212)
>              21792     169083516 (10.20.2.124)
> hapi
>              17837     169083092 (10.20.0.212)
>              21717     169083516 (10.20.2.124)
> arbiter
>              21590     169083007 (10.20.0.127)
>              31886     169083516 (10.20.2.124)
>               3137     169083092 (10.20.0.212)
> 
> 
> # 10.20.0.127 # corosync-cpgtool
> Group Name           PID       Node ID
> r53clip
>              17891     169083092 (10.20.0.212)
>              21792     169083516 (10.20.2.124)
> hapi
>               7036     169083092 (10.20.0.212)
>              21717     169083516 (10.20.2.124)
>              17837     169083092 (10.20.0.212)
> arbiter
>              21590     169083007 (10.20.0.127)
>              31886     169083516 (10.20.2.124)
>               3137     169083092 (10.20.0.212)
> 
> Notice the first 2 nodes report the same info, but the third node is
> reporting PID 7036 on 169083092. Logging into that box, there is no such
> process running.
> 
> I have a capture of the corosync-blackbox data from all 3 nodes. I can
> provide it if needed.
> 
> corosync 2.3.2
> libqb 0.16.0
> 
> I'll leave the nodes like this for a few hours in case anyone responds
> and wants additional information. After that I'm going to bounce
> corosync to get everything running again.
> 
> -Patrick
> 
> 
> 
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
> 
