Joined and failed lists seem to be wrong after a node failure in a large ring

Hi,
 
I created a 32-node Corosync ring and ran some node failure and recovery tests. I observed the following:
 
When I fail one or more nodes, Corosync seems to report more node failures than the actual number of failed nodes, and as a result the configuration change callbacks are invoked with extra, incorrect entries. For example, when I reset just one node, I got the following logs, which indicate that 31 nodes left the ring:
 
Jun 07 15:37:26 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 07 15:37:30 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 15:37:30 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:31) --> it says 31 nodes left together.
 
This happens more often on a busier system, and more often when more nodes are reset at the same time.
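
In case it helps to show exactly which callbacks I mean, here is a minimal sketch of the kind of CPG client we run (the group name "test_group", the file name and the log format are placeholders, not our real application). It joins a process group and prints the joined_list and left_list passed to the configuration change callback; it is this left_list that carries 31 entries when only one node is reset.

/*
 * cpg_monitor.c - minimal CPG membership monitor (sketch).
 * Assumes the standard corosync libcpg API from <corosync/cpg.h>.
 */
#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

/* Called by libcpg whenever the process group membership changes. */
static void confchg_cb(cpg_handle_t handle,
                       const struct cpg_name *group_name,
                       const struct cpg_address *member_list, size_t member_list_entries,
                       const struct cpg_address *left_list, size_t left_list_entries,
                       const struct cpg_address *joined_list, size_t joined_list_entries)
{
    size_t i;

    printf("confchg: members=%zu joined=%zu left=%zu\n",
           member_list_entries, joined_list_entries, left_list_entries);

    for (i = 0; i < left_list_entries; i++)
        printf("  left:   nodeid=%u pid=%u reason=%u\n",
               left_list[i].nodeid, left_list[i].pid, left_list[i].reason);

    for (i = 0; i < joined_list_entries; i++)
        printf("  joined: nodeid=%u pid=%u reason=%u\n",
               joined_list[i].nodeid, joined_list[i].pid, joined_list[i].reason);
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = NULL,        /* no application messages in this sketch */
    .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
    cpg_handle_t handle;
    struct cpg_name group;

    strcpy(group.value, "test_group");   /* placeholder group name */
    group.length = strlen(group.value);

    if (cpg_initialize(&handle, &callbacks) != CS_OK) {
        fprintf(stderr, "cpg_initialize failed\n");
        return 1;
    }
    if (cpg_join(handle, &group) != CS_OK) {
        fprintf(stderr, "cpg_join failed\n");
        return 1;
    }

    /* Block here and let libcpg invoke the callbacks as events arrive. */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

    cpg_finalize(handle);
    return 0;
}

Built with something like gcc cpg_monitor.c -lcpg and left running on a member node, this prints one confchg block per membership change, which is how we see the oversized left list.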
 
Note that we are using the default configuration values suggested by Corosync.
 
Could this be a bug or a system configuration problem?
 
Thanks very much for the help.
 
Qiuping Li
 
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
