Joined and failed lists seem to be wrong after a node failure in a large ring

Hi,
 
I created a 32-node Corosync ring and ran some node failure and recovery tests. I observed the following:
 
When I fail one or more nodes, Corosync seems to report more node failures than the actual number of failed nodes, and as a result the configuration change callbacks are invoked with extra, incorrect entries. For example, when I reset just one node, I got the following logs, which indicate that 31 nodes left the ring:
 
Jun 07 15:37:26 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 07 15:37:30 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 15:37:30 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:31) --> it says 31 nodes left together.
 
This happens more often on a busier system, and more often when more nodes are reset at the same time.
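
In case it helps to show exactly which callbacks I mean, here is a minimal sketch of the kind of CPG client we run (the group name "test_group", the file name and the log format are placeholders, not our real application). It joins a process group and prints the joined_list and left_list passed to the configuration change callback; it is this left_list that carries 31 entries when only one node is reset.

/*
 * cpg_monitor.c - minimal CPG membership monitor (sketch).
 * Assumes the standard corosync libcpg API from <corosync/cpg.h>.
 */
#include <stdio.h>
#include <string.h>
#include <corosync/cpg.h>

/* Called by libcpg whenever the process group membership changes. */
static void confchg_cb(cpg_handle_t handle,
                       const struct cpg_name *group_name,
                       const struct cpg_address *member_list, size_t member_list_entries,
                       const struct cpg_address *left_list, size_t left_list_entries,
                       const struct cpg_address *joined_list, size_t joined_list_entries)
{
    size_t i;

    printf("confchg: members=%zu joined=%zu left=%zu\n",
           member_list_entries, joined_list_entries, left_list_entries);

    for (i = 0; i < left_list_entries; i++)
        printf("  left:   nodeid=%u pid=%u reason=%u\n",
               left_list[i].nodeid, left_list[i].pid, left_list[i].reason);

    for (i = 0; i < joined_list_entries; i++)
        printf("  joined: nodeid=%u pid=%u reason=%u\n",
               joined_list[i].nodeid, joined_list[i].pid, joined_list[i].reason);
}

static cpg_callbacks_t callbacks = {
    .cpg_deliver_fn = NULL,        /* no application messages in this sketch */
    .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
    cpg_handle_t handle;
    struct cpg_name group;

    strcpy(group.value, "test_group");   /* placeholder group name */
    group.length = strlen(group.value);

    if (cpg_initialize(&handle, &callbacks) != CS_OK) {
        fprintf(stderr, "cpg_initialize failed\n");
        return 1;
    }
    if (cpg_join(handle, &group) != CS_OK) {
        fprintf(stderr, "cpg_join failed\n");
        return 1;
    }

    /* Block here and let libcpg invoke the callbacks as events arrive. */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

    cpg_finalize(handle);
    return 0;
}

Built with something like gcc cpg_monitor.c -lcpg and left running on a member node, this prints one confchg block per membership change, which is how we see the oversized left list.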
 
Note that we are using the default configuration values suggested by Corosync.
 
Could this be a bug or a system configuration problem?
 
Thanks very much for the help.
 
Qiuping Li
 
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
