Re: joined and failed list seem to be wrong after a node failure in a large ring

Jan Friesse <jfriesse@xxxxxxxxxx> · Wed, 20 Jun 2012 14:57:59 +0200

Hi,
which version are you using? If it's flatiron, could you please try
applying the three patches "flatiron cpg: *"? Could you also send a log
with debug information enabled?
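
For reference, debug logging is normally switched on in the logging section of corosync.conf. The fragment below is only a minimal sketch: the log file path is an assumption, so adjust it to your installation and keep any other logging options you already have.

```
logging {
        # Write log messages to a file as well
        to_logfile: yes
        # Assumed location; use whatever path suits your system
        logfile: /var/log/corosync/corosync.log
        # Enable debug-level messages
        debug: on
        timestamp: on
}
```

Corosync has to be restarted on the node for the change to take effect.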

Regards,
  Honza

Li, Qiuping (Qiuping) wrote:
Hi,

I have created a 32-node Corosync ring and run some node failure and recovery tests. I observed the following:

When I fail one or more nodes, corosync seems to report more node failures than the actual number of failed nodes, and as a result extra, incorrect configuration callback functions are called. For example, when I reset just one node, I got the following logs, which indicate that 31 nodes left the ring:

Jun 07 15:37:26 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 07 15:37:30 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 07 15:37:30 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:31)  --> it says 31 nodes left together

This happens more often on a busier system, and more often when more nodes are reset at the same time.
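
To spot affected runs in a large log, the counts in the "chosen downlist" line can be pulled out with a short script. This is only a rough sketch: the helper names `parse_downlist` and `is_suspicious` are made up for illustration, and the pattern assumes the log line has exactly the format shown above.

```python
import re

# Matches the membership counts in a CPG "chosen downlist" log line.
DOWNLIST_RE = re.compile(r"chosen downlist:.*members\(old:(\d+) left:(\d+)\)")


def parse_downlist(line):
    """Return (old, left) counts from a downlist line, or None if absent."""
    m = DOWNLIST_RE.search(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def is_suspicious(line, expected_failed):
    """True when the log reports more departed nodes than were actually reset."""
    counts = parse_downlist(line)
    return counts is not None and counts[1] > expected_failed


sample = ("Jun 07 15:37:30 corosync [CPG   ] chosen downlist: sender r(0) "
          "ip(169.254.0.1) ; members(old:32 left:31)")
print(parse_downlist(sample))
print(is_suspicious(sample, expected_failed=1))
```

With the line from the reset of a single node, the script reports 32 old members and 31 departed, and flags it because only one failure was expected.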

Note that we are using the default configuration suggested by Corosync throughout.

Could this be a bug or a system configuration problem?

Thanks very much for the help.

Qiuping Li

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
