Re: joined and failed list seem to be wrong after a node failure in a large ring

Hi,

Thanks for the information. I am using CentOS 6.2. I don't know what flatiron is.

I did not have much time to repeat the tests with the debug option turned on, but here are the default-level logs captured when I reset one node.
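When I get a chance to rerun the tests, I plan to enable debug output in corosync.conf with something like the following (please correct me if I have the options wrong):

logging {
        to_logfile: yes
        logfile: /var/log/cluster/corosync.log
        debug: on
}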

From the log, we can see that corosync thinks more than one node failed, even though I only failed one node.

I only captured logs from one node. It looks to me as if this (surviving) node did not receive the join messages from all of the other nodes before the consensus timer expired, so it retried with a smaller list and reached consensus on a much smaller group. The remaining join messages arrived later, after which it quickly went through a couple more rounds of consensus and finally reached a configuration that included everyone.
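For reference, the joined and failed lists I am talking about are the ones our application receives in its cpg configuration-change callback. A minimal sketch of such a callback is below (simplified, built against libcpg, with a placeholder group name; it is not our actual code, just to show where we observe the extra "left" entries):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <corosync/cpg.h>

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group_name,
        const struct cpg_address *member_list, size_t member_list_entries,
        const struct cpg_address *left_list, size_t left_list_entries,
        const struct cpg_address *joined_list, size_t joined_list_entries)
{
    size_t i;

    /* This is where the joined/left lists show up after a membership change. */
    printf("confchg: members=%zu joined=%zu left=%zu\n",
           member_list_entries, joined_list_entries, left_list_entries);
    for (i = 0; i < left_list_entries; i++)
        printf("  left: nodeid=%u pid=%u reason=%u\n",
               left_list[i].nodeid, left_list[i].pid, left_list[i].reason);
}

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group_name,
        uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
    /* message delivery is not relevant to the membership issue */
}

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
    };
    struct cpg_name group;

    /* "testgrp" is just a placeholder group name */
    strcpy(group.value, "testgrp");
    group.length = strlen(group.value);

    if (cpg_initialize(&handle, &callbacks) != CS_OK)
        return 1;
    if (cpg_join(handle, &group) != CS_OK)
        return 1;

    /* block and run callbacks as configuration changes arrive */
    cpg_dispatch(handle, CS_DISPATCH_BLOCKING);

    cpg_finalize(handle);
    return 0;
}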

I am using the default timers, i.e. token = 1000 ms and consensus = 1200 ms.

Should I increase those two timers? Any suggestions on typical values for a 32-node ring?
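For example, is something along these lines reasonable, or overkill? (The numbers are only a guess on my part; I understand consensus should stay above 1.2 * token.)

totem {
        version: 2
        # longer token timeout to ride out load spikes (default is 1000 ms)
        token: 3000
        # keep consensus greater than 1.2 * token (it defaults to 1.2 * token)
        consensus: 3600
}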

Thanks,

Qiuping

[root]# corosync-objctl|grep status
runtime.totem.pg.mrp.srp.members.1.status=joined
runtime.totem.pg.mrp.srp.members.4113.status=joined
runtime.totem.pg.mrp.srp.members.4129.status=joined
runtime.totem.pg.mrp.srp.members.4145.status=left
runtime.totem.pg.mrp.srp.members.4161.status=left
runtime.totem.pg.mrp.srp.members.8193.status=joined
runtime.totem.pg.mrp.srp.members.8209.status=joined
runtime.totem.pg.mrp.srp.members.8225.status=joined
runtime.totem.pg.mrp.srp.members.8241.status=left
runtime.totem.pg.mrp.srp.members.8257.status=joined
runtime.totem.pg.mrp.srp.members.12289.status=left
runtime.totem.pg.mrp.srp.members.12305.status=left
runtime.totem.pg.mrp.srp.members.12321.status=left
runtime.totem.pg.mrp.srp.members.12337.status=left
runtime.totem.pg.mrp.srp.members.12353.status=left
runtime.totem.pg.mrp.srp.members.12369.status=left
runtime.totem.pg.mrp.srp.members.12385.status=left
runtime.totem.pg.mrp.srp.members.12401.status=joined
runtime.totem.pg.mrp.srp.members.12417.status=left
runtime.totem.pg.mrp.srp.members.12433.status=joined
runtime.totem.pg.mrp.srp.members.12449.status=joined
runtime.totem.pg.mrp.srp.members.12465.status=joined
runtime.totem.pg.mrp.srp.members.12481.status=joined
runtime.totem.pg.mrp.srp.members.12497.status=joined
runtime.totem.pg.mrp.srp.members.12513.status=joined
runtime.totem.pg.mrp.srp.members.12529.status=joined
runtime.totem.pg.mrp.srp.members.12545.status=joined
runtime.totem.pg.mrp.srp.members.12561.status=joined
runtime.totem.pg.mrp.srp.members.12577.status=joined
runtime.totem.pg.mrp.srp.members.12593.status=joined
runtime.totem.pg.mrp.srp.members.12609.status=joined
runtime.totem.pg.mrp.srp.members.4097.status=left
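
Counting the entries above, 20 of the 32 members show status=joined and 12 show status=left, even though only one node was actually reset (this matches the "left:12" in the log below). A quick way to count them is just:

[root]# corosync-objctl|grep -c "status=left"
12
[root]# corosync-objctl|grep -c "status=joined"
20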

 

Corosync.log contains the following information:

Jun 21 08:04:30 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 21 08:04:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:04:33 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:04:33 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:12)
Jun 21 08:04:33 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 08:04:35 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:04:35 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:21 left:1)
Jun 21 08:04:35 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 08:04:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:04:37 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:20 left:0)
Jun 21 08:04:37 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 08:05:28 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:05:29 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:31 left:0)
Jun 21 08:05:29 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 08:06:32 corosync [TOTEM ] A processor failed, forming new configuration.
Jun 21 08:06:35 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:06:36 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:06:37 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:12)
Jun 21 08:06:37 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 08:06:38 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:06:38 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:20 left:0)
Jun 21 08:06:38 corosync [MAIN  ] Completed service synchronization, ready to provide service.
Jun 21 08:07:30 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jun 21 08:07:31 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:31 left:0)
Jun 21 08:07:31 corosync [MAIN  ] Completed service synchronization, ready to provide service.


-----Original Message-----
From: Jan Friesse [mailto:jfriesse@xxxxxxxxxx] 
Sent: Wednesday, June 20, 2012 8:58 AM
To: Li, Qiuping (Qiuping)
Cc: discuss@xxxxxxxxxxxx
Subject: Re:  joined and failed list seem to be wrong after a node failure in a large ring

Hi,
what version are you using? If it's flatiron, can you please try applying the
three patches "flatiron cpg: *"? Can you please also try to send a log with
debug information enabled?

Regards,
   Honza

Li, Qiuping (Qiuping) wrote:
> Hi,
>
> I have created a 32 node Corosync ring and did some node failure and recovery tests. I observed the following:
>
> When I fail one or multiple nodes, corosync seems to report more node failures than the actual number of failed nodes, and therefore extra and wrong configuration callback functions are called. For example, when I reset just one node, I got the following logs, which indicate that 31 nodes left the ring:
>
> Jun 07 15:37:26 corosync [TOTEM ] A processor failed, forming new configuration.
> Jun 07 15:37:30 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Jun 07 15:37:30 corosync [CPG   ] chosen downlist: sender r(0) ip(169.254.0.1) ; members(old:32 left:31) -->  it says 31 nodes left together,
>
> This happens more often on a busier system, and more often when more nodes are reset at the same time.
>
> Note that we are using all default configurations suggested by Corosync.
>
> Could this be a bug or a system configuration problem?
>
> Thanks very much for the help.
>
> Qiuping Li
>
>
>
>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss


