I found that corosync may trigger the token lost timeout in the RECOVERY state. The reason is that different nodes may spend different amounts of time copying old ring messages from their regular_sort_queue to retrans_message_queue in the *memb_state_recovery_enter* function.

Let's have a look at the following logs (ignore the line numbers, I have modified the source code for debugging):

Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2034 my_memb:5, new_memb:5, trans_memb:5
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2037 TRANS [0] member 192.168.0.92:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2037 TRANS [1] member 192.168.0.124:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2037 TRANS [2] member 192.168.0.142:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2037 TRANS [3] member 192.168.0.160:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2037 TRANS [4] member 192.168.0.164:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2041 position [0] member 192.168.0.92:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2045 previous ring seq 1161924 rep 192.168.0.92
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2051 aru 46695 high delivered 46695 received flag 1
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2041 position [1] member 192.168.0.124:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2045 previous ring seq 1161924 rep 192.168.0.92
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2051 aru 46695 high delivered 46695 received flag 1
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2041 position [2] member 192.168.0.142:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2045 previous ring seq 1161924 rep 192.168.0.92
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2051 aru 42725 high delivered 42725 received flag 0
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2041 position [3] member 192.168.0.160:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2045 previous ring seq 1161924 rep 192.168.0.92
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2051 aru 46695 high delivered 46695 received flag 1
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2041 position [4] member 192.168.0.164:
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2045 previous ring seq 1161924 rep 192.168.0.92
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2051 aru 46695 high delivered 46695 received flag 1

We can see that five nodes were forming a new ring at 00:51:53. They had different aru values, and node 192.168.0.142 had the smallest one.

Here are some logs from these five nodes, showing how much time each of them spent copying old ring messages:

[logs from 192.168.0.92]
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2111 copying all old ring messages from 42726-46695.
Nov 30 00:52:06 corosync [TOTEM ] totemsrp.c:2145 Originated 16240 messages in RECOVERY.
==> Took 13 seconds to copy all old ring messages

[logs from 192.168.0.124]
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2111 copying all old ring messages from 42726-46695.
Nov 30 00:52:09 corosync [TOTEM ] totemsrp.c:2145 Originated 16240 messages in RECOVERY.
==> Took 16 seconds to copy all old ring messages

[logs from 192.168.0.142]
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2111 copying all old ring messages from 42726-46693.
Nov 30 00:51:56 corosync [TOTEM ] totemsrp.c:2145 Originated 2321 messages in RECOVERY.
==> Took 3 seconds to copy all old ring messages

[logs from 192.168.0.160]
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2111 copying all old ring messages from 42726-46695.
Nov 30 00:51:58 corosync [TOTEM ] totemsrp.c:2145 Originated 16240 messages in RECOVERY.
==> Took 5 seconds to copy all old ring messages

[logs from 192.168.0.164]
Nov 30 00:51:53 corosync [TOTEM ] totemsrp.c:2111 copying all old ring messages from 42726-46695.
Nov 30 00:52:05 corosync [TOTEM ] totemsrp.c:2145 Originated 16240 messages in RECOVERY.
==> Took 12 seconds to copy all old ring messages
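To make it clearer where the time goes: in *memb_state_recovery_enter* each node walks its regular_sort_queue and re-queues onto retrans_message_queue the old ring messages it still holds above the smallest aru of the new ring (42725, so the copy starts at 42726 in the logs above). The more messages a node holds in that window, and the more loaded the machine is, the longer it stays busy before it can forward the token. Here is a minimal sketch of that idea -- illustrative only, not the real totemsrp code, and with simplified data structures:

/*
 * Illustrative sketch only -- NOT the real totemsrp code.  It just shows
 * that the work done when entering RECOVERY is proportional to the number
 * of old ring messages a node still holds above low_ring_aru, which is
 * why the five nodes above finished at different times.
 */
#include <stdio.h>
#include <string.h>

#define WINDOW 4096   /* size of the sequence window we pretend to track */

struct node {
        unsigned int low_ring_aru;       /* lowest aru seen on the old ring */
        unsigned int high_seq_received;  /* highest seq this node received  */
        unsigned char have[WINDOW];      /* 1 = held in regular_sort_queue  */
};

/*
 * Copy every held message in (low_ring_aru, high_seq_received] to the
 * (conceptual) retrans_message_queue and return how many were copied.
 */
static unsigned int copy_old_ring_messages(struct node *n)
{
        unsigned int seq, copied = 0;

        for (seq = n->low_ring_aru + 1; seq <= n->high_seq_received; seq++) {
                if (!n->have[seq - n->low_ring_aru - 1])
                        continue;
                /* the real code encapsulates the message and queues it here */
                copied++;
        }
        return copied;
}

int main(void)
{
        struct node fast = { .low_ring_aru = 100, .high_seq_received = 150 };
        struct node slow = { .low_ring_aru = 100, .high_seq_received = 500 };

        memset(fast.have, 1, fast.high_seq_received - fast.low_ring_aru);
        memset(slow.have, 1, slow.high_seq_received - slow.low_ring_aru);

        printf("node with few old messages copies %u\n",
               copy_old_ring_messages(&fast));
        printf("node with many old messages copies %u\n",
               copy_old_ring_messages(&slow));
        return 0;
}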
After node 192.168.0.142 had spent 3 seconds copying all of its messages, it started waiting for the commit token from its upstream node (192.168.0.124), but it could not receive it because the upstream node was still busy copying messages from 00:51:56 until 00:52:09. So node 192.168.0.142 eventually triggered the token lost timeout in the RECOVERY state.

Worse, after triggering the token lost timeout, node 192.168.0.142 entered the GATHER state, but it could not receive JoinMSGs from the other nodes because they were also busy at that time, so this node finally hit the consensus expired timeout as well.

To solve this issue, we can cancel token_timeout (while keeping token_retransmit_timeout) in the RECOVERY state.

Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
---
 exec/totemsrp.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 5a78962..e0299e7 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -2206,7 +2206,7 @@ originated:
 	instance->my_install_seq = SEQNO_START_MSG;
 	instance->last_released = SEQNO_START_MSG;

-	reset_token_timeout (instance); // REVIEWED
+	cancel_token_timeout (instance);
 	reset_token_retransmit_timeout (instance); // REVIEWED

 	instance->memb_state = MEMB_STATE_RECOVERY;
--
1.7.7.3
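As a rough sanity check on the numbers above (the 1000 ms used below is just corosync's usual default token timeout -- substitute whatever token value is set in corosync.conf):

/*
 * Back-of-the-envelope check based only on the timestamps in the logs:
 * 192.168.0.142 finished copying at 00:51:56, but its upstream
 * 192.168.0.124 kept copying until 00:52:09, so 192.168.0.142 sits in
 * RECOVERY without seeing the commit token for at least 13 seconds.
 * Any token (loss) timeout shorter than that will fire.
 */
#include <stdio.h>

int main(void)
{
        const int my_copy_done_s       = 3;    /* 192.168.0.142, from the logs */
        const int upstream_copy_done_s = 16;   /* 192.168.0.124, from the logs */
        const int token_timeout_ms     = 1000; /* assumed default; check corosync.conf */

        int min_wait_ms = (upstream_copy_done_s - my_copy_done_s) * 1000;

        printf("minimum wait for the commit token: %d ms\n", min_wait_ms);
        printf("token lost timeout fires in RECOVERY: %s\n",
               min_wait_ms > token_timeout_ms ? "yes" : "no");
        return 0;
}

With the patch, the token loss timer is simply not armed while a node sits in RECOVERY copying messages, and token_retransmit_timeout is still reset as before, so a slow upstream node can no longer cause a spurious token loss during this phase.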