From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>

In our 20-node cluster testing (corosync v1.4.1 and sheepdog), corosync
exited abnormally when the consensus timeout expired and there were no
other processors in consensus_list.

= analysis =
1. When the consensus timeout expires, corosync enters
   memb_state_consensus_timeout_expired.
2. If consensus_list contains only my_id, the code calls memb_set_merge,
   which makes my_failed_list equal to my_proc_list.
3. memb_state_gather_enter is then called and multicasts a join message
   whose proc_list and failed_list contain the same processor IDs.
4. The processor receives its own join message, and
   memb_join_process/memb_consensus_agreed are called.
5. Because proc_list equals failed_list in the join message, the assert
   is reached and corosync exits abnormally.

= solution =
This patch resolves the issue by removing my_id from my_failed_list
before memb_state_consensus_timeout_expired calls
memb_state_gather_enter. When a network partition occurs and the
processor cannot communicate with any other processor, it forms a
single-member ring containing only itself instead of exiting abnormally.

Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
---
 exec/totemsrp.c |   33 +++++++++++++++++++++++++++++++++
 1 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/exec/totemsrp.c b/exec/totemsrp.c
index 0778d55..653f801 100644
--- a/exec/totemsrp.c
+++ b/exec/totemsrp.c
@@ -1324,6 +1324,33 @@ static void memb_set_merge (
 	return;
 }
 
+/*
+ * remove subset from fullset
+ */
+static void memb_set_remove(
+	const struct srp_addr *subset, int subset_entries,
+	struct srp_addr *fullset, int *fullset_entries)
+{
+	int found = 0;
+	int i;
+	int j;
+
+	for (i = 0; i < subset_entries; i++) {
+		for (j = 0; j < *fullset_entries; j++) {
+			if (srp_addr_equal (&fullset[j], &subset[i])) {
+				found = 1;
+				break;
+			}
+		}
+		if (found == 1) {
+			for (; j < (*fullset_entries-1); j++) {
+				srp_addr_copy (&fullset[j], &fullset[j+1]);
+			}
+			*fullset_entries = *fullset_entries - 1;
+		}
+	}
+}
+
 static void memb_set_and_with_ring_id (
 	struct srp_addr *set1,
 	struct memb_ring_id *set1_ring_ids,
@@ -1541,6 +1568,12 @@ static void memb_state_consensus_timeout_expired (
 
 		memb_set_merge (no_consensus_list, no_consensus_list_entries,
 			instance->my_failed_list, &instance->my_failed_list_entries);
+
+		if (instance->my_proc_list_entries == instance->my_failed_list_entries){
+			memb_set_remove (&instance->my_id, 1,
+				instance->my_failed_list, &instance->my_failed_list_entries);
+		}
+
 		memb_state_gather_enter (instance, 0);
 	}
 }
-- 
1.7.6.4
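
For readers who want to try the idea outside corosync, here is a minimal standalone sketch of the set-removal step and the isolated-node scenario from the analysis. It is not the corosync code. struct node_addr, set_remove, surviving_members and isolated_node_timeout are hypothetical names invented for this illustration, and a processor address is reduced to a single integer ID. One deliberate difference from the patch: the sketch resets the found flag for every subset entry. The patch initializes it once, which makes no difference for the single-entry subset it passes, but would matter if a later subset entry were absent from fullset.

```c
#include <assert.h>

/* Hypothetical stand-in for corosync's struct srp_addr: one integer ID. */
struct node_addr {
	unsigned int id;
};

static int node_addr_equal(const struct node_addr *a, const struct node_addr *b)
{
	return a->id == b->id;
}

/*
 * Remove every entry of subset from fullset, compacting fullset in place.
 * The found flag is reset for each subset entry, so an entry that is
 * absent from fullset leaves the count untouched.
 */
static void set_remove(const struct node_addr *subset, int subset_entries,
	struct node_addr *fullset, int *fullset_entries)
{
	int i;
	int j;

	assert(subset_entries >= 0 && *fullset_entries >= 0);

	for (i = 0; i < subset_entries; i++) {
		int found = 0;

		for (j = 0; j < *fullset_entries; j++) {
			if (node_addr_equal(&fullset[j], &subset[i])) {
				found = 1;
				break;
			}
		}
		if (found) {
			/* Shift the tail left over the removed slot. */
			for (; j < *fullset_entries - 1; j++) {
				fullset[j] = fullset[j + 1];
			}
			*fullset_entries -= 1;
		}
	}
}

/* Count the processors in proc_list that do not appear in failed_list. */
static int surviving_members(const struct node_addr *proc_list, int proc_entries,
	const struct node_addr *failed_list, int failed_entries)
{
	int i;
	int j;
	int count = 0;

	for (i = 0; i < proc_entries; i++) {
		int failed = 0;

		for (j = 0; j < failed_entries; j++) {
			if (node_addr_equal(&proc_list[i], &failed_list[j])) {
				failed = 1;
				break;
			}
		}
		if (!failed) {
			count++;
		}
	}
	return count;
}

/*
 * Model an isolated node (ID 1) after the merge step has made the failed
 * list equal to the proc list. Returns how many members would survive.
 */
static int isolated_node_timeout(int apply_fix)
{
	const struct node_addr my_id = { 1 };
	struct node_addr proc_list[3] = { { 1 }, { 2 }, { 3 } };
	struct node_addr failed_list[3] = { { 1 }, { 2 }, { 3 } };
	int proc_entries = 3;
	int failed_entries = 3;

	if (apply_fix && proc_entries == failed_entries) {
		set_remove(&my_id, 1, failed_list, &failed_entries);
	}
	return surviving_members(proc_list, proc_entries,
		failed_list, failed_entries);
}
```

Calling isolated_node_timeout(0) models the behaviour before the patch: no member survives, which corresponds to the condition that trips the assert in the analysis. Calling isolated_node_timeout(1) applies the removal step first, leaving exactly one surviving member, the node itself.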