Hi Steven Dake,

Thank you for your reply.

I am changing the CC for this discussion to discuss@xxxxxxxxxxxx instead of
openais@xxxxxxxxxxxxxx since, as we know, openais@xxxxxxxxxxxxxx does not work.

On Tue, Oct 11, 2011 at 10:54 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 10/10/2011 08:58 PM, qiushu.zyk@xxxxxxxxxx wrote:
>> From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>
>> In our 20 nodes cluster testing (corosync v1.4.1 vs sheepdog), an abnormal exit
>> would occur when the consensus timeout expired and there were no other
>> processors in consensus_list.
>>
>> = analysis =
>> 1. When the consensus timeout expires, corosync enters the
>> memb_state_consensus_timeout_expired function.
>>
>> 2. If its consensus_list contains only my_id, the code executes
>> memb_set_merge, which makes my_failed_list equal to
>> my_proc_list.
>>
>> 3. It calls the memb_state_gather_enter function and mcasts a join message
>> which contains proc_list and failed_list with the same processor IDs.
>>
>> 4. The join message is received by the node itself and
>> memb_join_process/memb_consensus_agreed is called.
>>
>> 5. Because proc_list is equal to failed_list in the join message,
>> the assert instruction is reached and an abnormal exit occurs.
>>
>> = solution =
>> This patch tries to resolve this issue by removing my_id from
>> my_failed_list before corosync calls memb_state_gather_enter from
>> memb_state_consensus_timeout_expired.
>>
>> When a network partition occurs and the processor can't communicate with
>> any of the other processors, it will form a single ring containing only
>> itself rather than triggering an abnormal exit.
>>
>
> Yunkai,
>
> Thank you for the patch. I am really hesitant to make any changes to
> totemsrp that I haven't thought long and hard about. Your solution is
> clever and well thought out, but totem has thousands of details - to the
> point that I always want to get to the root cause of the issue when
> fixing problems.
>
> I think you're running into a "FAILED TO RECV" state.
> There is a patch outstanding for this issue but we have been unable to
> find anyone to test it. Our environments don't demonstrate a failed to
> receive scenario.
>
> Can you verify if you have FAILED TO RECV (should be in the fplay data).
>

Yes, corosync runs into a "FAILED TO RECV" state in the
message_handler_orf_token function.

These are the last three lines logged in /var/log/cluster/corosync.log:

Aug 25 13:42:25 corosync [TOTEM ] FAILED TO RECEIVE
Aug 25 13:42:25 corosync [TOTEM ] entering GATHER state from 6.
Aug 25 13:42:27 corosync [TOTEM ] entering GATHER state from 0.

According to our testing, this issue is easy to reproduce when two
conditions hold:
1) There are more than 20 nodes (not an exact threshold).
2) The network load is increased, which causes broadcast messages
   (regular msg/join msg) to fail frequently.

> If so, can you run with the patch:
> https://bugzilla.redhat.com/show_bug.cgi?id=636583 comment #8

Thanks for your patch. I plan to test it next week, as I don't have enough
nodes to test with right now.

>
> Thanks!
> -steve
>
>
>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>> ---
>>  exec/totemsrp.c |   33 +++++++++++++++++++++++++++++++++
>>  1 files changed, 33 insertions(+), 0 deletions(-)
>>
>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>> index 0778d55..653f801 100644
>> --- a/exec/totemsrp.c
>> +++ b/exec/totemsrp.c
>> @@ -1324,6 +1324,33 @@ static void memb_set_merge (
>>  	return;
>>  }
>>
>> +/*
>> + * remove subset from fullset
>> + */
>> +static void memb_set_remove(
>> +	const struct srp_addr *subset, int subset_entries,
>> +	struct srp_addr *fullset, int *fullset_entries)
>> +{
>> +	int found = 0;
>> +	int i;
>> +	int j;
>> +
>> +	for (i = 0; i < subset_entries; i++) {
>> +		for (j = 0; j < *fullset_entries; j++) {
>> +			if (srp_addr_equal (&fullset[j], &subset[i])) {
>> +				found = 1;
>> +				break;
>> +			}
>> +		}
>> +		if (found == 1) {
>> +			for (; j < (*fullset_entries-1); j++) {
>> +				srp_addr_copy (&fullset[j], &fullset[j+1]);
>> +			}
>> +			*fullset_entries = *fullset_entries - 1;
>> +		}
>> +	}
>> +}
>> +
>>  static void memb_set_and_with_ring_id (
>>  	struct srp_addr *set1,
>>  	struct memb_ring_id *set1_ring_ids,
>> @@ -1541,6 +1568,12 @@ static void memb_state_consensus_timeout_expired (
>>
>>  		memb_set_merge (no_consensus_list, no_consensus_list_entries,
>>  			instance->my_failed_list, &instance->my_failed_list_entries);
>> +
>> +		if (instance->my_proc_list_entries == instance->my_failed_list_entries){
>> +			memb_set_remove (&instance->my_id, 1,
>> +				instance->my_failed_list, &instance->my_failed_list_entries);
>> +		}
>> +
>>  		memb_state_gather_enter (instance, 0);
>>  	}
>>  }
>
>

--
Yunkai Zhang
work at taobao.com
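
For reference, the set-removal step described in the patch can be sketched in
isolation as below. This is only a minimal sketch, not the code from the
patch: struct node_addr, node_addr_equal and set_remove are hypothetical
stand-ins for struct srp_addr, srp_addr_equal and memb_set_remove, and the
real types in totemsrp.c are richer. The sketch does its compaction inside
the match branch, so nothing carries over between subset entries. In the
posted loop, found is set once and never cleared, which looks like it would
only matter if the function were called with more than one subset entry (the
patch always passes 1).

```c
#include <assert.h>

/* Hypothetical stand-in for struct srp_addr (assumption, not the real type). */
struct node_addr {
	unsigned int nodeid;
};

/* Hypothetical stand-in for srp_addr_equal. */
static int node_addr_equal (const struct node_addr *a, const struct node_addr *b)
{
	return a->nodeid == b->nodeid;
}

/*
 * Remove every member of subset that is present in fullset.
 * fullset is compacted in place and *fullset_entries is updated.
 */
static void set_remove (
	const struct node_addr *subset, int subset_entries,
	struct node_addr *fullset, int *fullset_entries)
{
	int i;
	int j;
	int k;

	assert (subset_entries >= 0 && *fullset_entries >= 0);

	for (i = 0; i < subset_entries; i++) {
		for (j = 0; j < *fullset_entries; j++) {
			if (node_addr_equal (&fullset[j], &subset[i])) {
				/* Shift the tail left over the removed slot. */
				for (k = j; k < *fullset_entries - 1; k++) {
					fullset[k] = fullset[k + 1];
				}
				*fullset_entries -= 1;
				break;
			}
		}
	}
}
```

With a failed list of node IDs {1, 2, 3} and my_id = 1, calling
set_remove (&my_id, 1, failed, &failed_entries) leaves {2, 3} with
failed_entries == 2, so the failed list is no longer equal to the proc list.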