On Tue, Oct 25, 2011 at 9:57 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 10/25/2011 05:32 AM, Yunkai Zhang wrote:
>> Hi Steven Dake,
>>
>> We tested your patch for a long time last week, but I am sorry to
>> report that we could not reproduce the issue again: it never ran into
>> the "FAIL" state. Corosync ran far more robustly than we expected.
>>
>
> So the issue could not be reproduced with the patch? But without the
> patch, you are not sure whether the problem still occurs?

Yes, I have tested it both ways: with the patch and without it. I also
added a log_printf before the patched code so that we could observe when
the program reached it.

I tested it with my coworker "Yinbin" <zituan@xxxxxxxxxx>, and we ran
the test no fewer than 10 times.

My current guess is that this issue may only reappear after corosync has
been running for more than 3 days. I plan to keep corosync running for
several more days in our later testing.

>
> Regards
> -steve
>
>> It's both good news and bad news :(
>>
>> Maybe there is an unknown condition needed to reproduce this issue,
>> and I will continue to monitor it.
>>
>> On Wed, Oct 12, 2011 at 8:24 PM, Yunkai Zhang <qiushu.zyk@xxxxxxxxxx> wrote:
>>> Hi Steven Dake,
>>>
>>> Thank you for your reply.
>>> I am changing the CC of this discussion to discuss@xxxxxxxxxxxx
>>> instead of openais@xxxxxxxxxxxxxx, as we know openais@xxxxxxxxxxxxxx
>>> does not work.
>>>
>>> On Tue, Oct 11, 2011 at 10:54 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>> On 10/10/2011 08:58 PM, qiushu.zyk@xxxxxxxxxx wrote:
>>>>> From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>
>>>>> In our 20-node cluster testing (corosync v1.4.1 with sheepdog), an
>>>>> abnormal exit would occur when the consensus timeout expired and
>>>>> there were no other processors in consensus_list.
>>>>>
>>>>> = analysis =
>>>>> 1. When the consensus timeout expires, corosync enters the
>>>>> memb_state_consensus_timeout_expired function.
>>>>>
>>>>> 2.
If its consensus_list contains only my_id, the code executes
>>>>> memb_set_merge, which makes my_failed_list equal to my_proc_list.
>>>>>
>>>>> 3. It calls the memb_state_gather_enter function and multicasts a
>>>>> join message whose proc_list and failed_list contain the same
>>>>> processor IDs.
>>>>>
>>>>> 4. The join message is received by the node itself, and
>>>>> memb_join_process/memb_consensus_agreed are called.
>>>>>
>>>>> 5. Because proc_list equals failed_list in the join message, the
>>>>> assert instruction is reached and an abnormal exit occurs.
>>>>>
>>>>> = solution =
>>>>> This patch tries to resolve the issue by removing my_id from
>>>>> my_failed_list before corosync calls memb_state_gather_enter from
>>>>> memb_state_consensus_timeout_expired.
>>>>>
>>>>> When a network partition occurs and the processor cannot
>>>>> communicate with any other processor, it will form a singleton
>>>>> ring containing only itself instead of triggering an abnormal
>>>>> exit.
>>>>>
>>>>
>>>> Yunkai,
>>>>
>>>> Thank you for the patch. I am really hesitant to make any changes
>>>> to totemsrp that I haven't thought long and hard about. Your
>>>> solution is clever and well thought out, but totem has thousands of
>>>> details - to the point that I always want to get to the root cause
>>>> of the issue when fixing problems.
>>>>
>>>> I think you're running into a "FAILED TO RECV" state. There is a
>>>> patch outstanding for this issue but we have been unable to find
>>>> anyone to test it. Our environments don't demonstrate a failed to
>>>> receive scenario.
>>>>
>>>> Can you verify whether you hit FAILED TO RECV (it should be in the
>>>> fplay data)?
>>>>
>>>
>>> Yes, corosync runs into the "FAILED TO RECV" state in the
>>> message_handler_orf_token function.
>>> We can see the last three lines of logging messages from
>>> /var/log/cluster/corosync.log as follows:
>>>
>>> Aug 25 13:42:25 corosync [TOTEM ] FAILED TO RECEIVE
>>> Aug 25 13:42:25 corosync [TOTEM ] entering GATHER state from 6.
>>> Aug 25 13:42:27 corosync [TOTEM ] entering GATHER state from 0.
>>>
>>> According to our testing, we can reproduce this issue quite easily
>>> under two conditions:
>>> 1) more than 20 nodes (not an exact threshold);
>>> 2) increased network load, which causes broadcast messages (regular
>>> msgs/join msgs) to fail frequently.
>>>
>>>> If so, can you run with the patch:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=636583 comment #8
>>>
>>> Thanks for your patch.
>>> I plan to test it next week, as I do not have enough nodes right now.
>>>
>>>>
>>>> Thanks!
>>>> -steve
>>>>
>>>>
>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>> ---
>>>>>  exec/totemsrp.c |   34 ++++++++++++++++++++++++++++++++++
>>>>>  1 files changed, 34 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>> index 0778d55..653f801 100644
>>>>> --- a/exec/totemsrp.c
>>>>> +++ b/exec/totemsrp.c
>>>>> @@ -1324,6 +1324,34 @@ static void memb_set_merge (
>>>>>  	return;
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * remove subset from fullset
>>>>> + */
>>>>> +static void memb_set_remove (
>>>>> +	const struct srp_addr *subset, int subset_entries,
>>>>> +	struct srp_addr *fullset, int *fullset_entries)
>>>>> +{
>>>>> +	int found;
>>>>> +	int i;
>>>>> +	int j;
>>>>> +
>>>>> +	for (i = 0; i < subset_entries; i++) {
>>>>> +		found = 0;
>>>>> +		for (j = 0; j < *fullset_entries; j++) {
>>>>> +			if (srp_addr_equal (&fullset[j], &subset[i])) {
>>>>> +				found = 1;
>>>>> +				break;
>>>>> +			}
>>>>> +		}
>>>>> +		if (found == 1) {
>>>>> +			for (; j < (*fullset_entries - 1); j++) {
>>>>> +				srp_addr_copy (&fullset[j], &fullset[j+1]);
>>>>> +			}
>>>>> +			*fullset_entries = *fullset_entries - 1;
>>>>> +		}
>>>>> +	}
>>>>> +}
>>>>> +
>>>>>  static void memb_set_and_with_ring_id (
>>>>>  	struct srp_addr *set1,
>>>>>  	struct memb_ring_id *set1_ring_ids,
>>>>> @@ -1541,6 +1569,12 @@ static void memb_state_consensus_timeout_expired (
>>>>>
>>>>>  	memb_set_merge (no_consensus_list, no_consensus_list_entries,
>>>>>  		instance->my_failed_list, &instance->my_failed_list_entries);
>>>>> +
>>>>> +	if (instance->my_proc_list_entries == instance->my_failed_list_entries) {
>>>>> +		memb_set_remove (&instance->my_id, 1,
>>>>> +			instance->my_failed_list, &instance->my_failed_list_entries);
>>>>> +	}
>>>>> +
>>>>>  	memb_state_gather_enter (instance, 0);
>>>>>  	}
>>>>>  }
>>>>
>>>
>>>
>>>
>>> --
>>> Yunkai Zhang
>>> work at taobao.com
>>>
>>
>>
>>
>
> _______________________________________________
> discuss mailing list
> discuss@xxxxxxxxxxxx
> http://lists.corosync.org/mailman/listinfo/discuss
>

--
Yunkai Zhang
Work at Taobao
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss