On 10/25/2011 05:32 AM, Yunkai Zhang wrote:
> Hi Steven Dake,
>
> We tested your patch for a long time last week, but I am sorry to
> tell you that we could not duplicate the issue again---it never ran
> into the "FAIL" state. Corosync ran solidly, beyond all our
> expectations.
>

So the issue could not be duplicated with the patch? But without the
patch, you're not sure whether the problem still occurs?

Regards
-steve

> It's both good news and bad news :(
>
> Maybe there is an unknown condition needed to duplicate this issue;
> I will continue to monitor it.
>
> On Wed, Oct 12, 2011 at 8:24 PM, Yunkai Zhang <qiushu.zyk@xxxxxxxxxx> wrote:
>> Hi Steven Dake,
>>
>> Thank you for your reply.
>> I am changing the CC of this discussion to discuss@xxxxxxxxxxxx instead of
>> openais@xxxxxxxxxxxxxx, as we know openais@xxxxxxxxxxxxxx does not work.
>>
>> On Tue, Oct 11, 2011 at 10:54 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>> On 10/10/2011 08:58 PM, qiushu.zyk@xxxxxxxxxx wrote:
>>>> From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>
>>>> In our 20-node cluster testing (corosync v1.4.1 with sheepdog), an
>>>> abnormal exit would occur when the consensus timeout expired and
>>>> there were no other processors in consensus_list.
>>>>
>>>> = analysis =
>>>> 1. When the consensus timeout expires, corosync enters the
>>>> memb_state_consensus_timeout_expired function.
>>>>
>>>> 2. If its consensus_list contains only my_id, the code executes
>>>> memb_set_merge, which makes my_failed_list equal to my_proc_list.
>>>>
>>>> 3. It calls the memb_state_gather_enter function and multicasts a
>>>> join message whose proc_list and failed_list contain the same
>>>> processor IDs.
>>>>
>>>> 4. The join message is received by the node itself, and
>>>> memb_join_process/memb_consensus_agreed are called.
>>>>
>>>> 5. Because the proc_list equals the failed_list in the join message,
>>>> the assert instruction is reached and an abnormal exit occurs.
>>>>
>>>> = solution =
>>>> This patch tries to resolve the issue by removing my_id from
>>>> my_failed_list before corosync calls memb_state_gather_enter from
>>>> memb_state_consensus_timeout_expired.
>>>>
>>>> When a network partition occurs and the processor cannot communicate
>>>> with any other processor, it will form a single ring containing only
>>>> itself rather than exiting abnormally.
>>>>
>>>
>>> Yunkai,
>>>
>>> Thank you for the patch. I am really hesitant to make any changes to
>>> totemsrp that I haven't thought long and hard about. Your solution is
>>> clever and well thought out, but totem has thousands of details - to
>>> the point that I always want to get to the root cause of the issue
>>> when fixing problems.
>>>
>>> I think you're running into a "FAILED TO RECV" state. There is a patch
>>> outstanding for this issue but we have been unable to find anyone to
>>> test it. Our environments don't demonstrate a failed-to-receive
>>> scenario.
>>>
>>> Can you verify whether you have FAILED TO RECV? (It should be in the
>>> fplay data.)
>>>
>>
>> Yes, corosync will run into a "FAILED TO RECV" state in the
>> message_handler_orf_token function.
>> We can see the last three lines of logging messages from
>> /var/log/cluster/corosync.log as follows:
>>
>> Aug 25 13:42:25 corosync [TOTEM ] FAILED TO RECEIVE
>> Aug 25 13:42:25 corosync [TOTEM ] entering GATHER state from 6.
>> Aug 25 13:42:27 corosync [TOTEM ] entering GATHER state from 0.
>>
>> According to our testing, we can duplicate this issue quite easily
>> under two conditions:
>> 1). more than 20 nodes (not exact);
>> 2). increased network load, which causes broadcast
>> messages (regular msg/join msg) to fail frequently.
>>
>>> If so, can you run with the patch:
>>> https://bugzilla.redhat.com/show_bug.cgi?id=636583 comment #8
>>
>> Thanks for your patch.
>> I plan to test it next week, as I don't have enough nodes to test
>> with now.
>>
>>>
>>> Thanks!
>>> -steve
>>>
>>>
>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>> ---
>>>>  exec/totemsrp.c |   34 ++++++++++++++++++++++++++++++++++
>>>>  1 files changed, 34 insertions(+), 0 deletions(-)
>>>>
>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>> index 0778d55..653f801 100644
>>>> --- a/exec/totemsrp.c
>>>> +++ b/exec/totemsrp.c
>>>> @@ -1324,6 +1324,34 @@ static void memb_set_merge (
>>>>  	return;
>>>>  }
>>>>
>>>> +/*
>>>> + * remove subset from fullset
>>>> + */
>>>> +static void memb_set_remove (
>>>> +	const struct srp_addr *subset, int subset_entries,
>>>> +	struct srp_addr *fullset, int *fullset_entries)
>>>> +{
>>>> +	int found;
>>>> +	int i;
>>>> +	int j;
>>>>
>>>> +	for (i = 0; i < subset_entries; i++) {
>>>> +		found = 0;
>>>> +		for (j = 0; j < *fullset_entries; j++) {
>>>> +			if (srp_addr_equal (&fullset[j], &subset[i])) {
>>>> +				found = 1;
>>>> +				break;
>>>> +			}
>>>> +		}
>>>> +		if (found == 1) {
>>>> +			for (; j < (*fullset_entries - 1); j++) {
>>>> +				srp_addr_copy (&fullset[j], &fullset[j + 1]);
>>>> +			}
>>>> +			*fullset_entries = *fullset_entries - 1;
>>>> +		}
>>>> +	}
>>>> +}
>>>> +
>>>>  static void memb_set_and_with_ring_id (
>>>>  	struct srp_addr *set1,
>>>>  	struct memb_ring_id *set1_ring_ids,
>>>> @@ -1541,6 +1569,12 @@ static void memb_state_consensus_timeout_expired (
>>>>
>>>>  	memb_set_merge (no_consensus_list, no_consensus_list_entries,
>>>>  		instance->my_failed_list, &instance->my_failed_list_entries);
>>>> +
>>>> +	if (instance->my_proc_list_entries == instance->my_failed_list_entries) {
>>>> +		memb_set_remove (&instance->my_id, 1,
>>>> +			instance->my_failed_list, &instance->my_failed_list_entries);
>>>> +	}
>>>> +
>>>>  	memb_state_gather_enter (instance, 0);
>>>>  }
>>>> }
>>>
>>>
>>
>>
>>
>> --
>> Yunkai Zhang
>> work at taobao.com
>>
>
>

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
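
[Editor's note: for readers following the logic of the patch above, here
is a minimal standalone sketch of the set-subtraction semantics that
memb_set_remove() introduces and of the guard added in
memb_state_consensus_timeout_expired. The node_id type and the
node_equal()/node_copy() helpers are simplified stand-ins for totemsrp's
struct srp_addr and its srp_addr_equal()/srp_addr_copy() -- illustrative
assumptions, not the real totemsrp types.]

#include <assert.h>
#include <stdio.h>

/* simplified stand-in for struct srp_addr */
typedef struct { unsigned int id; } node_id;

static int node_equal (const node_id *a, const node_id *b)
{
	return (a->id == b->id);
}

static void node_copy (node_id *dst, const node_id *src)
{
	dst->id = src->id;
}

/*
 * Remove every entry of subset from fullset, compacting fullset in
 * place. Mirrors the patch: linear search for each subset entry, then
 * shift the tail of fullset down by one slot.
 */
static void set_remove (
	const node_id *subset, int subset_entries,
	node_id *fullset, int *fullset_entries)
{
	int i, j;

	for (i = 0; i < subset_entries; i++) {
		int found = 0;

		for (j = 0; j < *fullset_entries; j++) {
			if (node_equal (&fullset[j], &subset[i])) {
				found = 1;
				break;
			}
		}
		if (found == 1) {
			for (; j < (*fullset_entries - 1); j++) {
				node_copy (&fullset[j], &fullset[j + 1]);
			}
			*fullset_entries -= 1;
		}
	}
}

int main (void)
{
	/* proc_list == failed_list: the state that trips the assert */
	node_id my_id = { 3 };
	node_id failed_list[] = { {1}, {2}, {3} };
	int failed_entries = 3;
	int proc_entries = 3;

	/* the patched path: drop my_id before rebroadcasting the join */
	if (proc_entries == failed_entries) {
		set_remove (&my_id, 1, failed_list, &failed_entries);
	}

	/* failed_list is now a strict subset of proc_list again */
	assert (failed_entries < proc_entries);
	printf ("failed_list entries after removal: %d\n", failed_entries);
	return (0);
}

Compiled and run, this prints "failed_list entries after removal: 2":
once my_id is dropped, failed_list can no longer equal proc_list, which
is exactly the condition that reached the assert via
memb_join_process/memb_consensus_agreed in the analysis above.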