Re: [RFC PATCH] Resolve an abnormal exit when consensus timeout expired.

On Tue, Oct 25, 2011 at 9:57 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 10/25/2011 05:32 AM, Yunkai Zhang wrote:
>> Hi Steven Dake,
>>
>> We tested your patch extensively last week, but I am sorry to tell you
>> that we could not duplicate the issue again---it never ran into the
>> "FAIL" state. Corosync ran far more robustly than we expected.
>>
>
> Then the issue could not be duplicated with the patch?  But without the
> patch, you're not sure whether the problem still occurs?

Yes, I have tested it both ways: with the patch and without it. I also
added a log_printf call just before the patched code (along the lines
shown below) so that we could observe when the program reached it.
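
A minimal sketch of that instrumentation, assuming the usual totemsrp
logging macro (the exact message text here is illustrative):

    log_printf (instance->totemsrp_log_level_notice,
            "consensus timeout expired: proc_list == failed_list, "
            "entering patched code path\n");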

I tested it together with my coworker "Yinbin" <zituan@xxxxxxxxxx>, and we
ran the test no fewer than 10 times. My current guess is that this issue
may only reappear after corosync has been running for more than 3 days.

I plan to keep corosync running for several more days in our later testing.

>
> Regards
> -steve
>
>> It's both good news and bad news :(
>>
>> Maybe there is some unknown condition needed to duplicate this issue, and I
>> will continue to monitor it.
>>
>> On Wed, Oct 12, 2011 at 8:24 PM, Yunkai Zhang <qiushu.zyk@xxxxxxxxxx> wrote:
>>> Hi Steven Dake,
>>>
>>> Thank you for your reply.
>>> I have changed the CC of this discussion to discuss@xxxxxxxxxxxx instead of
>>> openais@xxxxxxxxxxxxxx, as we know openais@xxxxxxxxxxxxxx no longer works.
>>>
>>> On Tue, Oct 11, 2011 at 10:54 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>> On 10/10/2011 08:58 PM, qiushu.zyk@xxxxxxxxxx wrote:
>>>>> From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>
>>>>> In our 20-node cluster testing (corosync v1.4.1 with sheepdog), an abnormal exit
>>>>> would occur when the consensus timeout expired and there were no other processors
>>>>> in the consensus_list.
>>>>>
>>>>> = analysis =
>>>>> 1. when the consensus timeout expires, corosync enters the
>>>>>    memb_state_consensus_timeout_expired function.
>>>>>
>>>>> 2. if its consensus_list contains only my_id, the code executes
>>>>>    memb_set_merge, which makes my_failed_list equal to
>>>>>    my_proc_list.
>>>>>
>>>>> 3. memb_state_gather_enter is called and a join message is multicast
>>>>>    containing a proc_list and a failed_list with the same processor IDs.
>>>>>
>>>>> 4. the join message is received by the sender itself, and
>>>>>    memb_join_process/memb_consensus_agreed are called.
>>>>>
>>>>> 5. because the proc_list equals the failed_list in the join message,
>>>>>    the assert instruction is reached (sketched below), and an abnormal exit occurs.
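>>>>>
>>>>> In rough pseudo form, the check that trips is in memb_consensus_agreed
>>>>> (simplified sketch; see totemsrp.c for the exact code):
>>>>>
>>>>>    /* token_memb = my_proc_list - my_failed_list */
>>>>>    memb_set_subtract (token_memb, &token_memb_entries,
>>>>>            instance->my_proc_list, instance->my_proc_list_entries,
>>>>>            instance->my_failed_list, instance->my_failed_list_entries);
>>>>>
>>>>>    /* when proc_list == failed_list, the subtraction leaves zero
>>>>>     * entries and this assert aborts corosync */
>>>>>    assert (token_memb_entries >= 1);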
>>>>>
>>>>> = solution =
>>>>> This patch tries to resolve the issue by removing my_id from
>>>>> my_failed_list before corosync calls memb_state_gather_enter from
>>>>> memb_state_consensus_timeout_expired.
>>>>>
>>>>> When a network partition occurs and the processor cannot communicate with
>>>>> any other processor, it will form a single ring containing only itself
>>>>> rather than triggering an abnormal exit.
>>>>>
>>>>
>>>> Yunkai,
>>>>
>>>> Thank you for the patch.  I am really hesitant to make any changes to
>>>> totemsrp that I haven't thought long and hard about.  Your solution is
>>>> clever and well thought out, but totem has thousands of details - to the
>>>> point that I always want to get to the root cause of the issue when
>>>> fixing problems.
>>>>
>>>> I think you're running into a "FAILED TO RECV" state.  There is a patch
>>>> outstanding for this issue but we have been unable to find anyone to
>>>> test it.  Our environments don't demonstrate a failed to receive scenario.
>>>>
>>>> Can you verify whether you hit FAILED TO RECV (it should be in the fplay data)?
>>>>
>>>
>>> Yes, corosync did run into the "FAILED TO RECV" state in the
>>> message_handler_orf_token function.
>>> The last three lines of logging messages in
>>> /var/log/cluster/corosync.log are as follows:
>>>
>>>   Aug 25 13:42:25 corosync [TOTEM ] FAILED TO RECEIVE
>>>   Aug 25 13:42:25 corosync [TOTEM ] entering GATHER state from 6.
>>>   Aug 25 13:42:27 corosync [TOTEM ] entering GATHER state from 0.
>>>
>>> According to our testing, we can duplicate this issue easily under
>>> two conditions:
>>> 1) more than 20 nodes (not an exact threshold).
>>> 2) increased network load, which causes broadcast messages
>>> (regular msg/join msg) to fail frequently.
>>>
>>>> If so, can you run with the patch:
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=636583 comment #8
>>>
>>> Thanks for your patch.
>>> I plan to test it next week, as I do not have enough nodes to test with right now.
>>>
>>>>
>>>> Thanks!
>>>> -steve
>>>>
>>>>
>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>> ---
>>>>>  exec/totemsrp.c |   33 +++++++++++++++++++++++++++++++++
>>>>>  1 files changed, 33 insertions(+), 0 deletions(-)
>>>>>
>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>> index 0778d55..653f801 100644
>>>>> --- a/exec/totemsrp.c
>>>>> +++ b/exec/totemsrp.c
>>>>> @@ -1324,6 +1324,33 @@ static void memb_set_merge (
>>>>>       return;
>>>>>  }
>>>>>
>>>>> +/*
>>>>> + * remove subset from fullset
>>>>> + */
>>>>> +static void memb_set_remove(
>>>>> +     const struct srp_addr *subset, int subset_entries,
>>>>> +     struct srp_addr *fullset, int *fullset_entries)
>>>>> +{
>>>>> +     int i;
>>>>> +     int j;
>>>>> +
>>>>> +     for (i = 0; i < subset_entries; i++) {
>>>>> +             /* look for subset[i] in fullset */
>>>>> +             for (j = 0; j < *fullset_entries; j++) {
>>>>> +                     if (srp_addr_equal (&fullset[j], &subset[i])) {
>>>>> +                             break;
>>>>> +                     }
>>>>> +             }
>>>>> +             /* if found, shift the remaining entries down by one */
>>>>> +             if (j < *fullset_entries) {
>>>>> +                     for (; j < (*fullset_entries-1); j++) {
>>>>> +                             srp_addr_copy (&fullset[j], &fullset[j+1]);
>>>>> +                     }
>>>>> +                     *fullset_entries = *fullset_entries - 1;
>>>>> +             }
>>>>> +     }
>>>>> +}
>>>>> +
>>>>>  static void memb_set_and_with_ring_id (
>>>>>       struct srp_addr *set1,
>>>>>       struct memb_ring_id *set1_ring_ids,
>>>>> @@ -1541,6 +1568,12 @@ static void memb_state_consensus_timeout_expired (
>>>>>
>>>>>               memb_set_merge (no_consensus_list, no_consensus_list_entries,
>>>>>                       instance->my_failed_list, &instance->my_failed_list_entries);
>>>>> +
>>>>> +             if (instance->my_proc_list_entries == instance->my_failed_list_entries){
>>>>> +                     memb_set_remove (&instance->my_id, 1,
>>>>> +                             instance->my_failed_list, &instance->my_failed_list_entries);
>>>>> +             }
>>>>> +
>>>>>               memb_state_gather_enter (instance, 0);
>>>>>       }
>>>>>  }
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Yunkai Zhang
>>> work at taobao.com
>>>
>>
>>
>>
>



-- 
Yunkai Zhang
Work at Taobao
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


