Re: [RFC PATCH] Resolve an abnormal exit when consensus timeout expired.

Yunkai Zhang <qiushu.zyk@xxxxxxxxxx> · Wed, 12 Oct 2011 20:24:07 +0800

Hi Steven Dake,

Thank you for your reply.
I am moving the CC for this discussion from openais@xxxxxxxxxxxxxx to
discuss@xxxxxxxxxxxx, since the openais@xxxxxxxxxxxxxx list is known
not to be working.

On Tue, Oct 11, 2011 at 10:54 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 10/10/2011 08:58 PM, qiushu.zyk@xxxxxxxxxx wrote:
>> From: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>
>> In our 20-node cluster testing (corosync v1.4.1 with sheepdog), an abnormal
>> exit occurs when the consensus timeout expires and there are no other
>> processors in consensus_list.
>>
>> = analysis =
>> 1. When the consensus timeout expires, corosync enters the
>>    memb_state_consensus_timeout_expired function.
>>
>> 2. If its consensus_list contains only my_id, the code
>>    executes memb_set_merge, which makes my_failed_list equal to
>>    my_proc_list.
>>
>> 3. It then calls memb_state_gather_enter and multicasts a join message
>>    that contains proc_list and failed_list with the same processor IDs.
>>
>> 4. The join message is received by the sender itself, and
>>    memb_join_process/memb_consensus_agreed is called.
>>
>> 5. Because proc_list equals failed_list in the join message,
>>    the assert is reached and an abnormal exit occurs.
>>
>> = solution =
>> This patch tries to resolve the issue by removing my_id from
>> my_failed_list before corosync calls memb_state_gather_enter from
>> memb_state_consensus_timeout_expired.
>>
>> When a network partition occurs and the processor cannot communicate with
>> any other processor, it forms a single ring containing only itself rather
>> than triggering an abnormal exit.
>>
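To make the analysis above concrete, here is a minimal standalone sketch of the single-node partition case. It is not corosync code: node_id, set_contains, surviving_members, and set_remove_one are hypothetical names, processors are reduced to plain integers, and I am assuming the assert in question requires at least one processor to remain after the failed list is subtracted from the proc list.

```c
#include <assert.h>

/* Hypothetical stand-in for struct srp_addr: one integer id per processor. */
typedef int node_id;

/* Return 1 if id occurs in set[0..entries-1], else 0. */
static int set_contains(const node_id *set, int entries, node_id id)
{
	int i;

	for (i = 0; i < entries; i++) {
		if (set[i] == id) {
			return 1;
		}
	}
	return 0;
}

/* Count the processors in proc_list that are not in failed_list. */
int surviving_members(const node_id *proc_list, int proc_entries,
	const node_id *failed_list, int failed_entries)
{
	int i;
	int count = 0;

	for (i = 0; i < proc_entries; i++) {
		if (!set_contains(failed_list, failed_entries, proc_list[i])) {
			count++;
		}
	}
	return count;
}

/* Remove id from set if present, shifting later entries down by one. */
void set_remove_one(node_id *set, int *entries, node_id id)
{
	int i;
	int j;

	for (i = 0; i < *entries; i++) {
		if (set[i] == id) {
			for (j = i; j < *entries - 1; j++) {
				set[j] = set[j + 1];
			}
			*entries = *entries - 1;
			return;
		}
	}
}

/* Walk through the single-node partition case from the analysis. */
void demo_single_node_partition(void)
{
	node_id my_id = 1;
	node_id proc_list[] = { 1, 2, 3 };
	node_id failed_list[] = { 1, 2, 3 };	/* after the merge: equals proc_list */
	int failed_entries = 3;

	/* Nothing survives, so an "at least one member" check would fail here. */
	assert(surviving_members(proc_list, 3, failed_list, failed_entries) == 0);

	/* The idea of the patch: take my_id back out of the failed list. */
	set_remove_one(failed_list, &failed_entries, my_id);

	/* Exactly one member (this node) is left to form a ring. */
	assert(surviving_members(proc_list, 3, failed_list, failed_entries) == 1);
}
```

Calling demo_single_node_partition() runs both checks; the second one passes only because my_id was removed from the failed list first.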
>
> Yunkai,
>
> Thank you for the patch.  I am really hesitant to make any changes to
> totemsrp that I haven't thought long and hard about.  Your solution is
> clever and well thought out, but totem has thousands of details - to the
> point that I always want to get to the root cause of the issue when
> fixing problems.
>
> I think you're running into a "FAILED TO RECV" state.  There is a patch
> outstanding for this issue but we have been unable to find anyone to
> test it.  Our environments don't demonstrate a failed to receive scenario.
>
> Can you verify whether you see FAILED TO RECV (it should be in the fplay data)?
>

Yes, corosync runs into the "FAILED TO RECV" state in the
message_handler_orf_token function.
The last three lines logged in /var/log/cluster/corosync.log are:

   Aug 25 13:42:25 corosync [TOTEM ] FAILED TO RECEIVE
   Aug 25 13:42:25 corosync [TOTEM ] entering GATHER state from 6.
   Aug 25 13:42:27 corosync [TOTEM ] entering GATHER state from 0.

According to our testing, we can reproduce this issue easily under
two conditions:
1) roughly 20 or more nodes (the threshold is not exact).
2) increased network load, which causes broadcast messages
(regular and join messages) to fail frequently.

> If so, can you run with the patch:
> https://bugzilla.redhat.com/show_bug.cgi?id=636583 comment #8

Thanks for the patch.
I plan to test it next week, as I do not have enough nodes available right now.

>
> Thanks!
> -steve
>
>
>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>> ---
>>  exec/totemsrp.c |   33 +++++++++++++++++++++++++++++++++
>>  1 files changed, 33 insertions(+), 0 deletions(-)
>>
>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>> index 0778d55..653f801 100644
>> --- a/exec/totemsrp.c
>> +++ b/exec/totemsrp.c
>> @@ -1324,6 +1324,33 @@ static void memb_set_merge (
>>       return;
>>  }
>>
>> +/*
>> + * remove subset from fullset
>> + */
>> +static void memb_set_remove(
>> +     const struct srp_addr *subset, int subset_entries,
>> +     struct srp_addr *fullset, int *fullset_entries)
>> +{
>> +     int found = 0;
>> +     int i;
>> +     int j;
>> +
>> +     for (i = 0; i < subset_entries; i++) {
>> +             for (j = 0; j < *fullset_entries; j++) {
>> +                     if (srp_addr_equal (&fullset[j], &subset[i])) {
>> +                             found = 1;
>> +                             break;
>> +                     }
>> +             }
>> +             if (found == 1) {
>> +                     for (; j < (*fullset_entries-1); j++) {
>> +                             srp_addr_copy (&fullset[j], &fullset[j+1]);
>> +                     }
>> +                     *fullset_entries = *fullset_entries - 1;
>> +             }
>> +     }
>> +}
>> +
>>  static void memb_set_and_with_ring_id (
>>       struct srp_addr *set1,
>>       struct memb_ring_id *set1_ring_ids,
>> @@ -1541,6 +1568,12 @@ static void memb_state_consensus_timeout_expired (
>>
>>               memb_set_merge (no_consensus_list, no_consensus_list_entries,
>>                       instance->my_failed_list, &instance->my_failed_list_entries);
>> +
>> +             if (instance->my_proc_list_entries == instance->my_failed_list_entries){
>> +                     memb_set_remove (&instance->my_id, 1,
>> +                             instance->my_failed_list, &instance->my_failed_list_entries);
>> +             }
>> +
>>               memb_state_gather_enter (instance, 0);
>>       }
>>  }
>
>
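
One note on memb_set_remove in the diff above: found is initialised once and never reset inside the outer loop. That is harmless for the call in this patch, which passes a single-entry subset, but with a longer subset a miss that follows a hit would still decrement the count. A minimal sketch of a variant without the flag, using a hypothetical struct addr and addr_equal in place of struct srp_addr and srp_addr_equal:

```c
#include <assert.h>

/* Hypothetical stand-in for corosync's struct srp_addr. */
struct addr {
	int id;
};

/* Hypothetical stand-in for srp_addr_equal. */
static int addr_equal(const struct addr *a, const struct addr *b)
{
	return a->id == b->id;
}

/*
 * Remove every element of subset from fullset, in place.
 * The shift happens at the point of the match, so no flag
 * carries over from one subset entry to the next.
 */
void set_remove(const struct addr *subset, int subset_entries,
	struct addr *fullset, int *fullset_entries)
{
	int i;
	int j;
	int k;

	assert(subset_entries >= 0 && *fullset_entries >= 0);

	for (i = 0; i < subset_entries; i++) {
		for (j = 0; j < *fullset_entries; j++) {
			if (addr_equal(&fullset[j], &subset[i])) {
				/* Close the gap left by the removed entry. */
				for (k = j; k < *fullset_entries - 1; k++) {
					fullset[k] = fullset[k + 1];
				}
				*fullset_entries = *fullset_entries - 1;
				break;
			}
		}
	}
}
```

For example, removing the subset {2, 9} from {1, 2, 3} leaves {1, 3} with two entries; the entry 9 is not in the full set and changes nothing.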

-- 
Yunkai Zhang
work at taobao.com