Re: [PATCH] Ignore the commit_token retransmitted by upstream node on the first rotation

On 12/12/2011 09:37 AM, Yunkai Zhang wrote:
> On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>>> it waited for the commit_token on the second rotation. But when
>>>>>>> it got a commit_token, it did not check the status of this token,
>>>>>>> which could have been retransmitted by the upstream node on the first
>>>>>>> rotation, and so the representative node would send the initial ORF
>>>>>>> token prematurely.
>>>>>>>
>>>>>>> If other nodes that have not yet entered the RECOVERY state receive this
>>>>>>> incorrect initial orf_token (whose seq is 0), those nodes may crash
>>>>>>> on the assert instruction in the *orf_token_rtr* function:
>>>>>>>   ...
>>>>>>
>>>>>> Yunkai,
>>>>>>
>>>>>> First off thanks for digging into Totem.  I know the protocol is hard to
>>>>>> understand and it's nice to have more people to bounce ideas off of.
>>>>>>
>>>>>> Do you have a test case for this?
>>>>>>
>>>>>> The patch is a clever solution, but the root cause seems incorrect.  I'd like
>>>>>> to get to the RCA of your bug.  Just thinking about some of the problems
>>>>>> you have reported in the past (13 seconds to execute a recovery copy per
>>>>>> node), it is possible that a retransmission of the commit token from 5
>>>>>> to 1 (kicking off a new orf token) in the example below could occur
>>>>>> while in recovery.  That should be ok though, since retransmitted orf
>>>>>> tokens are discarded (totemsrp:3564).  I don't see any immediate way to
>>>>>> have a premature token transmission.
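Roughly, that discard works because each processor remembers the highest
token_seq it has already handled and drops anything at or below it.  A sketch
of the idea only, not the actual totemsrp.c code:

/* sketch only: recognizing a retransmitted orf token by its token_seq */
struct processor_state {
        int my_token_seq;   /* highest token_seq handled so far, -1 at start */
};

static int token_is_retransmission (const struct processor_state *p, int token_seq)
{
        /* not newer than the last token this processor handled => retransmission */
        return (token_seq <= p->my_token_seq);
}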
>>>>>>
>>>>>> Let's examine 5 processors:
>>>>>>
>>>>>> When in gather, a membership is built.  Once this membership is agreed
>>>>>> upon, a commit token is originated by the ring rep, creating the first
>>>>>> transmission around the ring.
>>>>>>
>>>>>> totemsrp.c:4044
>>>>>>
>>>>>> Example with 5 processors:
>>>>>> 1,2,3,4,5 in gather
>>>>>> 1 originates commit token
>>>>>>
>>>>>>
>>>>>> Once a commit token is received in the gather state, the processor
>>>>>> enters commit (first token rotation)
>>>>>>
>>>>>> totemsrp.c:4348
>>>>>>
>>>>>> Example:
>>>>>> 1 goes to commit
>>>>>> 2 goes to commit
>>>>>> 3 goes to commit
>>>>>> 4 goes to commit
>>>>>> 5 goes to commit
>>>>>>
>>>>>> Once a commit token is received in the commit state, the processor
>>>>>> enters the recovery state only if this is a new commit token
>>>>>>
>>>>>> totemsrp.c:4360
>>>>>>
>>>>>> 1 goes to recovery
>>>>>> 2 goes to recovery
>>>>>> 3 goes to recovery
>>>>>> 4 goes to recovery
>>>>>> 5 goes to recovery
>>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>>
>>>>>> totemsrp.c:4374
>>>>>>
>>>>>>
>>>>>> In the above failure-free scenario, we see that we will never get a
>>>>>> premature orf token.
>>>>>>
>>>>>> Perhaps the problem you're running into is that you're mixing partition/merging
>>>>>> while forming a new ring.  This could result in my_id.addr[0] and ring
>>>>>> rep matching, but *old* commit tokens not being discarded by the ring
>>>>>> rep in the recovery state.
>>>>>>
>>>>>> I believe the code would be better served by something like:
>>>>>>
>>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>    memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>> Note it may be safer to compare the entire ring_id rather than only the seq.
>>>>>>
>>>>>> This will cause any commit token not from the current agreed processors
>>>>>> to be rejected.  The ring seq increases by 4 for each new ring and it is
>>>>>> not possible to have a matching seq + rep in two separate rings (this is
>>>>>> why we store the ring seq to stable storage).
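Spelled out, comparing the full ring_id could look roughly like the sketch
below.  The { rep, seq } layout of struct memb_ring_id and the helper name are
assumptions of the sketch, not a tested patch:

/* sketch only: match both the ring rep address and the ring seq */
static int ring_id_matches (
        const struct memb_ring_id *a,
        const struct memb_ring_id *b)
{
        return (totemip_equal (&a->rep, &b->rep) && (a->seq == b->seq));
}

if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
    ring_id_matches (&memb_commit_token->ring_id, &instance->my_ring_id)) {
...
}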
>>>>>>
>>>>>> Could you test that or tell me your test case or both?
>>>>>>
>>>>>
>>>>> Let me discuss my test case based on your example mentioned above:
>>>>>
>>>>> 1,2,3,4,5 in gather
>>>>>
>>>>> 1 originates commit token
>>>>>   => enter commit state
>>>>>     => send commit_token(token_seq:0) to 2
>>>>>       => receive commit_token(token_seq:4) from 5
>>>>>         => enter RECOVERY state
>>>>>           => receive commit_token(token_seq:4) retransmitted by 5
>>>>>             => enter OPERATIONAL state *prematurely*!!
>>>>>               => send initial ORF token to 2 *prematurely*!!
>>>>>
>>>>> 2 receive commit_token(token_seq:0) from 1
>>>>>   =>enter commit state
>>>>>     => send commit_token(token_seq:1) to 3
>>>>>       => commit token retransmit timeout occur!!
>>>>>         => retransmit commit_token(token_seq:1) ...
>>>>>         => retransmit commit_token(token_seq:1) ...
>>>>>         => ...
>>>>>           => receive JoinMSG from 3
>>>>>             => enter GATHER state, ...
>>>>>               => receive initial ORF token from 1
>>>>>                 => crash in *orf_token_rtr* function!!
>>>
>>> Today, I checked my logging file carefully again; the description above
>>> was not quite accurate.  The corrected sequence is:
>>>
>>> 2 receive commit_token(token_seq:0) from 1
>>>    =>enter commit state
>>>      => send commit_token(token_seq:1) to 3
>>>        => receive commit_token(token_seq:6) from 1
>>>          => enter RECOVERY state, **set orf_token_discard=0**
>>>            => commit token retransmit timeout occur!!
>>>              => retransmit commit_token(token_seq:1) ...
>>>              => retransmit commit_token(token_seq:1) ...
>>>              => ...
>>>                => receive JoinMSG from 3 in RECOVERY state
>>>                  => call memb_recovery_state_token_loss
>>>                     => old_ring_state_restore, set instance->my_aru=39
>>>                     => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
>>>                       => enter GATHER state
>>>                  => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
>>>                     => enter GATHER state
>>>                   => receive initial ORF token from 1
>>>                      => crash in *orf_token_rtr* function!!
>>>
>>> I can confirm that the value of instance->memb_state was
>>> MEMB_STATE_GATHER (=2) when node 2 received the initial ORF token.  I used
>>> the following code for logging in the *message_handler_orf_token* function:
>>>
>>> =code=
>>> log_printf (LOG_DEBUG,
>>>         "recv orf_token from:%s, token_seq:%d, my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>>>         nodeid_to_ip (token->header.nodeid, buf),
>>>         token->token_seq, instance->my_token_seq,
>>>         token->aru, instance->my_aru,
>>>         token->seq, instance->memb_state);
>>> =log output=
>>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff, my_aru:39, seq:0, state:2
>>>
>>> This was the last log record before corosync exited.
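Those values are consistent with the assert in orf_token_rtr firing.  Assuming
seq and my_aru are 32-bit unsigned values, the subtraction wraps around
(illustrative arithmetic only, not code from totemsrp.c):

unsigned int seq    = 0x0;           /* token->seq from the log above */
unsigned int my_aru = 0x39;          /* instance->my_aru from the log above */
unsigned int range  = seq - my_aru;  /* wraps to 0xffffffc7 (4294967239) */
/* so assert (range < QUEUE_RTR_ITEMS_SIZE_MAX) fails */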
>>>
>>> But I can't believe that corosync did not discard the ORF token when
>>> instance->memb_state equals MEMB_STATE_GATHER (as soon as corosync
>>> enters the GATHER state, orf_token_discard should be set to 1).
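The gate being relied on there looks roughly like this; the field name matches
totemsrp.c, but the rest is an illustrative sketch, not the exact flatiron-1.4
code:

/* sketch only */
struct srp_instance {
        int orf_token_discard;
};

static void gather_enter (struct srp_instance *instance)
{
        instance->orf_token_discard = 1;   /* set on every entry to GATHER */
}

static void recovery_enter (struct srp_instance *instance)
{
        instance->orf_token_discard = 0;   /* the only place it is cleared */
}

static int handle_orf_token (struct srp_instance *instance)
{
        if (instance->orf_token_discard) {
                return (0);                /* token dropped; orf_token_rtr never runs */
        }
        /* ... orf_token_rtr () is only reached past this point ... */
        return (1);
}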
>>>
>>> A reasonable guess is that the value of orf_token_discard was cleared
>>> for some unknown reason -- maybe there is memory corruption.
>>>
>>
>> Possible, but there is another possibility I'd like you to try, if you're willing.
>>
>> First, can you check with gdb the core file that is created when there
>> is a crash in orf_token_rtr?  You will need to use up/down to find the
>> stack frame that has "instance" in it, then print out instance:
>>
>> print *instance
> 
> I forgot to print this info, it is a very good tip, thank you very much!
> 
> I'm at home now, I will try it tomorrow.
> 
>>
>> This will print out a bunch of stuff, one of which will be the
>> orf_token_discard value.  (This will tell us if there is indeed memory
>> corruption or an error in our understanding of the code.)
>>
>> Second, run corosync-fplay.  This will play back the flight recorder.
>> One thing I definitely want to know is when corosync enters gather, who
>> called it.  Just redirect the flight recorder output into a file and attach
>> it to an email.
>>
> 
> I had run the corosync-fplay tool when I observed this issue, but our
> cluster has since been restarted and the old fdata file has been
> destroyed.
> 
>> Third, I will send a patch which _may_ fix the problem.  Also, which
>> version of corosync do you have?
>>
> 
> Our code is based on corosync-v1.4.1, but I have backported the necessary
> patches from the current master branch.
> 

Do you have the following commit in your repo?  If not, all sorts of bad
things will happen with your test scenarios:
commit be608c050247e5f9c8266b8a0f9803cc0a3dc881
Author: Steven Dake <sdake@xxxxxxxxxx>
Date:   Tue Aug 30 22:25:21 2011 -0700

    Ignore memb_join messages during flush operations

    a memb_join operation that occurs during flushing can result in an
    entry into the GATHER state from the RECOVERY state.  This results in
    the regular sort queue being used instead of the recovery sort queue,
    resulting in segfault.

    Signed-off-by: Steven Dake <sdake@xxxxxxxxxx>
    Reviewed-by: Jan Friesse <jfriesse@xxxxxxxxxx>
    (cherry picked from commit 48ffa8892daac18935d96ae46a72aebe2fb70430)
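In spirit, that commit guards the memb_join handler along these lines.  This
is an illustrative sketch only; the "flushing" flag is an assumption of the
sketch, not lifted from the actual patch:

/* sketch only, not the actual patch */
static int handle_memb_join (int flushing /* nonzero while a flush is in progress */)
{
        if (flushing) {
                /* ignore the join: acting on it could push a processor from
                 * RECOVERY back into GATHER with the regular sort queue */
                return (0);
        }
        /* ... normal memb_join processing ... */
        return (1);
}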

Regards
-steve

>> Regards
>> -steve
>>
>>>>>
>>>>> 3 receive commit_token(token_seq:1) from 2
>>>>>   => enter commit state
>>>>>     => send commit_token(token_seq:2) to 4
>>>>>       => commit token retransmit timeout occur!!
>>>>>         => retransmit commit_token(token_seq:2) ...
>>>>>         => retransmit commit_token(token_seq:2) ...
>>>>>         => ...
>>>>>           => commit token lost timeout occur!!
>>>>>             => enter GATHER state, multicast JoinMSG ...
>>>>>
>>>>> 4 receive commit_token(token_seq:2) from 3
>>>>>   => enter commit state
>>>>>     => send commit_token(token_seq:3) to 5
>>>>>       => commit token retransmit timeout occur!!
>>>>>         => retransmit commit_token(token_seq:3) to 5...
>>>>>         => retransmit commit_token(token_seq:3) to 5...
>>>>>         => ...
>>>>>
>>>>> 5 receive commit_token(token_seq:3) from 4
>>>>>   => enter commit state
>>>>>     => send commit_token(token_seq:4) to 1
>>>>>       => commit token retransmit timeout occur!!
>>>>>         => retransmit commit_token(token_seq:4) to 1...
>>>>>         => retransmit commit_token(token_seq:4) to 1...
>>>>>         => ...
>>>>>
>>>>> The key point in my test case is that one of the processors (3 in this
>>>>> case) runs into a commit token timeout, while 1 receives the commit token
>>>>> retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>> Yunkai,
>>>>
>>>> I took a look at the code with a fresh set of eyes this morning for 3
>>>> hours and don't see how it is possible for processor 2 to crash in
>>>> orf_token_rtr with the above scenario.  The orf token will be discarded
>>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>>> orf_token_rtr should *never* run while the processor is in the GATHER or
>>>> COMMIT state.
>>>>
>>>> (flatiron-1.4 line numbers):
>>>>
>>>> On startup:
>>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>>> totemsrp.c:4467 - when the network interface is detected, gather_enter is called
>>>>        state = gather, orf_token_discard = 1
>>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>>
>>>> we can proceed to enter OPERATIONAL here or go directly to COMMIT
>>>>
>>>> COMMIT path:
>>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>> (orf_token_discard still equals 1)
>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>> (orf_token_discard still equals 1)
>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>> equals 1)
>>>>
>>>> The only code that resets orf_token_discard to 0 is
>>>> totemsrp.c:2008 - entered recovery state
>>>>
>>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>>> COMMIT:
>>>>
>>>> totemsrp.c:4276 - memb_state_gather_enter called
>>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>> (orf_token_discard still equals 1)
>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>> equals 1)
>>>>
>>>> Given that the orf token would always be discarded by processor 2 in the
>>>> above test case you gave, I don't see a scenario in the code where
>>>> orf_token_rtr could crash in that test case.  If you can show me how
>>>> this happens, we could fix this up.  One possibility is some sort of
>>>> thread race.  Could you determine that orf_token_discard is equal to 0
>>>> when your failure occurs?
>>>>
>>>> As for entering operational prematurely:
>>>> If this is a problem, it is a separate problem from the above analysis.
>>>>  Let's get the above analysis fixed if it is a problem.
>>>>
>>>> Really looking forward to hearing back from you so we can fix the
>>>> problem you see.
>>>>
>>>> Regards
>>>> -steve
>>>>
>>>>>> Thanks
>>>>>> -steve
>>>>>>
>>>>>>
>>>>>>>   range = orf_token->seq - instance->my_aru;
>>>>>>>   assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>>>   ...
>>>>>>>
>>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>>> ---
>>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>>
>>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>>> index 5a78962..22717ac 100644
>>>>>>> --- a/exec/totemsrp.c
>>>>>>> +++ b/exec/totemsrp.c
>>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>>                       break;
>>>>>>>
>>>>>>>               case MEMB_STATE_RECOVERY:
>>>>>>> -                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>>> +                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>> +                             memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>>                               log_printf (instance->totemsrp_log_level_debug,
>>>>>>>                                       "Sending initial ORF token\n");
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
> 
> 
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

