On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>> it waited for the commit_token on the second rotation. But when it
>>>>>> got a commit_token, it didn't check the status of that token, which
>>>>>> could have been retransmitted by the upstream node on the first
>>>>>> rotation, so the representative node would send the initial ORF
>>>>>> token prematurely.
>>>>>>
>>>>>> If other nodes which have not yet entered the RECOVERY state receive
>>>>>> this incorrect initial orf_token (whose seq is 0), those nodes may
>>>>>> crash on the assert instruction in the *orf_token_rtr* function:
>>>>>> ...
>>>>>
>>>>> Yunkai,
>>>>>
>>>>> First off, thanks for digging into Totem. I know the protocol is hard
>>>>> to understand and it's nice to have more people to bounce ideas off of.
>>>>>
>>>>> Do you have a test case for this?
>>>>>
>>>>> The patch is a clever solution, but the root cause seems incorrect. I'd
>>>>> like to get to the RCA of your bug. Just thinking about some of the
>>>>> problems you have reported in the past (13 seconds to execute a recovery
>>>>> copy per node), it could be possible that a retransmission of the commit
>>>>> token from 5 to 1 (kicking off a new orf token) in the example below
>>>>> could occur while in recovery. That should be ok though, since
>>>>> retransmitted orf tokens are discarded (totemsrp.c:3564). I don't see
>>>>> any immediate way to have a premature token transmission.
>>>>>
>>>>> Let's examine 5 processors:
>>>>>
>>>>> When in gather, a membership is built. Once this membership is agreed
>>>>> upon, a commit token is originated by the ring rep, creating the first
>>>>> transmission around the ring.
>>>>>
>>>>> totemsrp.c:4044
>>>>>
>>>>> Example with 5 processors:
>>>>> 1,2,3,4,5 in gather
>>>>> 1 originates commit token
>>>>>
>>>>> Once a commit token is received in the gather state, the processor
>>>>> enters commit (first token rotation):
>>>>>
>>>>> totemsrp.c:4348
>>>>>
>>>>> Example:
>>>>> 1 goes to commit
>>>>> 2 goes to commit
>>>>> 3 goes to commit
>>>>> 4 goes to commit
>>>>> 5 goes to commit
>>>>>
>>>>> Once a commit token is received in the commit state, the processor
>>>>> enters the recovery state, but only if this is a new commit token:
>>>>>
>>>>> totemsrp.c:4360
>>>>>
>>>>> 1 goes to recovery
>>>>> 2 goes to recovery
>>>>> 3 goes to recovery
>>>>> 4 goes to recovery
>>>>> 5 goes to recovery
>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>
>>>>> totemsrp.c:4374
>>>>>
>>>>> In the above failure-free scenario, we see that we will never get a
>>>>> premature orf token.
>>>>>
>>>>> Perhaps the problem you're running into is that you're mixing
>>>>> partitioning and merging while forming a new ring. This could result in
>>>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>>>> being discarded by the ring rep in the recovery state.
>>>>>
>>>>> I believe the code would be better served by something like:
>>>>>
>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>     memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>> ...
>>>>> }
>>>>>
>>>>> Note it may be safer to compare the entire ring_id rather than only the
>>>>> seq.
>>>>>
>>>>> This will cause any commit token not from the current agreed processors
>>>>> to be rejected. The ring seq increases by 4 for each new ring and it is
>>>>> not possible to have a matching seq + rep in two separate rings (this is
>>>>> why we store the ring seq to stable storage).
>>>>>
>>>>> Could you test that, or tell me your test case, or both?
>>>>>
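To make that guard concrete, here is a minimal standalone sketch of the idea
(all types and field names below are simplified stand-ins for illustration
only, not the real totemsrp structures): only the ring rep, and only a commit
token carrying the seq of the ring currently being formed, should trigger the
initial ORF token.

/* model_commit_guard.c - simplified model of the proposed check */
#include <stdio.h>

struct model_totem_ip {
    unsigned int nodeid;                /* stand-in for a totem_ip_address */
};

struct model_ring_id {
    struct model_totem_ip rep;          /* ring representative */
    unsigned long long seq;             /* ring seq, +4 for each new ring */
};

struct model_commit_token {
    struct model_ring_id ring_id;       /* ring this commit token belongs to */
};

struct model_instance {
    struct model_totem_ip my_id;
    struct model_ring_id my_ring_id;    /* ring currently being formed */
};

static int model_totemip_equal (const struct model_totem_ip *a,
    const struct model_totem_ip *b)
{
    return a->nodeid == b->nodeid;
}

/*
 * Return 1 when this processor should originate the initial ORF token:
 * it must be the ring rep, and the commit token must belong to the ring
 * currently being formed (comparing the whole ring_id would be stricter).
 */
static int should_send_initial_orf_token (const struct model_instance *inst,
    const struct model_commit_token *tok)
{
    return model_totemip_equal (&inst->my_id, &inst->my_ring_id.rep) &&
        tok->ring_id.seq == inst->my_ring_id.seq;
}

int main (void)
{
    struct model_instance inst = { { 1 }, { { 1 }, 44 } };
    struct model_commit_token stale = { { { 1 }, 40 } };  /* previous ring */
    struct model_commit_token fresh = { { { 1 }, 44 } };  /* current ring */

    printf ("stale commit token -> send initial orf? %d\n",
        should_send_initial_orf_token (&inst, &stale));
    printf ("fresh commit token -> send initial orf? %d\n",
        should_send_initial_orf_token (&inst, &fresh));
    return 0;
}

With the ring seq increasing by 4 per ring, a commit token left over from the
previous ring (seq 40 above) is rejected, while the current ring's token
(seq 44) is accepted.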
>>>>
>>>> Let me discuss my test case based on the example you mentioned above:
>>>>
>>>> 1,2,3,4,5 in gather
>>>>
>>>> 1 originates commit token
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:0) to 2
>>>> => receive commit_token(token_seq:4) from 5
>>>> => enter RECOVERY state
>>>> => receive commit_token(token_seq:4) retransmitted by 5
>>>> => enter OPERATIONAL state *prematurely*!!
>>>> => send initial ORF token to 2 *prematurely*!!
>>>>
>>>> 2 receives commit_token(token_seq:0) from 1
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:1) to 3
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:1) ...
>>>> => retransmit commit_token(token_seq:1) ...
>>>> => ...
>>>> => receive JoinMSG from 3
>>>> => enter GATHER state, ...
>>>> => receive initial ORF token from 1
>>>> => crash in the *orf_token_rtr* function!!
>>
>> Today I checked my logging file carefully again; this description was
>> not entirely accurate. The actual sequence was:
>>
>> 2 receives commit_token(token_seq:0) from 1
>> => enter COMMIT state
>> => send commit_token(token_seq:1) to 3
>> => receive commit_token(token_seq:6) from 1
>> => enter RECOVERY state, **set orf_token_discard=0**
>> => commit token retransmit timeout occurs!!
>> => retransmit commit_token(token_seq:1) ...
>> => retransmit commit_token(token_seq:1) ...
>> => ...
>> => receive JoinMSG from 3 in RECOVERY state
>> => call memb_recovery_state_token_loss
>> => old_ring_state_restore, set instance->my_aru=39
>> => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
>> => enter GATHER state
>> => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
>> => enter GATHER state
>> => receive initial ORF token from 1
>> => crash in the *orf_token_rtr* function!!
>>
>> I can confirm that the value of instance->memb_state was
>> MEMB_STATE_GATHER(=2) when 2 received the initial ORF token. I used the
>> following code for logging in the *message_handler_orf_token* function:
>>
>> =code=
>> log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>>     nodeid_to_ip(token->header.nodeid, buf),
>>     token->token_seq, instance->my_token_seq,
>>     token->aru, instance->my_aru, token->seq,
>>     instance->memb_state);
>>
>> =log output=
>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token
>> from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff,
>> my_aru:39, seq:0, state:2
>>
>> It was the last log record before corosync exited.
>>
>> But I can't believe that corosync did not discard the ORF token when
>> instance->memb_state equalled MEMB_STATE_GATHER (as soon as corosync
>> enters the GATHER state, orf_token_discard is set to 1).
>>
>> A reasonable guess is that the value of orf_token_discard was cleared
>> for some unknown reason -- maybe there is memory corruption.
>>
>
> Possible, but I have another possibility I'd like you to try, if you're
> willing.
>
> First, can you check the core file with gdb that is created when there
> is a crash in orf_token_rtr? What you will need to do is use up/down to
> find the stack frame that has "instance" in it, then print out instance:
>
> print *instance
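In concrete terms, such a gdb session would look roughly like this (the
binary and core file names are only placeholders; repeat "up" until a
totemsrp frame containing "instance" is selected):

  $ gdb /usr/sbin/corosync core
  (gdb) bt
  (gdb) up
  (gdb) print *instance
  (gdb) print instance->orf_token_discard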
I forgot to print this info, it is a very good tip, thank you very much!
I'm at home now, I will try it tomorrow.

> This will print out a bunch of stuff, one of which will be the
> orf_token_discard value. (This will tell us if there is indeed memory
> corruption or an error in our understanding of the code.)
>
> Second, run corosync-fplay. This will play back the flight recorder.
> One thing I definitely want to know is, when corosync enters gather, who
> called it. Just redirect the flight recorder output into a file and
> attach it to an email.
>

I had run the corosync-fplay tool when I observed this issue, but our
cluster has since been restarted and the old fdata file has been
destroyed.

> Third, I will send a patch which _may_ fix the problem. Also, which
> version of corosync do you have?
>

Our code is based on corosync v1.4.1, but I have backported the necessary
patches from the current master branch.

> Regards
> -steve
>
>>>>
>>>> 3 receives commit_token(token_seq:1) from 2
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:2) to 4
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:2) ...
>>>> => retransmit commit_token(token_seq:2) ...
>>>> => ...
>>>> => commit token lost timeout occurs!!
>>>> => enter GATHER state, multicast JoinMSG ...
>>>>
>>>> 4 receives commit_token(token_seq:2) from 3
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:3) to 5
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:3) to 5 ...
>>>> => retransmit commit_token(token_seq:3) to 5 ...
>>>> => ...
>>>>
>>>> 5 receives commit_token(token_seq:3) from 4
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:4) to 1
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:4) to 1 ...
>>>> => retransmit commit_token(token_seq:4) to 1 ...
>>>> => ...
>>>>
>>>> The key point in my test case is that one of the processors (3 in this
>>>> case) runs into the commit token timeout, while 1 receives the commit
>>>> token retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>>
>>> Yunkai,
>>>
>>> I took a look at the code with a fresh set of eyes this morning for 3
>>> hours and don't see how it is possible for processor 2 to crash in
>>> orf_token_rtr with the above scenario. The orf token will be discarded
>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>> orf_token_rtr should *never* run in GATHER or COMMIT states of the
>>> processor.
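As a tiny standalone model of that claim (simplified stand-in types and an
illustrative constant, not the real totemsrp.c code): while orf_token_discard
is set, the handler drops the token before orf_token_rtr can run; if the flag
were ever 0 while my_aru is ahead of the token's seq, the unsigned
subtraction wraps around and the assert fires, which is exactly the
seq:0 / my_aru:0x39 situation in the log record above.

/* model_orf_discard.c - simplified model of the discard/assert interaction */
#include <assert.h>
#include <stdio.h>

#define QUEUE_RTR_ITEMS_SIZE_MAX 16384      /* illustrative value only */

struct model_instance {
    int orf_token_discard;          /* 1 in GATHER/COMMIT, 0 only in RECOVERY */
    unsigned int my_aru;
};

struct model_orf_token {
    unsigned int seq;
};

static void model_orf_token_rtr (struct model_instance *inst,
    const struct model_orf_token *tok)
{
    unsigned int range = tok->seq - inst->my_aru;   /* wraps if seq < my_aru */

    assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);      /* the assert that fired */
    /* ... retransmission handling elided ... */
}

static void model_message_handler_orf_token (struct model_instance *inst,
    const struct model_orf_token *tok)
{
    if (inst->orf_token_discard) {
        return;                     /* token dropped: rtr never runs */
    }
    model_orf_token_rtr (inst, tok);
}

int main (void)
{
    struct model_orf_token initial = { 0 };             /* seq 0 */
    struct model_instance in_gather = { 1, 0x39 };      /* discard flag set */
    struct model_instance flag_lost = { 0, 0x39 };      /* discard flag lost */

    model_message_handler_orf_token (&in_gather, &initial);
    printf ("token discarded while in GATHER, as expected\n");

    /* 0 - 0x39 wraps to a huge unsigned value, so the assert aborts here */
    model_message_handler_orf_token (&flag_lost, &initial);
    return 0;
}

So the open question is how node 2's orf_token_discard could have been 0
while memb_state was already back in GATHER.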
>>>
>>> (flatiron-1.4 line numbers):
>>>
>>> On startup:
>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>> totemsrp.c:4467 - when the network interface is detected, call
>>>                   gather_enter; state = gather, orf_token_discard = 1
>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>
>>> We can proceed to enter OPERATIONAL here or go directly to COMMIT.
>>>
>>> COMMIT path:
>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>                   (orf_token_discard still equals 1)
>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>                   (orf_token_discard still equals 1)
>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>                   equals 1)
>>>
>>> The only code that resets orf_token_discard to 0 is:
>>> totemsrp.c:2008 - entered recovery state
>>>
>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>> COMMIT:
>>>
>>> totemsrp.c:4276 - memb_state_gather_enter called
>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>                   (orf_token_discard still equals 1)
>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>                   equals 1)
>>>
>>> Given that the orf token would always be discarded by processor 2 in the
>>> test case you gave, I don't see a scenario in the code where
>>> orf_token_rtr could crash in that test case. If you can show me how this
>>> happens, we could fix it up. One possibility is some sort of thread
>>> race. Could you determine whether orf_token_discard is equal to 0 when
>>> your failure occurs?
>>>
>>> As for entering operational prematurely: if this is a problem, it is a
>>> separate problem from the above analysis. Let's get the above analysis
>>> fixed first, if it is a problem.
>>>
>>> Really looking forward to hearing back from you so we can fix the
>>> problem you see.
>>>
>>> Regards
>>> -steve
>>>
>>>>> Thanks
>>>>> -steve
>>>>>
>>>>>
>>>>>> range = orf_token->seq - instance->my_aru;
>>>>>> assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>> ...
>>>>>>
>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>> ---
>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>> index 5a78962..22717ac 100644
>>>>>> --- a/exec/totemsrp.c
>>>>>> +++ b/exec/totemsrp.c
>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>  		break;
>>>>>>
>>>>>>  	case MEMB_STATE_RECOVERY:
>>>>>> -		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>> +		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>> +		    memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>  			log_printf (instance->totemsrp_log_level_debug,
>>>>>>  				"Sending initial ORF token\n");
>>>>>>

--
Yunkai Zhang
Work at Taobao

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss