Re: [PATCH] Ignore the commit_token retransmitted by upstream node on the first rotation

On Tue, Dec 13, 2011 at 1:50 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 12/12/2011 09:37 AM, Yunkai Zhang wrote:
>> On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>>>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>>>> it was waiting for the commit_token on the second rotation. But when
>>>>>>>> it got a commit_token, it didn't check the status of this token,
>>>>>>>> which could have been retransmitted by the upstream node on the first
>>>>>>>> rotation, and so the representative node would send the initial ORF
>>>>>>>> token prematurely.
>>>>>>>>
>>>>>>>> If other nodes that have not yet entered the RECOVERY state receive
>>>>>>>> this incorrect initial orf_token (whose seq is 0), they may crash on
>>>>>>>> the assert instruction in the *orf_token_rtr* function:
>>>>>>>>   ...
>>>>>>>
>>>>>>> Yunkai,
>>>>>>>
>>>>>>> First off, thanks for digging into Totem.  I know the protocol is hard to
>>>>>>> understand and it's nice to have more people to bounce ideas off of.
>>>>>>>
>>>>>>> Do you have a test case for this?
>>>>>>>
>>>>>>> Patch is a clever solution but the root cause seems incorrect.  I'd like
>>>>>>> to get to the RCA of your bug.  Just thinking about some of the problems
>>>>>>> you have reported in the past (13 seconds to execute a recovery copy per
>>>>>>> node), it could be possible a retransmission of the commit token from 5
>>>>>>> to 1 (kicking off a new orf token) in the below example could occur
>>>>>>> while in recovery.  That should be ok though, since retransmitted orf
>>>>>>> tokens are discarded (totemsrp:3564).  I don't see any immediate way to
>>>>>>> have a premature token transmission.
>>>>>>>
>>>>>>> Let's examine 5 processors:
>>>>>>>
>>>>>>> When in gather, a membership is built.  Once this membership is agreed
>>>>>>> upon, a commit token is originated by the ring rep creating the first
>>>>>>> transmission around the ring.
>>>>>>>
>>>>>>> totemsrp.c:4044
>>>>>>>
>>>>>>> Example with 5 processors:
>>>>>>> 1,2,3,4,5 in gather
>>>>>>> 1 originates commit token
>>>>>>>
>>>>>>>
>>>>>>> Once a commit token is received in the gather state, the processor
>>>>>>> enters commit (first token rotation)
>>>>>>>
>>>>>>> totemsrp.c:4348
>>>>>>>
>>>>>>> Example:
>>>>>>> 1 goes to commit
>>>>>>> 2 goes to commit
>>>>>>> 3 goes to commit
>>>>>>> 4 goes to commit
>>>>>>> 5 goes to commit
>>>>>>>
>>>>>>> Once a commit token is received in the commit state, the processor
>>>>>>> enters the recovery state only if this is a new commit token
>>>>>>>
>>>>>>> totemsrp.c:4360
>>>>>>>
>>>>>>> 1 goes to recovery
>>>>>>> 2 goes to recovery
>>>>>>> 3 goes to recovery
>>>>>>> 4 goes to recovery
>>>>>>> 5 goes to recovery
>>>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>>>
>>>>>>> totemsrp.c:4374
>>>>>>>
>>>>>>>
>>>>>>> In the above failure free scenario, we see that we will never get a
>>>>>>> premature orf token.
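>>>>>>>
>>>>>>> As a toy model (plain C, not the actual totemsrp code; the processor
>>>>>>> count is hard-coded to 5), the two commit token rotations above can be
>>>>>>> sketched like this, and it prints the same sequence of transitions:
>>>>>>>
>>>>>>> #include <stdio.h>
>>>>>>>
>>>>>>> enum state { GATHER, COMMIT, RECOVERY };
>>>>>>>
>>>>>>> int main (void)
>>>>>>> {
>>>>>>>         enum state node[5] = { GATHER, GATHER, GATHER, GATHER, GATHER };
>>>>>>>         int hop;
>>>>>>>
>>>>>>>         /*
>>>>>>>          * The ring rep (node 1, index 0) originates the commit token.
>>>>>>>          * A receipt in GATHER moves a node to COMMIT (first rotation);
>>>>>>>          * a receipt in COMMIT moves it to RECOVERY (second rotation).
>>>>>>>          */
>>>>>>>         for (hop = 0; hop < 2 * 5; hop++) {
>>>>>>>                 int idx = hop % 5;
>>>>>>>
>>>>>>>                 if (node[idx] == GATHER) {
>>>>>>>                         node[idx] = COMMIT;
>>>>>>>                         printf ("%d goes to commit\n", idx + 1);
>>>>>>>                 } else if (node[idx] == COMMIT) {
>>>>>>>                         node[idx] = RECOVERY;
>>>>>>>                         printf ("%d goes to recovery\n", idx + 1);
>>>>>>>                 }
>>>>>>>         }
>>>>>>>
>>>>>>>         /*
>>>>>>>          * Only after every node has reached RECOVERY does the rep
>>>>>>>          * originate the initial orf token (start of the third rotation).
>>>>>>>          */
>>>>>>>         printf ("1 originates orf token\n");
>>>>>>>         return 0;
>>>>>>> }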
>>>>>>>
>>>>>>> Perhaps the problem you're running into is that you're mixing partitioning/merging
>>>>>>> while forming a new ring.  This could result in my_id.addr[0] and ring
>>>>>>> rep matching, but *old* commit tokens not being discarded by the ring
>>>>>>> rep in the recovery state.
>>>>>>>
>>>>>>> I believe the code would be better served by something like:
>>>>>>>
>>>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>    memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>> Note it may be safer to compare the entire ring_id rather than only the seq.
>>>>>>>
>>>>>>> This will cause any commit token not from the current agreed processors
>>>>>>> to be rejected.  The ring seq increases by 4 for each new ring and it is
>>>>>>> not possible to have a matching seq + rep in two separate rings (this is
>>>>>>> why we store the ring seq to stable storage).
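>>>>>>>
>>>>>>> Spelled out a bit more (untested, and the field names are from memory),
>>>>>>> comparing the whole ring_id rather than just the seq would look
>>>>>>> something like:
>>>>>>>
>>>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>     totemip_equal (&memb_commit_token->ring_id.rep, &instance->my_ring_id.rep) &&
>>>>>>>     memb_commit_token->ring_id.seq == instance->my_ring_id.seq) {
>>>>>>>         /* only a commit token for the currently agreed ring gets this far */
>>>>>>>         ...
>>>>>>> }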
>>>>>>>
>>>>>>> Could you test that or tell me your test case or both?
>>>>>>>
>>>>>>
>>>>>> Let me discuss my test case based on your example mentioned above:
>>>>>>
>>>>>> 1,2,3,4,5 in gather
>>>>>>
>>>>>> 1 originates commit token
>>>>>>   => enter commit state
>>>>>>     => send commit_token(token_seq:0) to 2
>>>>>>       => receive commit_token(token_seq:4) from 5
>>>>>>         => enter RECOVERY state
>>>>>>           => receive commit_token(token_seq:4) retransmitted by 5
>>>>>>             => enter OPERATIONAL state *prematurely*!!
>>>>>>               => send initial ORF token to 2 *prematurely*!!
>>>>>>
>>>>>> 2 receive commit_token(token_seq:0) from 1
>>>>>>   =>enter commit state
>>>>>>     => send commit_token(token_seq:1) to 3
>>>>>>       => commit token retransmit timeout occur!!
>>>>>>         => retransmit commit_token(token_seq:1) ...
>>>>>>         => retransmit commit_token(token_seq:1) ...
>>>>>>         => ...
>>>>>>           => receive JoinMSG from 3
>>>>>>             => enter GATHER state, ...
>>>>>>               => receive initial ORF token from 1
>>>>>>                 => crash in *orf_token_rtr* function!!
>>>>
>>>> Today I checked my logging file carefully again; the description above
>>>> was not quite accurate.  The actual sequence was:
>>>>
>>>> 2 receive commit_token(token_seq:0) from 1
>>>>    =>enter commit state
>>>>      => send commit_token(token_seq:1) to 3
>>>>        => receive commit_token(token_seq:6) from 1
>>>>          => enter RECOVERY state, **set orf_token_discard=0**
>>>>            => commit token retransmit timeout occur!!
>>>>              => retransmit commit_token(token_seq:1) ...
>>>>              => retransmit commit_token(token_seq:1) ...
>>>>              => ...
>>>>                => receive JoinMSG from 3 in RECOVERY state
>>>>                  => call memb_recovery_state_token_loss
>>>>                     => old_ring_state_restore, set instance->my_aru=39
>>>>                     => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
>>>>                       => enter GATHER state
>>>>                  => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
>>>>                     => enter GATHER state
>>>>                   => receive initial ORF token from 1
>>>>                      => crash in *orf_token_rtr* function!!
>>>>
>>>> I can confirm that the value of instance->memb_state was
>>>> MEMB_STATE_GATHER (=2) when 2 received the initial ORF token.  I used the
>>>> following code for logging in the *message_handler_orf_token* function:
>>>>
>>>> =code=
>>>> log_printf (LOG_DEBUG,
>>>>         "recv orf_token from:%s, token_seq:%d, my_token_seq:%d, "
>>>>         "aru:%x, my_aru:%x, seq:%x, state:%d",
>>>>         nodeid_to_ip (token->header.nodeid, buf),
>>>>         token->token_seq, instance->my_token_seq,
>>>>         token->aru, instance->my_aru, token->seq,
>>>>         instance->memb_state);
>>>> =log output=
>>>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token
>>>> from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff,
>>>> my_aru:39, seq:0, state:2
>>>>
>>>> It was the last log record before corosync exited.
>>>>
>>>> But I can't believe that corosync did not discard the ORF token when
>>>> instance->memb_state equals MEMB_STATE_GATHER (as soon as corosync
>>>> enters the GATHER state, orf_token_discard should be set to 1).
>>>>
>>>> A reasonable guess is that the value of orf_token_discard was cleared
>>>> for some unknown reason -- maybe there is memory corruption.
>>>>
>>>
>>> Possible, but I have another possibility I'd like you to try if you're willing.
>>>
>>> First, can you check with gdb the core file that is created when there
>>> is a crash in orf_token_rtr?  What you will need to do is use up/down to
>>> find the stack frame that has "instance" in it, then print out instance:
>>>
>>> print *instance
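>>>
>>> For example, a minimal session against the core might look like this
>>> (the frame number is only an illustration; pick whichever frame actually
>>> has "instance" in scope):
>>>
>>>   (gdb) bt
>>>   (gdb) frame 2
>>>   (gdb) print instance->orf_token_discard
>>>   (gdb) print *instance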
>>
>> I forgot to print this info, it is a very good tip, thank you very much!
>>
>> I'm at home now, I will try it tomorrow.

The value of instance->orf_token_discard is 1, so there is no memory
corruption. The root cause of this issue, which results in corosync
crashing, is that our code handles join messages in net_deliver_fn.

>>
>>>
>>> This will print out a bunch of stuff, one of which will be the
>>> orf_token_discard value.  (This will tell us if there is indeed memory
>>> corruption or an error in our understanding of the code.)
>>>
>>> Second, run corosync-fplay.  This will play back the flight recorder.
>>> One thing I definitely want to know is when corosync enters gather and
>>> who called it.  Just redirect the flight recorder output into a file and
>>> attach it to an email.
>>>
>>
>> I had run the corosync-fplay tool when I observed this issue. But our
>> cluster has been restarted again, and the old version of the fdata file
>> has been destroyed.
>>
>>> Third, I will send a patch which _may_ fix the problem.  Also, which
>>> version of corosync do you have?
>>>
>>
>> Our code is based on corosync-v1.4.1, but I have backported the necessary
>> patches from the current master branch.
>>
>
> Do you have the following commit in your repo?  If not, all sorts of bad
> things will happen with your test scenarios:
>
> commit be608c050247e5f9c8266b8a0f9803cc0a3dc881
> Author: Steven Dake <sdake@xxxxxxxxxx>
> Date:   Tue Aug 30 22:25:21 2011 -0700
>
>    Ignore memb_join messages during flush operations
>
>    a memb_join operation that occurs during flushing can result in an
>    entry into the GATHER state from the RECOVERY state.  This results in
>    the regular sort queue being used instead of the recovery sort queue,
>    resulting in segfault.

I have checked our code; it is missing this important patch!
Thanks so much!

>
>    Signed-off-by: Steven Dake <sdake@xxxxxxxxxx>
>    Reviewed-by: Jan Friesse <jfriesse@xxxxxxxxxx>
>    (cherry picked from commit 48ffa8892daac18935d96ae46a72aebe2fb70430)
>
> Regards
> -steve
>
>>> Regards
>>> -steve
>>>
>>>>>>
>>>>>> 3 receive commit_token(token_seq:1) from 2
>>>>>>   => enter commit state
>>>>>>     => send commit_token(token_seq:2) to 4
>>>>>>       => commit token retransmit timeout occur!!
>>>>>>         => retransmit commit_token(token_seq:2) ...
>>>>>>         => retransmit commit_token(token_seq:2) ...
>>>>>>         => ...
>>>>>>           => commit token lost timeout occur!!
>>>>>>             => enter GATHER state, multicast JoinMSG ...
>>>>>>
>>>>>> 4 receive commit_token(token_seq:2) from 3
>>>>>>   => enter commit state
>>>>>>     => send commit_token(token_seq:3) to 5
>>>>>>       => commit token retransmit timeout occur!!
>>>>>>         => retransmit commit_token(token_seq:3) to 5...
>>>>>>         => retransmit commit_token(token_seq:3) to 5...
>>>>>>         => ...
>>>>>>
>>>>>> 5 receive commit_token(token_seq:3) from 4
>>>>>>   => enter commit state
>>>>>>     => send commit_token(token_seq:4) to 1
>>>>>>       => commit token retransmit timeout occur!!
>>>>>>         => retransmit commit_token(token_seq:4) to 1...
>>>>>>         => retransmit commit_token(token_seq:4) to 1...
>>>>>>         => ...
>>>>>>
>>>>>> The key point in my test case is that one of the processors (3 in this
>>>>>> case) runs into a commit token timeout, while 1 receives the commit token
>>>>>> retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>>>>
>>>>> Yunkai,
>>>>>
>>>>> I took a look at the code with a fresh set of eyes this morning for 3
>>>>> hours and don't see how it is possible for processor 2 to crash in
>>>>> orf_token_rtr with the above scenario.  The orf token will be discarded
>>>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>>>> orf_token_rtr should *never* run in GATHER or COMMIT states of the
>>>>> processor.
>>>>>
>>>>> (flatiron-1.4 line numbers):
>>>>>
>>>>> On startup:
>>>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>>>> totemsrp.c:4467 - when network interface detected call gather_enter
>>>>>        state = gather, orf_token_discard = 1
>>>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>>>
>>>>> we can proceed to enter OPERATIONAL here or go directly to COMMIT
>>>>>
>>>>> COMMIT path:
>>>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>>> (orf_token_discard still equals 1)
>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>> (orf_token_discard still equals 1)
>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>> equals 1)
>>>>>
>>>>> The only code that resets orf_token_discard to 0 is
>>>>> totemsrp.c:2008 - entered recovery state
>>>>>
>>>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>>>> COMMIT:
>>>>>
>>>>> totemsrp.c:4276 - memb_state_gather_enter called
>>>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>> (orf_token_discard still equals 1)
>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>> equals 1)
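>>>>>
>>>>> To make the flag handling concrete, here is a toy model (plain C, not
>>>>> the totemsrp source; the totemsrp.c line numbers from above are only
>>>>> referenced in the comments) of the behaviour walked through above:
>>>>>
>>>>> #include <stdio.h>
>>>>>
>>>>> enum memb_state { OPERATIONAL, GATHER, COMMIT, RECOVERY };
>>>>>
>>>>> struct instance {
>>>>>         enum memb_state memb_state;
>>>>>         int orf_token_discard;
>>>>> };
>>>>>
>>>>> static void gather_enter (struct instance *i)
>>>>> {
>>>>>         i->memb_state = GATHER;
>>>>>         i->orf_token_discard = 1;       /* cf. totemsrp.c:1870 */
>>>>> }
>>>>>
>>>>> static void commit_enter (struct instance *i)
>>>>> {
>>>>>         i->memb_state = COMMIT;         /* flag left untouched */
>>>>> }
>>>>>
>>>>> static void recovery_enter (struct instance *i)
>>>>> {
>>>>>         i->memb_state = RECOVERY;
>>>>>         i->orf_token_discard = 0;       /* cf. totemsrp.c:2008 */
>>>>> }
>>>>>
>>>>> static void handle_orf_token (struct instance *i)
>>>>> {
>>>>>         if (i->orf_token_discard) {     /* cf. totemsrp.c:3428 */
>>>>>                 printf ("state %d: orf token discarded\n", i->memb_state);
>>>>>                 return;
>>>>>         }
>>>>>         printf ("state %d: orf token processed\n", i->memb_state);
>>>>> }
>>>>>
>>>>> int main (void)
>>>>> {
>>>>>         struct instance i = { OPERATIONAL, 0 };
>>>>>
>>>>>         gather_enter (&i);      /* join received          -> GATHER   */
>>>>>         handle_orf_token (&i);  /* discarded                          */
>>>>>         commit_enter (&i);      /* commit token in GATHER -> COMMIT   */
>>>>>         handle_orf_token (&i);  /* still discarded                    */
>>>>>         recovery_enter (&i);    /* new commit token       -> RECOVERY */
>>>>>         handle_orf_token (&i);  /* processed                          */
>>>>>         gather_enter (&i);      /* join during RECOVERY   -> GATHER   */
>>>>>         handle_orf_token (&i);  /* discarded again                    */
>>>>>         return 0;
>>>>> }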
>>>>>
>>>>> Given that the orf token would always be discarded by processor 2 in the
>>>>> above test case you gave, I don't see a scenario in the code where
>>>>> orf_token_rtr could crash in that test case.  If you can show me how
>>>>> this happens, we could fix this up.  One possibility is some sort of
>>>>> thread race.  Could you determine that orf_token_discard is equal to 0
>>>>> when your failure occurs?
>>>>>
>>>>> As for entering operational prematurely:
>>>>> If this is a problem, it is a separate problem from the above analysis.
>>>>>  Let's get the above analysis fixed if it is a problem.
>>>>>
>>>>> Really looking forward to hearing back from you so we can fix the
>>>>> problem you see.
>>>>>
>>>>> Regards
>>>>> -steve
>>>>>
>>>>>>> Thanks
>>>>>>> -steve
>>>>>>>
>>>>>>>
>>>>>>>>   range = orf_token->seq - instance->my_aru;
>>>>>>>>   assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>>>>   ...
>>>>>>>>
>>>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>>>> index 5a78962..22717ac 100644
>>>>>>>> --- a/exec/totemsrp.c
>>>>>>>> +++ b/exec/totemsrp.c
>>>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>>>                       break;
>>>>>>>>
>>>>>>>>               case MEMB_STATE_RECOVERY:
>>>>>>>> -                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>>>> +                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>> +                             memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>>>                               log_printf (instance->totemsrp_log_level_debug,
>>>>>>>>                                       "Sending initial ORF token\n");
>>>>>>>>
>>>>>>>



-- 
Yunkai Zhang
Work at Taobao
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


