I would recommend checking your tree against the latest 1.4.z stream.  We
_only_ fix bugs in z streams.  Generally 1.4.z has fixes for all of the
problems we are aware of.

Regards
-steve

On 12/13/2011 12:07 AM, Yunkai Zhang wrote:
> On Tue, Dec 13, 2011 at 1:50 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>> On 12/12/2011 09:37 AM, Yunkai Zhang wrote:
>>> On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>>>>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>>>>> it was waiting for the commit_token of the second rotation.  But when
>>>>>>>>> it got a commit_token, it did not check the status of this token,
>>>>>>>>> which could be one retransmitted by the upstream node on the first
>>>>>>>>> rotation, so the representative node could send the initial ORF token
>>>>>>>>> prematurely.
>>>>>>>>>
>>>>>>>>> If other nodes which have not yet entered the RECOVERY state receive
>>>>>>>>> this incorrect initial orf_token (whose seq is 0), they may crash on
>>>>>>>>> the assert instruction in the *orf_token_rtr* function:
>>>>>>>>> ...
>>>>>>>>
>>>>>>>> Yunkai,
>>>>>>>>
>>>>>>>> First off, thanks for digging into Totem.  I know the protocol is hard
>>>>>>>> to understand and it's nice to have more people to bounce ideas off of.
>>>>>>>>
>>>>>>>> Do you have a test case for this?
>>>>>>>>
>>>>>>>> The patch is a clever solution, but the root cause seems incorrect.
>>>>>>>> I'd like to get to the RCA of your bug.  Just thinking about some of
>>>>>>>> the problems you have reported in the past (13 seconds to execute a
>>>>>>>> recovery copy per node), it could be possible for a retransmission of
>>>>>>>> the commit token from 5 to 1 (kicking off a new orf token) in the
>>>>>>>> example below to occur while in recovery.  That should be ok though,
>>>>>>>> since retransmitted orf tokens are discarded (totemsrp:3564).  I don't
>>>>>>>> see any immediate way to have a premature token transmission.
>>>>>>>>
>>>>>>>> Let's examine 5 processors:
>>>>>>>>
>>>>>>>> When in gather, a membership is built.  Once this membership is agreed
>>>>>>>> upon, a commit token is originated by the ring rep, creating the first
>>>>>>>> transmission around the ring.
>>>>>>>>
>>>>>>>> totemsrp.c:4044
>>>>>>>>
>>>>>>>> Example with 5 processors:
>>>>>>>> 1,2,3,4,5 in gather
>>>>>>>> 1 originates commit token
>>>>>>>>
>>>>>>>> Once a commit token is received in the gather state, the processor
>>>>>>>> enters commit (first token rotation).
>>>>>>>>
>>>>>>>> totemsrp.c:4348
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> 1 goes to commit
>>>>>>>> 2 goes to commit
>>>>>>>> 3 goes to commit
>>>>>>>> 4 goes to commit
>>>>>>>> 5 goes to commit
>>>>>>>>
>>>>>>>> Once a commit token is received in the commit state, the processor
>>>>>>>> enters the recovery state, but only if this is a new commit token.
>>>>>>>>
>>>>>>>> totemsrp.c:4360
>>>>>>>>
>>>>>>>> 1 goes to recovery
>>>>>>>> 2 goes to recovery
>>>>>>>> 3 goes to recovery
>>>>>>>> 4 goes to recovery
>>>>>>>> 5 goes to recovery
>>>>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>>>>
>>>>>>>> totemsrp.c:4374
>>>>>>>>
>>>>>>>> In the above failure-free scenario, we see that we will never get a
>>>>>>>> premature orf token.
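
To make those two rotations a bit more concrete, here is a tiny standalone
model of the transitions described above.  This is deliberately *not*
totemsrp code, just an illustration: each processor moves GATHER -> COMMIT on
its first receipt of the commit token, COMMIT -> RECOVERY on the second, and
only then does the rep originate the orf token.

#include <stdio.h>

enum state { GATHER, COMMIT, RECOVERY };

int main (void)
{
	const char *name[] = { "GATHER", "COMMIT", "RECOVERY" };
	enum state st[5] = { GATHER, GATHER, GATHER, GATHER, GATHER };
	int rotation, i;

	/*
	 * First receipt of the commit token: GATHER -> COMMIT.
	 * Second receipt (a new commit token): COMMIT -> RECOVERY.
	 */
	for (rotation = 1; rotation <= 2; rotation++) {
		for (i = 0; i < 5; i++) {
			st[i] = (st[i] == GATHER) ? COMMIT : RECOVERY;
			printf ("rotation %d: processor %d enters %s\n",
				rotation, i + 1, name[st[i]]);
		}
	}

	/* Only after the second rotation does the rep originate the orf token. */
	printf ("processor 1 originates the initial orf token\n");
	return 0;
}
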
>>>>>>>>
>>>>>>>> Perhaps the problem you're running into is that you're mixing
>>>>>>>> partitioning/merging while forming a new ring.  This could result in
>>>>>>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>>>>>>> being discarded by the ring rep in the recovery state.
>>>>>>>>
>>>>>>>> I believe the code would be better served by something like:
>>>>>>>>
>>>>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>>     memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>>>>>         ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> Note it may be safer to compare the entire ring_id rather than only
>>>>>>>> the seq.
>>>>>>>>
>>>>>>>> This will cause any commit token not from the current agreed
>>>>>>>> processors to be rejected.  The ring seq increases by 4 for each new
>>>>>>>> ring and it is not possible to have a matching seq + rep in two
>>>>>>>> separate rings (this is why we store the ring seq to stable storage).
>>>>>>>>
>>>>>>>> Could you test that, or tell me your test case, or both?
>>>>>>>>
>>>>>>> Let me discuss my test case based on your example mentioned above:
>>>>>>>
>>>>>>> 1,2,3,4,5 in gather
>>>>>>>
>>>>>>> 1 originates commit token
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:0) to 2
>>>>>>>    => receives commit_token(token_seq:4) from 5
>>>>>>>    => enters RECOVERY state
>>>>>>>    => receives commit_token(token_seq:4) retransmitted by 5
>>>>>>>    => enters OPERATIONAL state *prematurely*!!
>>>>>>>    => sends initial ORF token to 2 *prematurely*!!
>>>>>>>
>>>>>>> 2 receives commit_token(token_seq:0) from 1
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:1) to 3
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>>>    => ...
>>>>>>>    => receives JoinMSG from 3
>>>>>>>    => enters GATHER state, ...
>>>>>>>    => receives initial ORF token from 1
>>>>>>>    => crashes in the *orf_token_rtr* function!!
>>>>>
>>>>> Today I have checked my logging file carefully again, and the
>>>>> description above was not accurate.  What actually happened on node 2
>>>>> was:
>>>>>
>>>>> 2 receives commit_token(token_seq:0) from 1
>>>>>    => enters commit state
>>>>>    => sends commit_token(token_seq:1) to 3
>>>>>    => receives commit_token(token_seq:6) from 1
>>>>>    => enters RECOVERY state, **sets orf_token_discard=0**
>>>>>    => commit token retransmit timeout occurs!!
>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>    => ...
>>>>>    => receives JoinMSG from 3 in RECOVERY state
>>>>>    => calls memb_recovery_state_token_loss
>>>>>    => old_ring_state_restore, sets instance->my_aru=39
>>>>>    => memb_state_gather_enter (instance, 5), **sets orf_token_discard=1**
>>>>>    => enters GATHER state
>>>>>    => calls memb_state_gather_enter (instance, 14), **sets orf_token_discard=1**
>>>>>    => enters GATHER state
>>>>>    => receives initial ORF token from 1
>>>>>    => crashes in the *orf_token_rtr* function!!
>>>>>
>>>>> I can confirm that the value of instance->memb_state was
>>>>> MEMB_STATE_GATHER(=2) when 2 received the initial ORF token.
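
As a side note, this trace also shows why the crash is so abrupt: node 2
still holds the old ring's aru (0x39, restored by old_ring_state_restore)
when the premature initial orf token arrives with seq 0, so the unsigned
subtraction in orf_token_rtr wraps around and the assert fires.  A standalone
illustration with those values (the QUEUE_RTR_ITEMS_SIZE_MAX value below is
only illustrative; the real one is defined in totemsrp.c):

#include <assert.h>
#include <stdio.h>

/* Illustrative value only -- the real one is defined in totemsrp.c. */
#define QUEUE_RTR_ITEMS_SIZE_MAX 16384

int main (void)
{
	unsigned int my_aru = 0x39;	/* old-ring aru restored on node 2 */
	unsigned int seq    = 0;	/* seq of the premature initial orf token */

	/* Same arithmetic as orf_token_rtr(): unsigned subtraction wraps. */
	unsigned int range = seq - my_aru;

	printf ("range = %u\n", range);			/* 4294967239 */
	assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);	/* fires and aborts */
	return 0;
}
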
>>>>> I use the following code for logging in the *message_handler_orf_token*
>>>>> function:
>>>>>
>>>>> =code=
>>>>> log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, "
>>>>>     "my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>>>>>     nodeid_to_ip(token->header.nodeid, buf), token->token_seq,
>>>>>     instance->my_token_seq, token->aru, instance->my_aru, token->seq,
>>>>>     instance->memb_state);
>>>>>
>>>>> =log output=
>>>>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token
>>>>> from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff,
>>>>> my_aru:39, seq:0, state:2
>>>>>
>>>>> It was the last log record written before corosync exited.
>>>>>
>>>>> But I can't believe that corosync did not discard the ORF token when
>>>>> instance->memb_state equals MEMB_STATE_GATHER (as soon as corosync
>>>>> enters the GATHER state, orf_token_discard is set to 1).
>>>>>
>>>>> A reasonable guess is that the value of orf_token_discard was cleared
>>>>> for some unknown reason -- maybe there is memory corruption.
>>>>
>>>> Possible, but I have another possibility I'd like you to try if you're
>>>> willing.
>>>>
>>>> First, can you check, with gdb, the core file that is created when there
>>>> is a crash in orf_token_rtr?  What you will need to do is use up/down to
>>>> find the stack frame that has "instance" in it, then print out instance:
>>>>
>>>> print *instance
>>>
>>> I forgot to print this info; it is a very good tip, thank you very much!
>>>
>>> I'm at home now, I will try it tomorrow.
>
> The value of instance->orf_token_discard is 1 and there is no memory
> corruption.  The root cause of this corosync crash is that our code
> handles join messages in net_deliver_fn.
>
>>>> This will print out a bunch of stuff, one of which will be the
>>>> orf_token_discard value.  (This will tell us if there is indeed memory
>>>> corruption or an error in our understanding of the code.)
>>>>
>>>> Second, run corosync-fplay.  This will play back the flight recorder.
>>>> One thing I definitely want to know is when corosync enters gather, and
>>>> who called it.  Just redirect the flight recorder output into a file and
>>>> attach it to an email.
>>>
>>> I had run the corosync-fplay tool when I observed this issue, but our
>>> cluster has been restarted again, and the old fdata file has been
>>> destroyed.
>>>
>>>> Third, I will send a patch which _may_ fix the problem.  Also, which
>>>> version of corosync do you have?
>>>
>>> Our code is based on corosync-v1.4.1, but I have backported the necessary
>>> patches from the current master branch.
>>
>> Do you have the following commit in your repo?  If not, all sorts of bad
>> things will happen with your test scenarios:
>>
>> commit be608c050247e5f9c8266b8a0f9803cc0a3dc881
>> Author: Steven Dake <sdake@xxxxxxxxxx>
>> Date:   Tue Aug 30 22:25:21 2011 -0700
>>
>>     Ignore memb_join messages during flush operations
>>
>>     a memb_join operation that occurs during flushing can result in an
>>     entry into the GATHER state from the RECOVERY state.  This results in
>>     the regular sort queue being used instead of the recovery sort queue,
>>     resulting in segfault.
>
> I have checked our code, and it is missing this important patch!
> Thanks so much!
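
For the archives, the core-file inspection I suggested above looks roughly
like this; the binary path, core file name and frame number are just
placeholders and will differ on your system:

$ gdb /usr/sbin/corosync core.<pid>
(gdb) bt                                # find a frame that has 'instance'
(gdb) frame 3                           # pick that frame; the number will differ
(gdb) print instance->orf_token_discard
(gdb) print *instance
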
>
>>     Signed-off-by: Steven Dake <sdake@xxxxxxxxxx>
>>     Reviewed-by: Jan Friesse <jfriesse@xxxxxxxxxx>
>>     (cherry picked from commit 48ffa8892daac18935d96ae46a72aebe2fb70430)
>>
>> Regards
>> -steve
>>
>>>> Regards
>>>> -steve
>>>>
>>>>>>> 3 receives commit_token(token_seq:1) from 2
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:2) to 4
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:2) ...
>>>>>>>    => retransmits commit_token(token_seq:2) ...
>>>>>>>    => ...
>>>>>>>    => commit token lost timeout occurs!!
>>>>>>>    => enters GATHER state, multicasts JoinMSG ...
>>>>>>>
>>>>>>> 4 receives commit_token(token_seq:2) from 3
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:3) to 5
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:3) to 5 ...
>>>>>>>    => retransmits commit_token(token_seq:3) to 5 ...
>>>>>>>    => ...
>>>>>>>
>>>>>>> 5 receives commit_token(token_seq:3) from 4
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:4) to 1
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:4) to 1 ...
>>>>>>>    => retransmits commit_token(token_seq:4) to 1 ...
>>>>>>>    => ...
>>>>>>>
>>>>>>> The key point in my test case is that one of the processors (3 in this
>>>>>>> case) runs into the commit token timeout, while 1 receives the commit
>>>>>>> token retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>>>>
>>>>>> Yunkai,
>>>>>>
>>>>>> I took a look at the code with a fresh set of eyes this morning for 3
>>>>>> hours and don't see how it is possible for processor 2 to crash in
>>>>>> orf_token_rtr with the above scenario.  The orf token will be discarded
>>>>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>>>>> orf_token_rtr should *never* run in the GATHER or COMMIT states of the
>>>>>> processor.
>>>>>>
>>>>>> (flatiron-1.4 line numbers)
>>>>>>
>>>>>> On startup:
>>>>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>>>>> totemsrp.c:4467 - when the network interface is detected, gather_enter
>>>>>>                   is called; state = gather, orf_token_discard = 1
>>>>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>>>>
>>>>>> We can proceed to enter OPERATIONAL here or go directly to COMMIT.
>>>>>>
>>>>>> COMMIT path:
>>>>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>>>>                   (orf_token_discard still equals 1)
>>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>>>                   (orf_token_discard still equals 1)
>>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>>>                   equals 1)
>>>>>>
>>>>>> The only code that resets orf_token_discard to 0 is:
>>>>>> totemsrp.c:2008 - entered recovery state
>>>>>>
>>>>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>>>>> COMMIT:
>>>>>>
>>>>>> totemsrp.c:4276 - memb_state_gather_enter called
>>>>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>>>                   (orf_token_discard still equals 1)
>>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>>>                   equals 1)
>>>>>>
>>>>>> Given that the orf token would always be discarded by processor 2 in
>>>>>> the test case you gave, I don't see a scenario in the code where
>>>>>> orf_token_rtr could crash.  If you can show me how this happens, we can
>>>>>> fix this up.  One possibility is some sort of thread race.
>>>>>> Could you determine whether orf_token_discard is equal to 0 when your
>>>>>> failure occurs?
>>>>>>
>>>>>> As for entering operational prematurely: if this is a problem, it is a
>>>>>> separate problem from the above analysis.  Let's get the above analysis
>>>>>> fixed if it is a problem.
>>>>>>
>>>>>> Really looking forward to hearing back from you so we can fix the
>>>>>> problem you see.
>>>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>>>> Thanks
>>>>>>>> -steve
>>>>>>>>
>>>>>>>>>     range = orf_token->seq - instance->my_aru;
>>>>>>>>>     assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>>>>>     ...
>>>>>>>>>
>>>>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>>>>> index 5a78962..22717ac 100644
>>>>>>>>> --- a/exec/totemsrp.c
>>>>>>>>> +++ b/exec/totemsrp.c
>>>>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>>>>  		break;
>>>>>>>>>
>>>>>>>>>  	case MEMB_STATE_RECOVERY:
>>>>>>>>> -		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>>>>> +		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>>> +			memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>>>>  			log_printf (instance->totemsrp_log_level_debug,
>>>>>>>>>  				"Sending initial ORF token\n");
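
As a closing note, the alternative I suggested earlier in the thread --
gating the initial ORF token on the commit token carrying the ring_id of the
ring currently being formed, rather than on token_seq arithmetic -- would
look roughly like the sketch below.  It is untested, written against the
flatiron-1.4 layout of struct memb_ring_id (rep + seq), and assumes
my_ring_id has already been switched to the new ring by the time the
processor is in MEMB_STATE_RECOVERY:

	case MEMB_STATE_RECOVERY:
		/*
		 * Only the ring rep may send the initial ORF token, and only in
		 * response to a commit token for the ring currently being formed;
		 * stale commit tokens from an older membership are ignored.
		 */
		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
		    memcmp (&memb_commit_token->ring_id, &instance->my_ring_id,
		            sizeof (struct memb_ring_id)) == 0) {

			log_printf (instance->totemsrp_log_level_debug,
				"Sending initial ORF token\n");
			/* ... rest unchanged from the current code ... */
		}
		break;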