Re: [PATCH] Ignore the commit_token retransmitted by upstream node on the first rotation

On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>> After the representative of the new ring entered the RECOVERY state,
>>>> it was waiting for the commit_token on the second rotation. But when
>>>> it got a commit_token, it did not check the status of this token,
>>>> which could have been retransmitted by the upstream node on the first
>>>> rotation, and so the representative node could send the initial ORF
>>>> token prematurely.
>>>>
>>>> If other nodes which have not yet entered the RECOVERY state receive this
>>>> incorrect initial orf_token (whose seq is 0), these nodes may crash
>>>> on the assert instruction in the *orf_token_rtr* function:
>>>>   ...
>>>
>>> Yunkai,
>>>
>>> First off, thanks for digging into Totem.  I know the protocol is hard to
>>> understand and it's nice to have more people to bounce ideas off of.
>>>
>>> Do you have a test case for this?
>>>
>>> The patch is a clever solution, but the root cause seems incorrect.  I'd
>>> like to get to the RCA of your bug.  Just thinking about some of the
>>> problems you have reported in the past (13 seconds to execute a recovery
>>> copy per node), it is possible that a retransmission of the commit token
>>> from 5 to 1 (kicking off a new orf token) in the example below could occur
>>> while in recovery.  That should be OK though, since retransmitted orf
>>> tokens are discarded (totemsrp.c:3564).  I don't see any immediate way to
>>> get a premature token transmission.
>>>
>>> Let's examine 5 processors:
>>>
>>> When in gather, a membership is built.  Once this membership is agreed
>>> upon, a commit token is originated by the ring rep creating the first
>>> transmission around the ring.
>>>
>>> totemsrp.c:4044
>>>
>>> Example with 5 processors:
>>> 1,2,3,4,5 in gather
>>> 1 originates commit token
>>>
>>>
>>> Once a commit token is received in the gather state, the processor
>>> enters commit (first token rotation)
>>>
>>> totemsrp.c:4348
>>>
>>> Example:
>>> 1 goes to commit
>>> 2 goes to commit
>>> 3 goes to commit
>>> 4 goes to commit
>>> 5 goes to commit
>>>
>>> Once a commit token is received in the commit state, the processor
>>> enters the recovery state only if this is a new commit token
>>>
>>> totemsrp.c:4360
>>>
>>> 1 goes to recovery
>>> 2 goes to recovery
>>> 3 goes to recovery
>>> 4 goes to recovery
>>> 5 goes to recovery
>>> 1 originates orf token (this is the start of the third rotation)
>>>
>>> totemsrp.c:4374
>>>
>>>
>>> In the above failure free scenario, we see that we will never get a
>>> premature orf token.
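>>>
>>> For illustration only, the transitions above can be sketched as a tiny
>>> state machine (a schematic with made-up names, not the real totemsrp.c
>>> handlers):
>>>
>>> enum sketch_state { SK_GATHER, SK_COMMIT, SK_RECOVERY };
>>>
>>> /* what one processor does when a commit token arrives */
>>> static enum sketch_state on_commit_token (enum sketch_state state,
>>>         int token_is_new)
>>> {
>>>         switch (state) {
>>>         case SK_GATHER:
>>>                 return SK_COMMIT;       /* totemsrp.c:4348 - first rotation */
>>>         case SK_COMMIT:
>>>                 if (token_is_new) {
>>>                         return SK_RECOVERY;     /* totemsrp.c:4360 - second rotation */
>>>                 }
>>>                 return SK_COMMIT;       /* retransmitted token: stay in commit */
>>>         case SK_RECOVERY:
>>>                 /* only the ring rep originates the initial orf token
>>>                  * here (totemsrp.c:4374) */
>>>                 return SK_RECOVERY;
>>>         }
>>>         return state;
>>> }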
>>>
>>> Perhaps the problem you're running into is that you're mixing
>>> partitioning/merging while forming a new ring.  This could result in
>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>> being discarded by the ring rep in the recovery state.
>>>
>>> I believe the code would be better served by something like:
>>>
>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>    memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>> ...
>>> }
>>>
>>> Note it may be safer to compare the entire ring_id rather than only the seq.
>>>
>>> This will cause any commit token not from the current agreed processors
>>> to be rejected.  The ring seq increases by 4 for each new ring and it is
>>> not possible to have a matching seq + rep in two separate rings (this is
>>> why we store the ring seq to stable storage).
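>>>
>>> A rough sketch of how that stricter check could look in
>>> message_handler_memb_commit_token (take the field names as assumptions
>>> read off the surrounding code, not as a final patch):
>>>
>>>             case MEMB_STATE_RECOVERY:
>>>                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>                             totemip_equal (&memb_commit_token->ring_id.rep, &instance->my_ring_id.rep) &&
>>>                             memb_commit_token->ring_id.seq == instance->my_ring_id.seq) {
>>>                             log_printf (instance->totemsrp_log_level_debug,
>>>                                     "Sending initial ORF token\n");
>>>                             ...
>>>                     }
>>>                     break;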
>>>
>>> Could you test that or tell me your test case or both?
>>>
>>
>> Let me discuss my test case based on the example you mentioned above:
>>
>> 1,2,3,4,5 in gather
>>
>> 1 originates commit token
>>   => enter commit state
>>     => send commit_token(token_seq:0) to 2
>>       => receive commit_token(token_seq:4) from 5
>>         => enter RECOVERY state
>>           => receive commit_token(token_seq:4) retransmitted by 5
>>             => enter OPERATIONAL state *prematurely*!!
>>               => send initial ORF token to 2 *prematurely*!!
>>
>> 2 receive commit_token(token_seq:0) from 1
>>   => enter commit state
>>     => send commit_token(token_seq:1) to 3
>>       => commit token retransmit timeout occur!!
>>         => retransmit commit_token(token_seq:1) ...
>>         => retransmit commit_token(token_seq:1) ...
>>         => ...
>>           => receive JoinMSG from 3
>>             => enter GATHER state, ...
>>               => receive initial ORF token from 1
>>                 => crash in *orf_token_rtr* function!!

Today I have checked my log file carefully again, and the description
above was not correct. Here is the revised sequence for processor 2:

2 receive commit_token(token_seq:0) from 1
   => enter commit state
     => send commit_token(token_seq:1) to 3
       => receive commit_token(token_seq:6) from 1
         => enter RECOVERY state, **set orf_token_discard=0**
           => commit token retransmit timeout occur!!
             => retransmit commit_token(token_seq:1) ...
             => retransmit commit_token(token_seq:1) ...
             => ...
               => receive JoinMSG from 3 in RECOVERY state
                 => call memb_recovery_state_token_loss
                    => old_ring_state_restore, set instance->my_aru=39
                    => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
                      => enter GATHER state
                 => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
                    => enter GATHER state
                  => receive initial ORF token from 1
                     => crash in *orf_token_rtr* function!!

I can confirm that the value of instance->memb_state was
MEMB_STATE_GATHER (=2) when 2 received the initial ORF token. I used
the following code for logging in the *message_handler_orf_token* function:

=code=
log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, "
	"my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
	nodeid_to_ip (token->header.nodeid, buf),
	token->token_seq, instance->my_token_seq,
	token->aru, instance->my_aru,
	token->seq, instance->memb_state);
=log output=
Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token
from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff,
my_aru:39, seq:0, state:2

It was the last log record before corosync exited.

But I can hardly believe that corosync did not discard the ORF token when
instance->memb_state was equal to MEMB_STATE_GATHER (as soon as corosync
enters the GATHER state, orf_token_discard should be set to 1).

A reasonable guess is that the value of orf_token_discard was cleared
for some unknown reason -- maybe there is memory corruption somewhere.
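
To answer the question below about orf_token_discard directly, the same
debug logging could be extended to also print its value at that point; a
small sketch in the same style as the logging above:

=code=
/* sketch: also record orf_token_discard when the ORF token arrives,
 * to confirm whether it is really 0 at the moment of the crash */
log_printf (LOG_DEBUG, "recv orf_token: state:%d, orf_token_discard:%d",
	instance->memb_state, instance->orf_token_discard);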

>>
>> 3 receive commit_token(token_seq:1) from 2
>>   => enter commit state
>>     => send commit_token(token_seq:2) to 4
>>       => commit token retransmit timeout occur!!
>>         => retransmit commit_token(token_seq:2) ...
>>         => retransmit commit_token(token_seq:2) ...
>>         => ...
>>           => commit token lost timeout occur!!
>>             => enter GATHER state, multicast JoinMSG ...
>>
>> 4 receive commit_token(token_seq:2) from 3
>>   => enter commit state
>>     => send commit_token(token_seq:3) to 5
>>       => commit token retransmit timeout occur!!
>>         => retransmit commit_token(token_seq:3) to 5...
>>         => retransmit commit_token(token_seq:3) to 5...
>>         => ...
>>
>> 5 receive commit_token(token_seq:3) from 4
>>   => enter commit state
>>     => send commit_token(token_seq:4) to 1
>>       => commit token retransmit timeout occur!!
>>         => retransmit commit_token(token_seq:4) to 1...
>>         => retransmit commit_token(token_seq:4) to 1...
>>         => ...
>>
>> The key point in my test case is that one of the processors (3 in this
>> case) runs into the commit token timeout, while 1 receives a commit token
>> retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
> Yunkai,
>
> I took a look at the code with a fresh set of eyes this morning for 3
> hours and don't see how it is possible for processor 2 to crash in
> orf_token_rtr with the above scenario.  The orf token will be discarded
> in all cases in GATHER and COMMIT before executing orf_token_rtr.
> orf_token_rtr should *never* run in GATHER or COMMIT states of the
> processor.
>
> (flatiron-1.4 line numbers):
>
> On startup:
> totemsrp.c:657  - memb_state = OPERATIONAL
> totemsrp.c:4467 - when network interface detected call gather_enter
>        state = gather, orf_token_discard = 1
> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>
> we can proceed to enter OPERATIONAL here or go directly to COMMIT
>
> COMMIT path:
> totemsrp.c:4331 - COMMIT token received in GATHER state
> (orf_token_discard still equals 1)
> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
> (orf_token_discard still equals 1)
> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
> equals 1)
>
> The only code that resets orf_token_discard to 0 is
> totemsrp.c:2008 - entered recovery state
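>
> Paraphrasing the pieces above for reference (a sketch of the logic at
> those line numbers, not verbatim flatiron-1.4 code):
>
> /* memb_state_gather_enter(), around totemsrp.c:1870 */
> instance->orf_token_discard = 1;
>
> /* memb_state_recovery_enter(), around totemsrp.c:2008 */
> instance->orf_token_discard = 0;
>
> /* message_handler_orf_token(), around totemsrp.c:3428 */
> if (instance->orf_token_discard) {
>         return (0);     /* drop the token before orf_token_rtr can run */
> }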
>
> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
> COMMIT:
>
> totemsrp.c:4276 - memb_state_gather_enter called
> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
> (orf_token_discard still equals 1)
> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
> equals 1)
>
> Given that the orf token would always be discarded by processor 2 in the
> above test case you gave, I don't see a scenario in the code where
> orf_token_rtr could crash in that test case.  If you can show me how
> this happens, we could fix this up.  One possibility is some sort of
> thread race.  Could you determine that orf_token_discard is equal to 0
> when your failure occurs?
>
> As for entering operational prematurely:
> If this is a problem, it is a separate problem from the above analysis.
>  Let's get the above analysis fixed if it is a problem.
>
> Really looking forward to hearing back from you so we can fix the
> problem you see.
>
> Regards
> -steve
>
>>> Thanks
>>> -steve
>>>
>>>
>>>>   range = orf_token->seq - instance->my_aru;
>>>>   assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>   ...
>>>>
>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>> ---
>>>>  exec/totemsrp.c |    3 ++-
>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>
>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>> index 5a78962..22717ac 100644
>>>> --- a/exec/totemsrp.c
>>>> +++ b/exec/totemsrp.c
>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>                       break;
>>>>
>>>>               case MEMB_STATE_RECOVERY:
>>>> -                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>> +                     if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>> +                             memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>                               log_printf (instance->totemsrp_log_level_debug,
>>>>                                       "Sending initial ORF token\n");
>>>>
>>>
>>
>>
>>
>



-- 
Yunkai Zhang
Work at Taobao
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss


