On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>> it waited for the commit_token on the second rotation. But when it
>>>>>> got a commit_token, it didn't check the status of that token, which
>>>>>> could have been retransmitted by the upstream node on the first
>>>>>> rotation, so the representative node would send the initial ORF
>>>>>> token prematurely.
>>>>>>
>>>>>> If other nodes which have not yet entered the RECOVERY state receive
>>>>>> this incorrect initial orf_token (whose seq is 0), those nodes may
>>>>>> crash on the assert instruction in the *orf_token_rtr* function:
>>>>>> ...
>>>>>
>>>>> Yunkai,
>>>>>
>>>>> First off, thanks for digging into Totem. I know the protocol is hard
>>>>> to understand and it's nice to have more people to bounce ideas off of.
>>>>>
>>>>> Do you have a test case for this?
>>>>>
>>>>> The patch is a clever solution, but the root cause seems incorrect. I'd
>>>>> like to get to the RCA of your bug. Just thinking about some of the
>>>>> problems you have reported in the past (13 seconds to execute a recovery
>>>>> copy per node), it could be possible that a retransmission of the commit
>>>>> token from 5 to 1 (kicking off a new orf token) in the example below
>>>>> could occur while in recovery. That should be ok though, since
>>>>> retransmitted orf tokens are discarded (totemsrp.c:3564). I don't see
>>>>> any immediate way to have a premature token transmission.
>>>>>
>>>>> Let's examine 5 processors:
>>>>>
>>>>> When in gather, a membership is built. Once this membership is agreed
>>>>> upon, a commit token is originated by the ring rep, creating the first
>>>>> transmission around the ring.
>>>>>
>>>>> totemsrp.c:4044
>>>>>
>>>>> Example with 5 processors:
>>>>> 1,2,3,4,5 in gather
>>>>> 1 originates commit token
>>>>>
>>>>> Once a commit token is received in the gather state, the processor
>>>>> enters commit (first token rotation):
>>>>>
>>>>> totemsrp.c:4348
>>>>>
>>>>> Example:
>>>>> 1 goes to commit
>>>>> 2 goes to commit
>>>>> 3 goes to commit
>>>>> 4 goes to commit
>>>>> 5 goes to commit
>>>>>
>>>>> Once a commit token is received in the commit state, the processor
>>>>> enters the recovery state, but only if this is a new commit token:
>>>>>
>>>>> totemsrp.c:4360
>>>>>
>>>>> 1 goes to recovery
>>>>> 2 goes to recovery
>>>>> 3 goes to recovery
>>>>> 4 goes to recovery
>>>>> 5 goes to recovery
>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>
>>>>> totemsrp.c:4374
>>>>>
>>>>> In the above failure-free scenario, we see that we will never get a
>>>>> premature orf token.
>>>>>
>>>>> Perhaps the problem you're running into is that you're mixing
>>>>> partitioning and merging while forming a new ring. This could result in
>>>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>>>> being discarded by the ring rep in the recovery state.
>>>>>
>>>>> I believe the code would be better served by something like:
>>>>>
>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>     memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>> ...
>>>>> }
>>>>>
>>>>> Note it may be safer to compare the entire ring_id rather than only the
>>>>> seq.
>>>>>
>>>>> This will cause any commit token not from the current agreed processors
>>>>> to be rejected. The ring seq increases by 4 for each new ring and it is
>>>>> not possible to have a matching seq + rep in two separate rings (this is
>>>>> why we store the ring seq to stable storage).
>>>>>
>>>>> Could you test that, or tell me your test case, or both?
>>>>>
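To make that guard concrete, here is a minimal standalone sketch of the idea
(all types and field names below are simplified stand-ins for illustration
only, not the real totemsrp structures): only the ring rep, and only a commit
token carrying the seq of the ring currently being formed, should trigger the
initial ORF token.

/* model_commit_guard.c - simplified model of the proposed check */
#include <stdio.h>

struct model_totem_ip {
    unsigned int nodeid;                /* stand-in for a totem_ip_address */
};

struct model_ring_id {
    struct model_totem_ip rep;          /* ring representative */
    unsigned long long seq;             /* ring seq, +4 for each new ring */
};

struct model_commit_token {
    struct model_ring_id ring_id;       /* ring this commit token belongs to */
};

struct model_instance {
    struct model_totem_ip my_id;
    struct model_ring_id my_ring_id;    /* ring currently being formed */
};

static int model_totemip_equal (const struct model_totem_ip *a,
    const struct model_totem_ip *b)
{
    return a->nodeid == b->nodeid;
}

/*
 * Return 1 when this processor should originate the initial ORF token:
 * it must be the ring rep, and the commit token must belong to the ring
 * currently being formed (comparing the whole ring_id would be stricter).
 */
static int should_send_initial_orf_token (const struct model_instance *inst,
    const struct model_commit_token *tok)
{
    return model_totemip_equal (&inst->my_id, &inst->my_ring_id.rep) &&
        tok->ring_id.seq == inst->my_ring_id.seq;
}

int main (void)
{
    struct model_instance inst = { { 1 }, { { 1 }, 44 } };
    struct model_commit_token stale = { { { 1 }, 40 } };  /* previous ring */
    struct model_commit_token fresh = { { { 1 }, 44 } };  /* current ring */

    printf ("stale commit token -> send initial orf? %d\n",
        should_send_initial_orf_token (&inst, &stale));
    printf ("fresh commit token -> send initial orf? %d\n",
        should_send_initial_orf_token (&inst, &fresh));
    return 0;
}

With the ring seq increasing by 4 per ring, a commit token left over from the
previous ring (seq 40 above) is rejected, while the current ring's token
(seq 44) is accepted.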
>>>>
>>>> Let me discuss my test case based on the example you mentioned above:
>>>>
>>>> 1,2,3,4,5 in gather
>>>>
>>>> 1 originates commit token
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:0) to 2
>>>> => receive commit_token(token_seq:4) from 5
>>>> => enter RECOVERY state
>>>> => receive commit_token(token_seq:4) retransmitted by 5
>>>> => enter OPERATIONAL state *prematurely*!!
>>>> => send initial ORF token to 2 *prematurely*!!
>>>>
>>>> 2 receives commit_token(token_seq:0) from 1
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:1) to 3
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:1) ...
>>>> => retransmit commit_token(token_seq:1) ...
>>>> => ...
>>>> => receive JoinMSG from 3
>>>> => enter GATHER state, ...
>>>> => receive initial ORF token from 1
>>>> => crash in the *orf_token_rtr* function!!
>>
>> Today I checked my logging file carefully again; this description was
>> not entirely accurate. The actual sequence was:
>>
>> 2 receives commit_token(token_seq:0) from 1
>> => enter COMMIT state
>> => send commit_token(token_seq:1) to 3
>> => receive commit_token(token_seq:6) from 1
>> => enter RECOVERY state, **set orf_token_discard=0**
>> => commit token retransmit timeout occurs!!
>> => retransmit commit_token(token_seq:1) ...
>> => retransmit commit_token(token_seq:1) ...
>> => ...
>> => receive JoinMSG from 3 in RECOVERY state
>> => call memb_recovery_state_token_loss
>> => old_ring_state_restore, set instance->my_aru=39
>> => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
>> => enter GATHER state
>> => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
>> => enter GATHER state
>> => receive initial ORF token from 1
>> => crash in the *orf_token_rtr* function!!
>>
>> I can confirm that the value of instance->memb_state was
>> MEMB_STATE_GATHER(=2) when 2 received the initial ORF token. I used the
>> following code for logging in the *message_handler_orf_token* function:
>>
>> =code=
>> log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>>     nodeid_to_ip(token->header.nodeid, buf),
>>     token->token_seq, instance->my_token_seq,
>>     token->aru, instance->my_aru, token->seq,
>>     instance->memb_state);
>>
>> =log output=
>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token
>> from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff,
>> my_aru:39, seq:0, state:2
>>
>> It was the last log record before corosync exited.
>>
>> But I can't believe that corosync did not discard the ORF token when
>> instance->memb_state equalled MEMB_STATE_GATHER (as soon as corosync
>> enters the GATHER state, orf_token_discard is set to 1).
>>
>> A reasonable guess is that the value of orf_token_discard was cleared
>> for some unknown reason -- maybe there is memory corruption.
>>
>
> Possible, but I have another possibility I'd like you to try, if you're
> willing.
>
> First, can you check the core file with gdb that is created when there
> is a crash in orf_token_rtr? What you will need to do is use up/down to
> find the stack frame that has "instance" in it, then print out instance:
>
> print *instance
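In concrete terms, such a gdb session would look roughly like this (the
binary and core file names are only placeholders; repeat "up" until a
totemsrp frame containing "instance" is selected):

  $ gdb /usr/sbin/corosync core
  (gdb) bt
  (gdb) up
  (gdb) print *instance
  (gdb) print instance->orf_token_discard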
I forgot to print this info, it is a very good tip, thank you very much!
I'm at home now, I will try it tomorrow.

> This will print out a bunch of stuff, one of which will be the
> orf_token_discard value. (This will tell us if there is indeed memory
> corruption or an error in our understanding of the code.)
>
> Second, run corosync-fplay. This will play back the flight recorder.
> One thing I definitely want to know is, when corosync enters gather, who
> called it. Just redirect the flight recorder output into a file and
> attach it to an email.
>

I had run the corosync-fplay tool when I observed this issue, but our
cluster has since been restarted and the old fdata file has been
destroyed.

> Third, I will send a patch which _may_ fix the problem. Also, which
> version of corosync do you have?
>

Our code is based on corosync v1.4.1, but I have backported the necessary
patches from the current master branch.

> Regards
> -steve
>
>>>>
>>>> 3 receives commit_token(token_seq:1) from 2
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:2) to 4
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:2) ...
>>>> => retransmit commit_token(token_seq:2) ...
>>>> => ...
>>>> => commit token lost timeout occurs!!
>>>> => enter GATHER state, multicast JoinMSG ...
>>>>
>>>> 4 receives commit_token(token_seq:2) from 3
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:3) to 5
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:3) to 5 ...
>>>> => retransmit commit_token(token_seq:3) to 5 ...
>>>> => ...
>>>>
>>>> 5 receives commit_token(token_seq:3) from 4
>>>> => enter COMMIT state
>>>> => send commit_token(token_seq:4) to 1
>>>> => commit token retransmit timeout occurs!!
>>>> => retransmit commit_token(token_seq:4) to 1 ...
>>>> => retransmit commit_token(token_seq:4) to 1 ...
>>>> => ...
>>>>
>>>> The key point in my test case is that one of the processors (3 in this
>>>> case) runs into the commit token timeout, while 1 receives the commit
>>>> token retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>>
>>> Yunkai,
>>>
>>> I took a look at the code with a fresh set of eyes this morning for 3
>>> hours and don't see how it is possible for processor 2 to crash in
>>> orf_token_rtr with the above scenario. The orf token will be discarded
>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>> orf_token_rtr should *never* run in GATHER or COMMIT states of the
>>> processor.
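As a tiny standalone model of that claim (simplified stand-in types and an
illustrative constant, not the real totemsrp.c code): while orf_token_discard
is set, the handler drops the token before orf_token_rtr can run; if the flag
were ever 0 while my_aru is ahead of the token's seq, the unsigned
subtraction wraps around and the assert fires, which is exactly the
seq:0 / my_aru:0x39 situation in the log record above.

/* model_orf_discard.c - simplified model of the discard/assert interaction */
#include <assert.h>
#include <stdio.h>

#define QUEUE_RTR_ITEMS_SIZE_MAX 16384      /* illustrative value only */

struct model_instance {
    int orf_token_discard;          /* 1 in GATHER/COMMIT, 0 only in RECOVERY */
    unsigned int my_aru;
};

struct model_orf_token {
    unsigned int seq;
};

static void model_orf_token_rtr (struct model_instance *inst,
    const struct model_orf_token *tok)
{
    unsigned int range = tok->seq - inst->my_aru;   /* wraps if seq < my_aru */

    assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);      /* the assert that fired */
    /* ... retransmission handling elided ... */
}

static void model_message_handler_orf_token (struct model_instance *inst,
    const struct model_orf_token *tok)
{
    if (inst->orf_token_discard) {
        return;                     /* token dropped: rtr never runs */
    }
    model_orf_token_rtr (inst, tok);
}

int main (void)
{
    struct model_orf_token initial = { 0 };             /* seq 0 */
    struct model_instance in_gather = { 1, 0x39 };      /* discard flag set */
    struct model_instance flag_lost = { 0, 0x39 };      /* discard flag lost */

    model_message_handler_orf_token (&in_gather, &initial);
    printf ("token discarded while in GATHER, as expected\n");

    /* 0 - 0x39 wraps to a huge unsigned value, so the assert aborts here */
    model_message_handler_orf_token (&flag_lost, &initial);
    return 0;
}

So the open question is how node 2's orf_token_discard could have been 0
while memb_state was already back in GATHER.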
>>>
>>> (flatiron-1.4 line numbers):
>>>
>>> On startup:
>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>> totemsrp.c:4467 - when the network interface is detected, call
>>>                   gather_enter; state = gather, orf_token_discard = 1
>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>
>>> We can proceed to enter OPERATIONAL here or go directly to COMMIT.
>>>
>>> COMMIT path:
>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>                   (orf_token_discard still equals 1)
>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>                   (orf_token_discard still equals 1)
>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>                   equals 1)
>>>
>>> The only code that resets orf_token_discard to 0 is:
>>> totemsrp.c:2008 - entered recovery state
>>>
>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>> COMMIT:
>>>
>>> totemsrp.c:4276 - memb_state_gather_enter called
>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>                   (orf_token_discard still equals 1)
>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>                   equals 1)
>>>
>>> Given that the orf token would always be discarded by processor 2 in the
>>> test case you gave, I don't see a scenario in the code where
>>> orf_token_rtr could crash in that test case. If you can show me how this
>>> happens, we could fix it up. One possibility is some sort of thread
>>> race. Could you determine whether orf_token_discard is equal to 0 when
>>> your failure occurs?
>>>
>>> As for entering operational prematurely: if this is a problem, it is a
>>> separate problem from the above analysis. Let's get the above analysis
>>> fixed first, if it is a problem.
>>>
>>> Really looking forward to hearing back from you so we can fix the
>>> problem you see.
>>>
>>> Regards
>>> -steve
>>>
>>>>> Thanks
>>>>> -steve
>>>>>
>>>>>
>>>>>> range = orf_token->seq - instance->my_aru;
>>>>>> assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>> ...
>>>>>>
>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>> ---
>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>
>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>> index 5a78962..22717ac 100644
>>>>>> --- a/exec/totemsrp.c
>>>>>> +++ b/exec/totemsrp.c
>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>  		break;
>>>>>>
>>>>>>  	case MEMB_STATE_RECOVERY:
>>>>>> -		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>> +		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>> +		    memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>  			log_printf (instance->totemsrp_log_level_debug,
>>>>>>  				"Sending initial ORF token\n");
>>>>>>

--
Yunkai Zhang
Work at Taobao

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss