On Tue, Dec 13, 2011 at 1:50 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
> On 12/12/2011 09:37 AM, Yunkai Zhang wrote:
>> On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>>>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>>>> it was waiting for the commit_token of the second rotation. But when
>>>>>>>> it got a commit_token, it did not check whether this token had been
>>>>>>>> retransmitted by the upstream node on the first rotation, so the
>>>>>>>> representative node could send the initial ORF token prematurely.
>>>>>>>>
>>>>>>>> If other nodes which have not yet entered the RECOVERY state receive
>>>>>>>> this incorrect initial orf_token (whose seq is 0), they may crash on
>>>>>>>> the assert in the *orf_token_rtr* function:
>>>>>>>> ...
>>>>>>>
>>>>>>> Yunkai,
>>>>>>>
>>>>>>> First off, thanks for digging into Totem. I know the protocol is hard to
>>>>>>> understand and it's nice to have more people to bounce ideas off of.
>>>>>>>
>>>>>>> Do you have a test case for this?
>>>>>>>
>>>>>>> The patch is a clever solution, but the root cause seems incorrect. I'd
>>>>>>> like to get to the RCA of your bug. Just thinking about some of the
>>>>>>> problems you have reported in the past (13 seconds to execute a recovery
>>>>>>> copy per node), it is possible that a retransmission of the commit token
>>>>>>> from 5 to 1 (kicking off a new orf token) in the example below could
>>>>>>> occur while in recovery. That should be ok though, since retransmitted
>>>>>>> orf tokens are discarded (totemsrp:3564). I don't see any immediate way
>>>>>>> to get a premature token transmission.
>>>>>>>
>>>>>>> Let's examine 5 processors:
>>>>>>>
>>>>>>> When in gather, a membership is built. Once this membership is agreed
>>>>>>> upon, a commit token is originated by the ring rep, creating the first
>>>>>>> transmission around the ring.
>>>>>>>
>>>>>>> totemsrp.c:4044
>>>>>>>
>>>>>>> Example with 5 processors:
>>>>>>> 1,2,3,4,5 in gather
>>>>>>> 1 originates commit token
>>>>>>>
>>>>>>> Once a commit token is received in the gather state, the processor
>>>>>>> enters commit (first token rotation)
>>>>>>>
>>>>>>> totemsrp.c:4348
>>>>>>>
>>>>>>> Example:
>>>>>>> 1 goes to commit
>>>>>>> 2 goes to commit
>>>>>>> 3 goes to commit
>>>>>>> 4 goes to commit
>>>>>>> 5 goes to commit
>>>>>>>
>>>>>>> Once a commit token is received in the commit state, the processor
>>>>>>> enters the recovery state only if this is a new commit token
>>>>>>>
>>>>>>> totemsrp.c:4360
>>>>>>>
>>>>>>> 1 goes to recovery
>>>>>>> 2 goes to recovery
>>>>>>> 3 goes to recovery
>>>>>>> 4 goes to recovery
>>>>>>> 5 goes to recovery
>>>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>>>
>>>>>>> totemsrp.c:4374
>>>>>>>
>>>>>>> In the above failure-free scenario, we see that we will never get a
>>>>>>> premature orf token.
>>>>>>>
>>>>>>> Perhaps the problem you're running into is that you're mixing
>>>>>>> partitioning/merging while forming a new ring. This could result in
>>>>>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>>>>>> being discarded by the ring rep in the recovery state.
>>>>>>>
>>>>>>> I believe the code would be better served by something like:
>>>>>>>
>>>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>     memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>> Note it may be safer to compare the entire ring_id rather than only the seq.
>>>>>>>
>>>>>>> This will cause any commit token not from the current agreed processors
>>>>>>> to be rejected. The ring seq increases by 4 for each new ring and it is
>>>>>>> not possible to have a matching seq + rep in two separate rings (this is
>>>>>>> why we store the ring seq to stable storage).
>>>>>>>
>>>>>>> Could you test that or tell me your test case or both?
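A minimal, self-contained sketch of the comparison described above -- rejecting a commit token whose full ring identity (representative address plus ring seq) does not match the ring the processor currently agrees on. The struct layout and helper name below are simplified stand-ins for illustration only, not the actual corosync definitions:

#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins; the real totem structures carry more fields. */
struct ip_addr_sketch {
        unsigned short family;
        unsigned char  addr[16];
};

struct ring_id_sketch {
        struct ip_addr_sketch rep;  /* ring representative */
        unsigned long long    seq;  /* ring sequence, increases by 4 per new ring */
};

/*
 * Accept a commit token only when both the representative and the ring
 * sequence match the current ring.  Comparing the whole identity, not
 * just the seq, rejects stale tokens left over from an earlier
 * partition/merge.
 */
static bool commit_token_matches_current_ring (
        const struct ring_id_sketch *token_ring_id,
        const struct ring_id_sketch *my_ring_id)
{
        return (memcmp (&token_ring_id->rep, &my_ring_id->rep,
                        sizeof (token_ring_id->rep)) == 0 &&
                token_ring_id->seq == my_ring_id->seq);
}

A check of this shape would gate the "Sending initial ORF token" branch shown in the patch at the bottom of this thread.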
>>>>>>
>>>>>> Let me discuss my test case based on your example mentioned above:
>>>>>>
>>>>>> 1,2,3,4,5 in gather
>>>>>>
>>>>>> 1 originates commit token
>>>>>>    => enter commit state
>>>>>>    => send commit_token(token_seq:0) to 2
>>>>>>    => receive commit_token(token_seq:4) from 5
>>>>>>    => enter RECOVERY state
>>>>>>    => receive commit_token(token_seq:4) retransmitted by 5
>>>>>>    => enter OPERATIONAL state *prematurely*!!
>>>>>>    => send initial ORF token to 2 *prematurely*!!
>>>>>>
>>>>>> 2 receive commit_token(token_seq:0) from 1
>>>>>>    => enter commit state
>>>>>>    => send commit_token(token_seq:1) to 3
>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>    => retransmit commit_token(token_seq:1) ...
>>>>>>    => retransmit commit_token(token_seq:1) ...
>>>>>>    => ...
>>>>>>    => receive JoinMSG from 3
>>>>>>    => enter GATHER state, ...
>>>>>>    => receive initial ORF token from 1
>>>>>>    => crash in *orf_token_rtr* function!!
>>>>
>>>> Today I checked my logging file carefully again; the description above
>>>> was not entirely accurate:
>>>>
>>>> 2 receive commit_token(token_seq:0) from 1
>>>>    => enter commit state
>>>>    => send commit_token(token_seq:1) to 3
>>>>    => receive commit_token(token_seq:6) from 1
>>>>    => enter RECOVERY state, **set orf_token_discard=0**
>>>>    => commit token retransmit timeout occurs!!
>>>>    => retransmit commit_token(token_seq:1) ...
>>>>    => retransmit commit_token(token_seq:1) ...
>>>>    => ...
>>>>    => receive JoinMSG from 3 in RECOVERY state
>>>>    => call memb_recovery_state_token_loss
>>>>    => old_ring_state_restore, set instance->my_aru=39
>>>>    => memb_state_gather_enter (instance, 5), **set orf_token_discard=1**
>>>>    => enter GATHER state
>>>>    => call memb_state_gather_enter (instance, 14), **set orf_token_discard=1**
>>>>    => enter GATHER state
>>>>    => receive initial ORF token from 1
>>>>    => crash in *orf_token_rtr* function!!
>>>>
>>>> I can confirm that the value of instance->memb_state was
>>>> MEMB_STATE_GATHER(=2) when 2 received the initial ORF token. I used the
>>>> following code for logging in the *message_handler_orf_token* function:
>>>>
>>>> =code=
>>>> log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>>>>     nodeid_to_ip(token->header.nodeid, buf),
>>>>     token->token_seq, instance->my_token_seq,
>>>>     token->aru, instance->my_aru, token->seq,
>>>>     instance->memb_state);
>>>>
>>>> =log output=
>>>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff, my_aru:39, seq:0, state:2
>>>>
>>>> It was the last log record before corosync exited.
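As a side note on why those logged values are fatal: orf_token_rtr computes the retransmission range from the token's seq and the local aru, and when an initial ORF token with seq 0 arrives while my_aru has already been restored to 0x39, the subtraction (done in an unsigned type, as the firing assert implies) wraps around and trips the assert quoted in the original patch description. A minimal sketch of that arithmetic, using an illustrative value for the queue limit rather than the real constant:

#include <assert.h>
#include <stdio.h>

#define QUEUE_RTR_ITEMS_SIZE_MAX 16384  /* illustrative value, not the real constant */

int main (void)
{
        unsigned int token_seq = 0x0;   /* seq carried by the premature initial ORF token */
        unsigned int my_aru = 0x39;     /* aru restored by old_ring_state_restore */

        /*
         * Same shape as the quoted snippet: when token_seq < my_aru the
         * unsigned subtraction wraps to a huge value and the assert fires.
         */
        unsigned int range = token_seq - my_aru;

        printf ("range = %u\n", range); /* 4294967239, far above the limit */
        assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
        return 0;
}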
>>>>
>>>> But I can't believe that corosync did not discard the ORF token when
>>>> instance->memb_state equals MEMB_STATE_GATHER (as soon as corosync
>>>> enters the GATHER state, orf_token_discard is set to 1).
>>>>
>>>> A reasonable guess is that the value of orf_token_discard was cleared
>>>> for some unknown reason -- maybe there is memory corruption.
>>>>
>>> possible, but I have another possibility I'd like you to try if you're willing.
>>>
>>> First, can you check the core file with gdb that is created when there
>>> is a crash in orf_token_rtr? What you will need to do is use up/down to
>>> find the stack frame that has "instance" in it, then print out instance:
>>>
>>> print *instance
>>
>> I forgot to print this info, it is a very good tip, thank you very much!
>>
>> I'm at home now, I will try it tomorrow.

The value of instance->orf_token_discard is 1 and there is no memory
corruption. The root cause of the corosync crash is that our code handles
join messages in net_deliver_fn.

>>> This will print out a bunch of stuff, one of which will be the
>>> orf_token_discard value. (This will tell us if there is indeed memory
>>> corruption or an error in our understanding of the code.)
>>>
>>> Second, run corosync-fplay. This will play back the flight recorder.
>>> One thing I definitely want to know is when corosync enters gather, and
>>> who called it. Just redirect the flight recorder output into a file and
>>> attach it to an email.
>>
>> I had run the corosync-fplay tool when I observed this issue. But our
>> cluster has been restarted since then, and the old fdata file has been
>> destroyed.
>>
>>> Third, I will send a patch which _may_ fix the problem. Also, which
>>> version of corosync do you have?
>>
>> Our code is based on corosync-v1.4.1, but I have backported the
>> necessary patches from the current master branch.
>
> Do you have the following commit in your repo? If not, all sorts of bad
> things will happen with your test scenarios:
>
> commit be608c050247e5f9c8266b8a0f9803cc0a3dc881
> Author: Steven Dake <sdake@xxxxxxxxxx>
> Date:   Tue Aug 30 22:25:21 2011 -0700
>
>     Ignore memb_join messages during flush operations
>
>     a memb_join operation that occurs during flushing can result in an
>     entry into the GATHER state from the RECOVERY state. This results in
>     the regular sort queue being used instead of the recovery sort queue,
>     resulting in segfault.

I have checked our code; it missed this important patch! Thanks so much!

>
>     Signed-off-by: Steven Dake <sdake@xxxxxxxxxx>
>     Reviewed-by: Jan Friesse <jfriesse@xxxxxxxxxx>
>     (cherry picked from commit 48ffa8892daac18935d96ae46a72aebe2fb70430)
>
> Regards
> -steve
>
>>> Regards
>>> -steve
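For context, a rough sketch of the kind of guard that commit message describes: drop membership-join handling while a flush is in progress, so a processor sitting in RECOVERY cannot be pushed back into GATHER mid-flush. The flag and function names here are hypothetical illustrations, not the actual corosync implementation:

/* Hypothetical names for illustration only. */
struct srp_instance_sketch {
        int flushing;    /* nonzero while queued messages are being flushed */
        int memb_state;  /* OPERATIONAL / GATHER / COMMIT / RECOVERY */
};

static void handle_memb_join_sketch (
        struct srp_instance_sketch *instance,
        const void *memb_join_msg)
{
        /*
         * A join delivered during a flush could move a processor from
         * RECOVERY back into GATHER, after which the regular sort queue
         * would be used instead of the recovery sort queue and the node
         * would segfault -- so the message is simply ignored here.
         */
        if (instance->flushing) {
                return;
        }

        /* ... normal join processing would continue here ... */
        (void)memb_join_msg;
}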
>>>>>>
>>>>>> 3 receive commit_token(token_seq:1) from 2
>>>>>>    => enter commit state
>>>>>>    => send commit_token(token_seq:2) to 4
>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>    => retransmit commit_token(token_seq:2) ...
>>>>>>    => retransmit commit_token(token_seq:2) ...
>>>>>>    => ...
>>>>>>    => commit token lost timeout occurs!!
>>>>>>    => enter GATHER state, multicast JoinMSG ...
>>>>>>
>>>>>> 4 receive commit_token(token_seq:2) from 3
>>>>>>    => enter commit state
>>>>>>    => send commit_token(token_seq:3) to 5
>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>    => retransmit commit_token(token_seq:3) to 5...
>>>>>>    => retransmit commit_token(token_seq:3) to 5...
>>>>>>    => ...
>>>>>>
>>>>>> 5 receive commit_token(token_seq:3) from 4
>>>>>>    => enter commit state
>>>>>>    => send commit_token(token_seq:4) to 1
>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>    => retransmit commit_token(token_seq:4) to 1...
>>>>>>    => retransmit commit_token(token_seq:4) to 1...
>>>>>>    => ...
>>>>>>
>>>>>> The key point in my test case is that one of the processors (3 in this
>>>>>> case) runs into a commit token timeout, while 1 receives the commit
>>>>>> token retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>>>
>>>>> Yunkai,
>>>>>
>>>>> I took a look at the code with a fresh set of eyes this morning for 3
>>>>> hours and don't see how it is possible for processor 2 to crash in
>>>>> orf_token_rtr with the above scenario. The orf token will be discarded
>>>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>>>> orf_token_rtr should *never* run in the GATHER or COMMIT states of the
>>>>> processor.
>>>>>
>>>>> (flatiron-1.4 line numbers):
>>>>>
>>>>> On startup:
>>>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>>>> totemsrp.c:4467 - when the network interface is detected, call
>>>>>                   gather_enter; state = gather, orf_token_discard = 1
>>>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>>>
>>>>> We can proceed to enter OPERATIONAL here or go directly to COMMIT.
>>>>>
>>>>> COMMIT path:
>>>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>>>                   (orf_token_discard still equals 1)
>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>>                   (orf_token_discard still equals 1)
>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>>                   equals 1)
>>>>>
>>>>> The only code that resets orf_token_discard to 0 is:
>>>>> totemsrp.c:2008 - entered recovery state
>>>>>
>>>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>>>> COMMIT:
>>>>>
>>>>> totemsrp.c:4276 - memb_state_gather_enter called
>>>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>>                   (orf_token_discard still equals 1)
>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>>                   equals 1)
>>>>>
>>>>> Given that the orf token would always be discarded by processor 2 in
>>>>> the test case you gave, I don't see a scenario in the code where
>>>>> orf_token_rtr could crash in that test case. If you can show me how
>>>>> this happens, we could fix this up. One possibility is some sort of
>>>>> thread race. Could you determine whether orf_token_discard is equal to
>>>>> 0 when your failure occurs?
>>>>>
>>>>> As for entering operational prematurely: if this is a problem, it is a
>>>>> separate problem from the above analysis. Let's get the above analysis
>>>>> fixed if it is a problem.
>>>>>
>>>>> Really looking forward to hearing back from you so we can fix the
>>>>> problem you see.
>>>>>
>>>>> Regards
>>>>> -steve
>>>>>
>>>>>>> Thanks
>>>>>>> -steve
>>>>>>>
>>>>>>>> range = orf_token->seq - instance->my_aru;
>>>>>>>> assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>>>> ...
>>>>>>>>
>>>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>>>> ---
>>>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>>>> index 5a78962..22717ac 100644
>>>>>>>> --- a/exec/totemsrp.c
>>>>>>>> +++ b/exec/totemsrp.c
>>>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>>>                 break;
>>>>>>>>
>>>>>>>>         case MEMB_STATE_RECOVERY:
>>>>>>>> -               if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>>>> +               if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>> +                       memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>>>                         log_printf (instance->totemsrp_log_level_debug,
>>>>>>>>                                 "Sending initial ORF token\n");
>>>>>>>>

--
Yunkai Zhang
Work at Taobao

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss