I would recommend checking your tree against the latest 1.4.z stream.  We
_only_ fix bugs in z streams.  Generally 1.4.z has fixes for all of the
problems we are aware of.

Regards
-steve

On 12/13/2011 12:07 AM, Yunkai Zhang wrote:
> On Tue, Dec 13, 2011 at 1:50 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>> On 12/12/2011 09:37 AM, Yunkai Zhang wrote:
>>> On Mon, Dec 12, 2011 at 11:13 PM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>> On 12/12/2011 02:13 AM, Yunkai Zhang wrote:
>>>>> On Mon, Dec 12, 2011 at 2:16 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>> On 12/10/2011 10:43 AM, Yunkai Zhang wrote:
>>>>>>> On Sun, Dec 11, 2011 at 12:11 AM, Steven Dake <sdake@xxxxxxxxxx> wrote:
>>>>>>>> On 12/10/2011 02:27 AM, Yunkai Zhang wrote:
>>>>>>>>> After the representative of the new ring entered the RECOVERY state,
>>>>>>>>> it was waiting for the commit_token of the second rotation.  But when
>>>>>>>>> it got a commit_token, it did not check the status of this token,
>>>>>>>>> which could be one retransmitted by the upstream node on the first
>>>>>>>>> rotation, so the representative node could send the initial ORF token
>>>>>>>>> prematurely.
>>>>>>>>>
>>>>>>>>> If other nodes which have not yet entered the RECOVERY state receive
>>>>>>>>> this incorrect initial orf_token (whose seq is 0), they may crash on
>>>>>>>>> the assert instruction in the *orf_token_rtr* function:
>>>>>>>>> ...
>>>>>>>>
>>>>>>>> Yunkai,
>>>>>>>>
>>>>>>>> First off, thanks for digging into Totem.  I know the protocol is hard
>>>>>>>> to understand and it's nice to have more people to bounce ideas off of.
>>>>>>>>
>>>>>>>> Do you have a test case for this?
>>>>>>>>
>>>>>>>> The patch is a clever solution, but the root cause seems incorrect.
>>>>>>>> I'd like to get to the RCA of your bug.  Just thinking about some of
>>>>>>>> the problems you have reported in the past (13 seconds to execute a
>>>>>>>> recovery copy per node), it could be possible for a retransmission of
>>>>>>>> the commit token from 5 to 1 (kicking off a new orf token) in the
>>>>>>>> example below to occur while in recovery.  That should be ok though,
>>>>>>>> since retransmitted orf tokens are discarded (totemsrp:3564).  I don't
>>>>>>>> see any immediate way to have a premature token transmission.
>>>>>>>>
>>>>>>>> Let's examine 5 processors:
>>>>>>>>
>>>>>>>> When in gather, a membership is built.  Once this membership is agreed
>>>>>>>> upon, a commit token is originated by the ring rep, creating the first
>>>>>>>> transmission around the ring.
>>>>>>>>
>>>>>>>> totemsrp.c:4044
>>>>>>>>
>>>>>>>> Example with 5 processors:
>>>>>>>> 1,2,3,4,5 in gather
>>>>>>>> 1 originates commit token
>>>>>>>>
>>>>>>>> Once a commit token is received in the gather state, the processor
>>>>>>>> enters commit (first token rotation).
>>>>>>>>
>>>>>>>> totemsrp.c:4348
>>>>>>>>
>>>>>>>> Example:
>>>>>>>> 1 goes to commit
>>>>>>>> 2 goes to commit
>>>>>>>> 3 goes to commit
>>>>>>>> 4 goes to commit
>>>>>>>> 5 goes to commit
>>>>>>>>
>>>>>>>> Once a commit token is received in the commit state, the processor
>>>>>>>> enters the recovery state, but only if this is a new commit token.
>>>>>>>>
>>>>>>>> totemsrp.c:4360
>>>>>>>>
>>>>>>>> 1 goes to recovery
>>>>>>>> 2 goes to recovery
>>>>>>>> 3 goes to recovery
>>>>>>>> 4 goes to recovery
>>>>>>>> 5 goes to recovery
>>>>>>>> 1 originates orf token (this is the start of the third rotation)
>>>>>>>>
>>>>>>>> totemsrp.c:4374
>>>>>>>>
>>>>>>>> In the above failure-free scenario, we see that we will never get a
>>>>>>>> premature orf token.
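
To make those two rotations a bit more concrete, here is a tiny standalone
model of the transitions described above.  This is deliberately *not*
totemsrp code, just an illustration: each processor moves GATHER -> COMMIT on
its first receipt of the commit token, COMMIT -> RECOVERY on the second, and
only then does the rep originate the orf token.

#include <stdio.h>

enum state { GATHER, COMMIT, RECOVERY };

int main (void)
{
	const char *name[] = { "GATHER", "COMMIT", "RECOVERY" };
	enum state st[5] = { GATHER, GATHER, GATHER, GATHER, GATHER };
	int rotation, i;

	/*
	 * First receipt of the commit token: GATHER -> COMMIT.
	 * Second receipt (a new commit token): COMMIT -> RECOVERY.
	 */
	for (rotation = 1; rotation <= 2; rotation++) {
		for (i = 0; i < 5; i++) {
			st[i] = (st[i] == GATHER) ? COMMIT : RECOVERY;
			printf ("rotation %d: processor %d enters %s\n",
				rotation, i + 1, name[st[i]]);
		}
	}

	/* Only after the second rotation does the rep originate the orf token. */
	printf ("processor 1 originates the initial orf token\n");
	return 0;
}
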
>>>>>>>>
>>>>>>>> Perhaps the problem you're running into is that you're mixing
>>>>>>>> partitioning/merging while forming a new ring.  This could result in
>>>>>>>> my_id.addr[0] and the ring rep matching, but *old* commit tokens not
>>>>>>>> being discarded by the ring rep in the recovery state.
>>>>>>>>
>>>>>>>> I believe the code would be better served by something like:
>>>>>>>>
>>>>>>>> if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>>     memb_commit_token->ring_id.seq == instance->my_ring_id_seq) {
>>>>>>>>         ...
>>>>>>>> }
>>>>>>>>
>>>>>>>> Note it may be safer to compare the entire ring_id rather than only
>>>>>>>> the seq.
>>>>>>>>
>>>>>>>> This will cause any commit token not from the current agreed
>>>>>>>> processors to be rejected.  The ring seq increases by 4 for each new
>>>>>>>> ring and it is not possible to have a matching seq + rep in two
>>>>>>>> separate rings (this is why we store the ring seq to stable storage).
>>>>>>>>
>>>>>>>> Could you test that, or tell me your test case, or both?
>>>>>>>>
>>>>>>> Let me discuss my test case based on your example mentioned above:
>>>>>>>
>>>>>>> 1,2,3,4,5 in gather
>>>>>>>
>>>>>>> 1 originates commit token
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:0) to 2
>>>>>>>    => receives commit_token(token_seq:4) from 5
>>>>>>>    => enters RECOVERY state
>>>>>>>    => receives commit_token(token_seq:4) retransmitted by 5
>>>>>>>    => enters OPERATIONAL state *prematurely*!!
>>>>>>>    => sends initial ORF token to 2 *prematurely*!!
>>>>>>>
>>>>>>> 2 receives commit_token(token_seq:0) from 1
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:1) to 3
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>>>    => ...
>>>>>>>    => receives JoinMSG from 3
>>>>>>>    => enters GATHER state, ...
>>>>>>>    => receives initial ORF token from 1
>>>>>>>    => crashes in the *orf_token_rtr* function!!
>>>>>
>>>>> Today I have checked my logging file carefully again, and the
>>>>> description above was not accurate.  What actually happened on node 2
>>>>> was:
>>>>>
>>>>> 2 receives commit_token(token_seq:0) from 1
>>>>>    => enters commit state
>>>>>    => sends commit_token(token_seq:1) to 3
>>>>>    => receives commit_token(token_seq:6) from 1
>>>>>    => enters RECOVERY state, **sets orf_token_discard=0**
>>>>>    => commit token retransmit timeout occurs!!
>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>    => retransmits commit_token(token_seq:1) ...
>>>>>    => ...
>>>>>    => receives JoinMSG from 3 in RECOVERY state
>>>>>    => calls memb_recovery_state_token_loss
>>>>>    => old_ring_state_restore, sets instance->my_aru=39
>>>>>    => memb_state_gather_enter (instance, 5), **sets orf_token_discard=1**
>>>>>    => enters GATHER state
>>>>>    => calls memb_state_gather_enter (instance, 14), **sets orf_token_discard=1**
>>>>>    => enters GATHER state
>>>>>    => receives initial ORF token from 1
>>>>>    => crashes in the *orf_token_rtr* function!!
>>>>>
>>>>> I can confirm that the value of instance->memb_state was
>>>>> MEMB_STATE_GATHER(=2) when 2 received the initial ORF token.
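
As a side note, this trace also shows why the crash is so abrupt: node 2
still holds the old ring's aru (0x39, restored by old_ring_state_restore)
when the premature initial orf token arrives with seq 0, so the unsigned
subtraction in orf_token_rtr wraps around and the assert fires.  A standalone
illustration with those values (the QUEUE_RTR_ITEMS_SIZE_MAX value below is
only illustrative; the real one is defined in totemsrp.c):

#include <assert.h>
#include <stdio.h>

/* Illustrative value only -- the real one is defined in totemsrp.c. */
#define QUEUE_RTR_ITEMS_SIZE_MAX 16384

int main (void)
{
	unsigned int my_aru = 0x39;	/* old-ring aru restored on node 2 */
	unsigned int seq    = 0;	/* seq of the premature initial orf token */

	/* Same arithmetic as orf_token_rtr(): unsigned subtraction wraps. */
	unsigned int range = seq - my_aru;

	printf ("range = %u\n", range);			/* 4294967239 */
	assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);	/* fires and aborts */
	return 0;
}
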
>>>>> I use the following code for logging in the *message_handler_orf_token*
>>>>> function:
>>>>>
>>>>> =code=
>>>>> log_printf (LOG_DEBUG, "recv orf_token from:%s, token_seq:%d, "
>>>>>     "my_token_seq:%d, aru:%x, my_aru:%x, seq:%x, state:%d",
>>>>>     nodeid_to_ip(token->header.nodeid, buf), token->token_seq,
>>>>>     instance->my_token_seq, token->aru, instance->my_aru, token->seq,
>>>>>     instance->memb_state);
>>>>>
>>>>> =log output=
>>>>> Dec 08 23:02:03 corosync [TOTEM ] totemsrp.c:3544 recv orf_token
>>>>> from:10.232.134.1, token_seq:0, my_token_seq:-1, aru:ffffffff,
>>>>> my_aru:39, seq:0, state:2
>>>>>
>>>>> It was the last log record written before corosync exited.
>>>>>
>>>>> But I can't believe that corosync did not discard the ORF token when
>>>>> instance->memb_state equals MEMB_STATE_GATHER (as soon as corosync
>>>>> enters the GATHER state, orf_token_discard is set to 1).
>>>>>
>>>>> A reasonable guess is that the value of orf_token_discard was cleared
>>>>> for some unknown reason -- maybe there is memory corruption.
>>>>
>>>> Possible, but I have another possibility I'd like you to try if you're
>>>> willing.
>>>>
>>>> First, can you check, with gdb, the core file that is created when there
>>>> is a crash in orf_token_rtr?  What you will need to do is use up/down to
>>>> find the stack frame that has "instance" in it, then print out instance:
>>>>
>>>> print *instance
>>>
>>> I forgot to print this info; it is a very good tip, thank you very much!
>>>
>>> I'm at home now, I will try it tomorrow.
>
> The value of instance->orf_token_discard is 1 and there is no memory
> corruption.  The root cause of this corosync crash is that our code
> handles join messages in net_deliver_fn.
>
>>>> This will print out a bunch of stuff, one of which will be the
>>>> orf_token_discard value.  (This will tell us if there is indeed memory
>>>> corruption or an error in our understanding of the code.)
>>>>
>>>> Second, run corosync-fplay.  This will play back the flight recorder.
>>>> One thing I definitely want to know is when corosync enters gather, and
>>>> who called it.  Just redirect the flight recorder output into a file and
>>>> attach it to an email.
>>>
>>> I had run the corosync-fplay tool when I observed this issue, but our
>>> cluster has been restarted again, and the old fdata file has been
>>> destroyed.
>>>
>>>> Third, I will send a patch which _may_ fix the problem.  Also, which
>>>> version of corosync do you have?
>>>
>>> Our code is based on corosync-v1.4.1, but I have backported the necessary
>>> patches from the current master branch.
>>
>> Do you have the following commit in your repo?  If not, all sorts of bad
>> things will happen with your test scenarios:
>>
>> commit be608c050247e5f9c8266b8a0f9803cc0a3dc881
>> Author: Steven Dake <sdake@xxxxxxxxxx>
>> Date:   Tue Aug 30 22:25:21 2011 -0700
>>
>>     Ignore memb_join messages during flush operations
>>
>>     a memb_join operation that occurs during flushing can result in an
>>     entry into the GATHER state from the RECOVERY state.  This results in
>>     the regular sort queue being used instead of the recovery sort queue,
>>     resulting in segfault.
>
> I have checked our code, and it is missing this important patch!
> Thanks so much!
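
For the archives, the core-file inspection I suggested above looks roughly
like this; the binary path, core file name and frame number are just
placeholders and will differ on your system:

$ gdb /usr/sbin/corosync core.<pid>
(gdb) bt                                # find a frame that has 'instance'
(gdb) frame 3                           # pick that frame; the number will differ
(gdb) print instance->orf_token_discard
(gdb) print *instance
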
>
>>     Signed-off-by: Steven Dake <sdake@xxxxxxxxxx>
>>     Reviewed-by: Jan Friesse <jfriesse@xxxxxxxxxx>
>>     (cherry picked from commit 48ffa8892daac18935d96ae46a72aebe2fb70430)
>>
>> Regards
>> -steve
>>
>>>> Regards
>>>> -steve
>>>>
>>>>>>> 3 receives commit_token(token_seq:1) from 2
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:2) to 4
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:2) ...
>>>>>>>    => retransmits commit_token(token_seq:2) ...
>>>>>>>    => ...
>>>>>>>    => commit token lost timeout occurs!!
>>>>>>>    => enters GATHER state, multicasts JoinMSG ...
>>>>>>>
>>>>>>> 4 receives commit_token(token_seq:2) from 3
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:3) to 5
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:3) to 5 ...
>>>>>>>    => retransmits commit_token(token_seq:3) to 5 ...
>>>>>>>    => ...
>>>>>>>
>>>>>>> 5 receives commit_token(token_seq:3) from 4
>>>>>>>    => enters commit state
>>>>>>>    => sends commit_token(token_seq:4) to 1
>>>>>>>    => commit token retransmit timeout occurs!!
>>>>>>>    => retransmits commit_token(token_seq:4) to 1 ...
>>>>>>>    => retransmits commit_token(token_seq:4) to 1 ...
>>>>>>>    => ...
>>>>>>>
>>>>>>> The key point in my test case is that one of the processors (3 in this
>>>>>>> case) runs into the commit token timeout, while 1 receives the commit
>>>>>>> token retransmitted by 5 and enters the OPERATIONAL state *prematurely*.
>>>>>>
>>>>>> Yunkai,
>>>>>>
>>>>>> I took a look at the code with a fresh set of eyes this morning for 3
>>>>>> hours and don't see how it is possible for processor 2 to crash in
>>>>>> orf_token_rtr with the above scenario.  The orf token will be discarded
>>>>>> in all cases in GATHER and COMMIT before executing orf_token_rtr.
>>>>>> orf_token_rtr should *never* run in the GATHER or COMMIT states of the
>>>>>> processor.
>>>>>>
>>>>>> (flatiron-1.4 line numbers)
>>>>>>
>>>>>> On startup:
>>>>>> totemsrp.c:657  - memb_state = OPERATIONAL
>>>>>> totemsrp.c:4467 - when the network interface is detected, gather_enter
>>>>>>                   is called; state = gather, orf_token_discard = 1
>>>>>> totemsrp.c:1870 - gather_enter sets orf_token_discard to 1
>>>>>>
>>>>>> We can proceed to enter OPERATIONAL here or go directly to COMMIT.
>>>>>>
>>>>>> COMMIT path:
>>>>>> totemsrp.c:4331 - COMMIT token received in GATHER state
>>>>>>                   (orf_token_discard still equals 1)
>>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>>>                   (orf_token_discard still equals 1)
>>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>>>                   equals 1)
>>>>>>
>>>>>> The only code that resets orf_token_discard to 0 is:
>>>>>> totemsrp.c:2008 - entered recovery state
>>>>>>
>>>>>> Analyzing the path where totem transitions from RECOVERY -> GATHER ->
>>>>>> COMMIT:
>>>>>>
>>>>>> totemsrp.c:4276 - memb_state_gather_enter called
>>>>>> totemsrp.c:1870 - gather enter sets orf_token_discard to 1
>>>>>> totemsrp.c:4332 - join message received in COMMIT state - enter GATHER
>>>>>>                   (orf_token_discard still equals 1)
>>>>>> totemsrp.c:3428 - orf token discarded (because orf_token_discard still
>>>>>>                   equals 1)
>>>>>>
>>>>>> Given that the orf token would always be discarded by processor 2 in
>>>>>> the test case you gave, I don't see a scenario in the code where
>>>>>> orf_token_rtr could crash.  If you can show me how this happens, we can
>>>>>> fix this up.  One possibility is some sort of thread race.
>>>>>> Could you determine whether orf_token_discard is equal to 0 when your
>>>>>> failure occurs?
>>>>>>
>>>>>> As for entering operational prematurely: if this is a problem, it is a
>>>>>> separate problem from the above analysis.  Let's get the above analysis
>>>>>> fixed if it is a problem.
>>>>>>
>>>>>> Really looking forward to hearing back from you so we can fix the
>>>>>> problem you see.
>>>>>>
>>>>>> Regards
>>>>>> -steve
>>>>>>
>>>>>>>> Thanks
>>>>>>>> -steve
>>>>>>>>
>>>>>>>>>     range = orf_token->seq - instance->my_aru;
>>>>>>>>>     assert (range < QUEUE_RTR_ITEMS_SIZE_MAX);
>>>>>>>>>     ...
>>>>>>>>>
>>>>>>>>> Signed-off-by: Yunkai Zhang <qiushu.zyk@xxxxxxxxxx>
>>>>>>>>> ---
>>>>>>>>>  exec/totemsrp.c |    3 ++-
>>>>>>>>>  1 files changed, 2 insertions(+), 1 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/exec/totemsrp.c b/exec/totemsrp.c
>>>>>>>>> index 5a78962..22717ac 100644
>>>>>>>>> --- a/exec/totemsrp.c
>>>>>>>>> +++ b/exec/totemsrp.c
>>>>>>>>> @@ -4372,7 +4372,8 @@ static int message_handler_memb_commit_token (
>>>>>>>>>  		break;
>>>>>>>>>
>>>>>>>>>  	case MEMB_STATE_RECOVERY:
>>>>>>>>> -		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep)) {
>>>>>>>>> +		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
>>>>>>>>> +			memb_commit_token->token_seq == 2*memb_commit_token->addr_entries) {
>>>>>>>>>  			log_printf (instance->totemsrp_log_level_debug,
>>>>>>>>>  				"Sending initial ORF token\n");
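
As a closing note, the alternative I suggested earlier in the thread --
gating the initial ORF token on the commit token carrying the ring_id of the
ring currently being formed, rather than on token_seq arithmetic -- would
look roughly like the sketch below.  It is untested, written against the
flatiron-1.4 layout of struct memb_ring_id (rep + seq), and assumes
my_ring_id has already been switched to the new ring by the time the
processor is in MEMB_STATE_RECOVERY:

	case MEMB_STATE_RECOVERY:
		/*
		 * Only the ring rep may send the initial ORF token, and only in
		 * response to a commit token for the ring currently being formed;
		 * stale commit tokens from an older membership are ignored.
		 */
		if (totemip_equal (&instance->my_id.addr[0], &instance->my_ring_id.rep) &&
		    memcmp (&memb_commit_token->ring_id, &instance->my_ring_id,
		            sizeof (struct memb_ring_id)) == 0) {

			log_printf (instance->totemsrp_log_level_debug,
				"Sending initial ORF token\n");
			/* ... rest unchanged from the current code ... */
		}
		break;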