Jiaju, Lidong, thanks for the patch and the description. ACK from me, and I've also committed it to svn.

Regards,
  Honza

Jiaju Zhang wrote:
> Hi Jan,
>
> I can provide some background on this patch.
>
> The root cause of this issue is that if two membership changes come too
> close together, the second membership change can arrive before all the
> callbacks initiated by the first one have finished executing. Since some
> internal state of those callbacks has not yet been reset to its initial
> value, re-entering them causes problems.
>
> The problematic callback here is ckpt_confchg_fn in the openais CKPT
> service, and the internal state in question is my_sync_state. If another
> membership change happens before my_sync_state goes back to
> SYNC_STATE_NOT_STARTED, the new callback is wrongly rejected and never
> executes again. As a result, the CKPT service cannot sync normally, the
> reference count of a particular checkpoint ends up incorrect in the
> cluster, and the checkpoint gets deleted. This is why the user of the
> CKPT service, ocfs2_controld, sees "Object does not exist" for a
> checkpoint that should exist.
>
> The solution in the patch is to check the ringid first: if the ringid
> differs, this is a new membership change, so we are not re-entering an
> ongoing ckpt_confchg_fn but need to start a new one.
>
> The second part of the patch worth mentioning is the change of when
> my_saved_ring_id is assigned. While the totem configuration type is
> still TOTEM_CONFIGURATION_TRANSITIONAL, it does not seem right to
> assign a new ringid value to my_saved_ring_id, since the old
> configuration change has not finished yet.
>
> Since this is a timing-dependent issue, the bug does not reproduce every
> time you stop/start the whole cluster stack. With the testing script
> below, however, it reproduces reliably within 300 loop iterations in
> our environment.
>
> Thanks,
> Jiaju
>
> On Fri, 2013-04-26 at 12:20 +0200, Jan Friesse wrote:
>> Lidong,
>> thanks for the patch. Can you please send me your analysis? I would
>> really like to understand the root cause and how this patch helps.
>>
>> Regards,
>>   Honza
>>
>> Lidong Zhong wrote:
>>> Hi,
>>> We built a cluster consisting of three nodes and started/stopped one
>>> of these nodes repeatedly. The test script looks like this:
>>>
>>> #!/bin/sh
>>>
>>> LOOP_COUNT=1000
>>>
>>> while [ $LOOP_COUNT -gt 0 ];
>>> do
>>>     let "LOOP_COUNT-=1"
>>>     echo "test No. $((1000-LOOP_COUNT))"
>>>     rcopenais start
>>>     sleep 30
>>>     rcopenais stop
>>>     sleep 10
>>> done
>>>
>>> The error log looks like this:
>>>
>>> Apr  3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
>>> Apr  3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
>>>
>>> Several occurrences of this error eventually lead to the node being
>>> fenced. After some analysis, we think there is a race condition
>>> between corosync and the openais CKPT service, so we formed a patch
>>> that avoids this problem effectively. The patch is attached below.
>>> Any review is highly appreciated.
>>> Thanks
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss