Hi Jan,

I can provide some background on this patch.

The root cause of the issue is that two membership changes arrive too close
together: the second membership change arrives before all the callbacks
initiated by the first one have finished executing. Because some internal
state used by those callbacks has not yet been reset to its initial value,
re-entering them causes problems.

The problematic callback here is ckpt_confchg_fn in the openais CKPT service,
and the internal state in question is my_sync_state. If another membership
change happens before my_sync_state has gone back to SYNC_STATE_NOT_STARTED,
the new callback is wrongly rejected and never executes again. As a result,
the CKPT service cannot synchronize normally, the reference counting of a
particular checkpoint becomes incorrect across the cluster, and that
checkpoint ends up being deleted. That is why the user of the CKPT service -
ocfs2_controld - sees "Object does not exist" for a checkpoint that should
exist.

The fix in the patch is to check the ring id first: if the ring id differs
from the saved one, this is a new membership change, so we are not
re-entering an ongoing ckpt_confchg_fn but genuinely need to start a new one.

The second part of the patch worth mentioning is the change to when
my_saved_ring_id is assigned. While the totem configuration type is still
TOTEM_CONFIGURATION_TRANSITIONAL, assigning a new ring id to
my_saved_ring_id is not quite right, because the previous configuration
change has not finished yet.

Since this is a race, the bug does not reproduce every time the whole
cluster stack is stopped and started. With the test script quoted below,
however, it reproduces reliably within 300 loop iterations in our
environment.

Thanks,
Jiaju

On Fri, 2013-04-26 at 12:20 +0200, Jan Friesse wrote:
> Lidong,
> thanks for the patch. Can you please send me your analysis? I would really
> like to understand the root cause, so that I can see how this patch helps.
>
> Regards,
>   Honza
>
> Lidong Zhong napsal(a):
> > Hi,
> > We built a cluster consisting of three nodes and started/stopped one of
> > these nodes repeatedly. The test script looks like this:
> >
> > #!/bin/sh
> >
> > LOOP_COUNT=1000
> >
> > while [ $LOOP_COUNT -gt 0 ];
> > do
> >     let "LOOP_COUNT-=1"
> >     echo "test No. $((1000-LOOP_COUNT))"
> >     rcopenais start
> >     sleep 30
> >     rcopenais stop
> >     sleep 10
> > done
> >
> > The error log looks like:
> >
> > Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
> > Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
> >
> > Some time after this error first appears, the node ends up being fenced.
> > After some analysis, we think there is a race condition between corosync
> > and the openais CKPT service, so we formed a patch which avoids this
> > problem effectively.
> > The patch is attached below. Any review is highly appreciated.
> > Thanks
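
P.S. To make the explanation above easier to follow without the patch in
hand, here is a rough, self-contained sketch of the idea in C. The signature
and types are simplified stand-ins and the helper start_ckpt_sync() is
illustrative only; this is not the actual openais exec/ckpt.c code or the
exact diff.

    #include <string.h>

    /* simplified stand-ins for the real totem/openais types */
    struct memb_ring_id {
            unsigned int rep;
            unsigned long long seq;
    };

    enum sync_state {
            SYNC_STATE_NOT_STARTED,
            SYNC_STATE_STARTED
    };

    enum totem_configuration_type {
            TOTEM_CONFIGURATION_REGULAR,
            TOTEM_CONFIGURATION_TRANSITIONAL
    };

    static enum sync_state my_sync_state = SYNC_STATE_NOT_STARTED;
    static struct memb_ring_id my_saved_ring_id;

    static void start_ckpt_sync(void)
    {
            /* illustrative stand-in for kicking off checkpoint sync */
            my_sync_state = SYNC_STATE_STARTED;
    }

    void ckpt_confchg_fn(enum totem_configuration_type configuration_type,
                         const struct memb_ring_id *ring_id)
    {
            /*
             * A ring id different from the saved one means a brand new
             * membership change, not a re-entry of the one still in
             * progress, so the sync state machine must be restarted even
             * though my_sync_state has not gone back to
             * SYNC_STATE_NOT_STARTED yet.
             */
            if (memcmp(ring_id, &my_saved_ring_id, sizeof (*ring_id)) != 0)
                    my_sync_state = SYNC_STATE_NOT_STARTED;

            if (my_sync_state != SYNC_STATE_NOT_STARTED)
                    return; /* same ring: genuinely re-entered, ignore */

            /*
             * Save the ring id only once the configuration is regular;
             * while it is still TOTEM_CONFIGURATION_TRANSITIONAL the
             * previous configuration change has not finished, so the old
             * saved value is kept.
             */
            if (configuration_type != TOTEM_CONFIGURATION_TRANSITIONAL) {
                    my_saved_ring_id = *ring_id;
                    start_ckpt_sync();
            }
    }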