Jiaju, Lidong, thanks for the patch and the description. ACK from me, and I've also committed it to svn.

Regards,
  Honza

Jiaju Zhang wrote:
> Hi Jan,
>
> I can provide some background on this patch.
>
> The root cause of this issue is that if two membership changes come too
> close together, the second membership change can arrive before all the
> callbacks initiated by the first one have finished executing. Since some
> internal state of those callbacks has not yet been reset to its initial
> value, re-entering them causes problems.
>
> The problematic callback here is ckpt_confchg_fn in the openais CKPT
> service, and the internal state in question is my_sync_state. If another
> membership change happens before my_sync_state goes back to
> SYNC_STATE_NOT_STARTED, the new callback is wrongly rejected and never
> executes again. As a result, the CKPT service cannot sync normally, the
> reference count of a particular checkpoint ends up incorrect in the
> cluster, and the checkpoint gets deleted. This is why the user of the
> CKPT service, ocfs2_controld, sees "Object does not exist" for a
> checkpoint that should exist.
>
> The solution in the patch is to check the ringid first: if the ringid
> differs, this is a new membership change, so we are not re-entering an
> ongoing ckpt_confchg_fn but need to start a new one.
>
> The second part of the patch worth mentioning is the change of when
> my_saved_ring_id is assigned. While the totem configuration type is
> still TOTEM_CONFIGURATION_TRANSITIONAL, it does not seem right to
> assign a new ringid value to my_saved_ring_id, since the old
> configuration change has not finished yet.
>
> Since this is a timing-dependent issue, the bug does not reproduce every
> time you stop/start the whole cluster stack. With the testing script
> below, however, it reproduces reliably within 300 loop iterations in
> our environment.
>
> Thanks,
> Jiaju
>
> On Fri, 2013-04-26 at 12:20 +0200, Jan Friesse wrote:
>> Lidong,
>> thanks for the patch. Can you please send me your analysis? I would
>> really like to understand the root cause and how this patch helps.
>>
>> Regards,
>>   Honza
>>
>> Lidong Zhong wrote:
>>> Hi,
>>> We built a cluster consisting of three nodes and started/stopped one
>>> of these nodes repeatedly. The test script looks like this:
>>>
>>> #!/bin/sh
>>>
>>> LOOP_COUNT=1000
>>>
>>> while [ $LOOP_COUNT -gt 0 ];
>>> do
>>>     let "LOOP_COUNT-=1"
>>>     echo "test No. $((1000-LOOP_COUNT))"
>>>     rcopenais start
>>>     sleep 30
>>>     rcopenais stop
>>>     sleep 10
>>> done
>>>
>>> The error log looks like this:
>>>
>>> Apr  3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
>>> Apr  3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
>>>
>>> Several occurrences of this error eventually lead to the node being
>>> fenced. After some analysis, we think there is a race condition
>>> between corosync and the openais CKPT service, so we formed a patch
>>> that avoids this problem effectively. The patch is attached below.
>>> Any review is highly appreciated.
>>> Thanks
>>>
>>> _______________________________________________
>>> discuss mailing list
>>> discuss@xxxxxxxxxxxx
>>> http://lists.corosync.org/mailman/listinfo/discuss