Hi Jan,

I can provide some background on this patch.

The root cause of the issue is that two membership changes arrive too close
together: the second membership change arrives before all the callbacks
initiated by the first one have finished executing. Because some internal
state used by those callbacks has not yet been reset to its initial value,
re-entering them causes problems.

The problematic callback here is ckpt_confchg_fn in the openais CKPT service,
and the internal state in question is my_sync_state. If another membership
change happens before my_sync_state has gone back to SYNC_STATE_NOT_STARTED,
the new callback is wrongly rejected and never executes again. As a result,
the CKPT service cannot synchronize normally, the reference counting of a
particular checkpoint becomes incorrect across the cluster, and that
checkpoint ends up being deleted. That is why the user of the CKPT service -
ocfs2_controld - sees "Object does not exist" for a checkpoint that should
exist.

The fix in the patch is to check the ring id first: if the ring id differs
from the saved one, this is a new membership change, so we are not
re-entering an ongoing ckpt_confchg_fn but genuinely need to start a new one.

The second part of the patch worth mentioning is the change to when
my_saved_ring_id is assigned. While the totem configuration type is still
TOTEM_CONFIGURATION_TRANSITIONAL, assigning a new ring id to
my_saved_ring_id is not quite right, because the previous configuration
change has not finished yet.

Since this is a race, the bug does not reproduce every time the whole
cluster stack is stopped and started. With the test script quoted below,
however, it reproduces reliably within 300 loop iterations in our
environment.

Thanks,
Jiaju

On Fri, 2013-04-26 at 12:20 +0200, Jan Friesse wrote:
> Lidong,
> thanks for the patch. Can you please send me your analysis? I would really
> like to understand the root cause, so that I can see how this patch helps.
>
> Regards,
>   Honza
>
> Lidong Zhong napsal(a):
> > Hi,
> > We built a cluster consisting of three nodes and started/stopped one of
> > these nodes repeatedly. The test script looks like this:
> >
> > #!/bin/sh
> >
> > LOOP_COUNT=1000
> >
> > while [ $LOOP_COUNT -gt 0 ];
> > do
> >     let "LOOP_COUNT-=1"
> >     echo "test No. $((1000-LOOP_COUNT))"
> >     rcopenais start
> >     sleep 30
> >     rcopenais stop
> >     sleep 10
> > done
> >
> > The error log looks like:
> >
> > Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
> > Apr 3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
> >
> > Some time after this error first appears, the node ends up being fenced.
> > After some analysis, we think there is a race condition between corosync
> > and the openais CKPT service, so we formed a patch which avoids this
> > problem effectively.
> > The patch is attached below. Any review is highly appreciated.
> > Thanks
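
P.S. To make the explanation above easier to follow without the patch in
hand, here is a rough, self-contained sketch of the idea in C. The signature
and types are simplified stand-ins and the helper start_ckpt_sync() is
illustrative only; this is not the actual openais exec/ckpt.c code or the
exact diff.

    #include <string.h>

    /* simplified stand-ins for the real totem/openais types */
    struct memb_ring_id {
            unsigned int rep;
            unsigned long long seq;
    };

    enum sync_state {
            SYNC_STATE_NOT_STARTED,
            SYNC_STATE_STARTED
    };

    enum totem_configuration_type {
            TOTEM_CONFIGURATION_REGULAR,
            TOTEM_CONFIGURATION_TRANSITIONAL
    };

    static enum sync_state my_sync_state = SYNC_STATE_NOT_STARTED;
    static struct memb_ring_id my_saved_ring_id;

    static void start_ckpt_sync(void)
    {
            /* illustrative stand-in for kicking off checkpoint sync */
            my_sync_state = SYNC_STATE_STARTED;
    }

    void ckpt_confchg_fn(enum totem_configuration_type configuration_type,
                         const struct memb_ring_id *ring_id)
    {
            /*
             * A ring id different from the saved one means a brand new
             * membership change, not a re-entry of the one still in
             * progress, so the sync state machine must be restarted even
             * though my_sync_state has not gone back to
             * SYNC_STATE_NOT_STARTED yet.
             */
            if (memcmp(ring_id, &my_saved_ring_id, sizeof (*ring_id)) != 0)
                    my_sync_state = SYNC_STATE_NOT_STARTED;

            if (my_sync_state != SYNC_STATE_NOT_STARTED)
                    return; /* same ring: genuinely re-entered, ignore */

            /*
             * Save the ring id only once the configuration is regular;
             * while it is still TOTEM_CONFIGURATION_TRANSITIONAL the
             * previous configuration change has not finished, so the old
             * saved value is kept.
             */
            if (configuration_type != TOTEM_CONFIGURATION_TRANSITIONAL) {
                    my_saved_ring_id = *ring_id;
                    start_ckpt_sync();
            }
    }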