Bug when starting/stopping openais repeatedly

"Lidong Zhong" <lzhong@xxxxxxxx> · Wed, 24 Apr 2013 20:10:25 -0600

Hi,
   We built a three-node cluster and repeatedly started and stopped openais on one of the nodes. The test script
is shown below:
#!/bin/sh

LOOP_COUNT=1000

while [ "$LOOP_COUNT" -gt 0 ]
do
    # POSIX arithmetic; `let` is a bash builtin and may be missing from other /bin/sh shells
    LOOP_COUNT=$((LOOP_COUNT - 1))
    echo "test No. $((1000 - LOOP_COUNT))"
    rcopenais start
    sleep 30
    rcopenais stop
    sleep 10
done

The error log looks like this:
Apr  3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
Apr  3 11:35:56 hex-3 ocfs2_controld[3623]: Unable to open checkpoint "ocfs2:controld": Object does not exist
After this error has appeared several times, the node ends up being fenced.
After some analysis, we think there is a race condition between corosync and the openais CKPT service, so we
wrote a patch that effectively avoids the problem.
The patch is attached below. Any review is highly appreciated.
Thanks





Index: openais-1.1.4/services/ckpt.c
===================================================================

--- openais-1.1.4.orig/services/ckpt.c
+++ openais-1.1.4/services/ckpt.c
@@ -776,14 +776,17 @@ static void ckpt_confchg_fn (
 	unsigned int i, j;
 	unsigned int lowest_nodeid;
 
+    if (!memcmp (&my_saved_ring_id, ring_id,sizeof (struct memb_ring_id))) {
+         if (my_sync_state != SYNC_STATE_NOT_STARTED) {
+                 return;
+         }
+	}
+    if (configuration_type != TOTEM_CONFIGURATION_REGULAR) {
+            return;
+    }
+
 	memcpy (&my_saved_ring_id, ring_id,
 		sizeof (struct memb_ring_id));
-       if (configuration_type != TOTEM_CONFIGURATION_REGULAR) {
-                return;
-        }
-        if (my_sync_state != SYNC_STATE_NOT_STARTED) {
-                return;
-        }
 
 	my_sync_state = SYNC_STATE_STARTED;
 